<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">

    <channel>

    <title>The Hardball Times -- Peter Jensen</title>
    <link>http://www.hardballtimes.com/main</link>
    <description>Baseball. Insight. Daily.</description>
    <dc:language>en</dc:language>
    <dc:creator>studes@hardballtimes.com</dc:creator>
    <dc:rights>Copyright 2013</dc:rights>
    <dc:date>2013-05-17T08:57:15+00:00</dc:date>
    <admin:generatorAgent rdf:resource="http://www.pmachine.com/" />


    <item>
      <title>2010 DIRVA_PLUS pitching runs</title>
       
<link>http://www.hardballtimes.com/main/blog_article/2010&#45;dirva_plus&#45;pitching&#45;runs/</link>

<guid>http://www.hardballtimes.com/main/blog_article/2010-dirva_plus-pitching-runs/#When:19:41:15</guid>
       
<description><![CDATA[<br /><br /><a href="http://www.hardballtimes.com/main/downloads/" target="new">Click here</a> to learn about THT's download subscriptions.]]>

</description>
      <dc:creator>Peter Jensen</dc:creator>
      <dc:date>2010-11-19T19:41:15+00:00</dc:date>

    </item>

    <item>
      <title>Yet another pitching metric</title>
       
<link>http://www.hardballtimes.com/main/article/yet&#45;another&#45;pitching&#45;metric/</link>
<guid>http://www.hardballtimes.com/main/article/yet-another-pitching-metric/#When:08:34:15</guid>       
<description><![CDATA[It’s been quite an offseason for pitching metrics.  First, there was the controversy over the 2009 Cy Young award voting.  Then there were some <a href="http://www.insidethebook.com/ee/index.php/site/comments/mike_silva_chronicles_part_4_fip/" target="new">interesting discussions over the Christmas holidays at The Book Blog about FIP</a> in response to the 10 questions asked by Mike Silva.  Followed by Baseball Prospectus’ announcement of a new pitching metric called SIERA.  Followed by even more discussions about SIERA and pitching metrics in general at both The Book Blog and BP.<br />
<br />
If you have bothered to read this far you have probably followed at least some, if not all, of this.  So why should I clutter the landscape with yet another pitching metric when there are already DIPS, LIPS, FIP, xFIP, RA+, tRA, tRA*, PZR, SIERA, and others crowding for your attention?  Well, there seems to be a need for one, because I think we may have taken a wrong turn when DIPS was introduced.<br />
<br />
Don’t get me wrong, <a href="http://www.baseballprospectus.com/article.php?articleid=878" target="new">Voros McCracken’s idea</a> that we know almost everything we need to know to predict a pitcher’s future performance by actually leaving out information and considering only those events over which the pitcher has complete control was inspired.  The wrong turn to which I referred was the assumption that predicting future ERA was how we should measure the success of the new metrics.<br />
<br />
Why in the world would we want to predict a pitching statistic that we know is heavily influenced by both the quality of the pitching team's defense and luck?  Isn't the pitcher's "true talent" what we really want to predict?  It's as if, when James introduced Runs Created and Palmer and Thorn introduced Batting Runs, we had tested to find which was better by seeing how well they predicted RBIs.  Of course, we did something equally stupid and tested to see how well they predicted team runs, but we’ll leave that rant for another post.<br />
<br />
The single inspiration for beginning to work on a new metric came from <a href="http://www.insidethebook.com/ee/index.php/site/comments/mike_silva_chronicles_part_4_fip/#25" target="new">this comment by Matthew Cornwell</a> at The Book Blog in response to Mike Silva’s question about FIP:<br />
<blockquote>Which "leaves out" more?<br />
<br />
ERA - defensive support, leveraging/quality of batters faced, park factors, bullpen support, the pitcher’s responsibility regarding unearned runs <br />
FIP - event timing/sit. splits/LOB%, etc., what BABIP skill does exist, pitcher defense, DP inducing, HBP, WP, leveraging/quality of batters faced, park factors, XBH prevention, pick-offs, controlling running game.<br />
<br />
Most pitchers can’t prevent enough runs by controlling the running game or limiting doubles or defending their position well in any given season to make a huge difference in their FIP or ERA.  That is why FIP works so well at a seasonal level - it leaves in what pitchers control the most, and as a fair trade-off for most pitchers (the Glavine’s being examples,) takes out what is least impactful and controllable.  However, over 15-20 seasons, those secondary run prevention tools add up to be tons of runs for many pitchers.<br />
<br />
Take RA+ - if you could just adjust for defensive support you should get pretty close to “true” RA+ for long career guys.  BABIP and HR/FB have had enough PA’s to stabilize, leveraging and quality of batters faced is not a huge factor for modern starters, park is considered already, and bullpen support tends to be a smallish factor for most pitchers over long careers.  Outside of defensive support, what else would dramatically skew a long- tenured pitcher’s “real” RA+ level? <br />
<br />
I guess my point is, given a very long career, some defensive-adjusted RA+ would be better than FIP or ERA.  And then use FIP for future performance and evaluating pitchers with only a handful of seasons under their belt.  FIP definitely is very useful.  Like many have said, it does what it is intended to do.</blockquote>Matthew’s comment seemed to concisely express the unease that many were feeling.  Were existing metrics leaving out real skills that some pitchers possess and that are important to success?  This has been a common complaint about all the DIPS-related metrics since DIPS was introduced, and seems to stem from the confusion between metrics that are primarily descriptive and those that are primarily predictive.<br />
<br />
As I read Matthew’s list of complaints for FIP above, I realized that one of my favorite methodologies could be fashioned into a pitching metric that would correct some of the deficiencies that he identified.  Such a metric would be close to the ideal descriptive metric and also might lead to new ideas for improving predictive metrics.<br />
<br />
That methodology is RVA, or Run Value Added&mdash;you may be more familiar with its FanGraphs name, RE24.  Run Value Added was introduced by Gary Skoog in an article in the 1987 Bill James Baseball Abstract.<br />
<br />
The concept is simple.  Take the run value (from the RE table) for the baseout state that exists at the beginning of a play, and subtract it from the run value at the end of the play, plus the number of runs that scored on the play.  For batters this is the best descriptive metric of the run value that he adds during a PA.  It would be very close to perfect if it wasn’t for the problem of apportioning the value of extra bases taken by the runner.<br />
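<br />
As a rough sketch of the calculation just described (not the author's actual code), the Python below computes RVA for a single play.  The run-expectancy numbers in RE_TABLE are made-up placeholders for illustration, and the names RE_TABLE and rva are hypothetical; a real table would be built from league play-by-play data.<br />

```python
# Hypothetical run-expectancy table keyed by (bases occupied, outs).
# The values are illustrative placeholders, not real league figures.
RE_TABLE = {
    ("---", 0): 0.50,
    ("1--", 0): 0.90,
    ("---", 1): 0.27,
    ("1--", 1): 0.53,
}


def rva(start_state, end_state, runs_scored):
    """Run Value Added for one play: RE(end) + runs scored - RE(start).

    Pass end_state=None when the play ends the inning (run expectancy 0).
    """
    re_end = 0.0 if end_state is None else RE_TABLE[end_state]
    return re_end + runs_scored - RE_TABLE[start_state]


# Leadoff single: bases empty, 0 out -> runner on first, 0 out.
single = rva(("---", 0), ("1--", 0), 0)
# Leadoff out: bases empty, 0 out -> bases empty, 1 out.
out = rva(("---", 0), ("---", 1), 0)
```

With these placeholder values the single is worth the gain in run expectancy and the out is worth the loss; summing rva over every play of a batter's season would give his RVA total.<br />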
<br />
As a predictive metric RVA is not as good as Linear Weights (which is simply the league-average RVA for an event) because there is little indication that batters are able to control their hitting by baseout state.<br />
<br />
RVA has never been used as a pitching metric because it zeros out for each inning.  Actually it doesn’t zero out at zero, but each inning’s RVA is simply the number of runs scored in the inning minus the league average runs scored per inning.  Therefore, using runs allowed by a pitcher is much simpler and just as accurate.<br />
<br />
But if we separate the RVA that occurs on the DIPS events -- HRs, Ks, NIBBs, and HBPs -- from the RVA on the non-DIPS events (all non-HR hit balls except ROEs and safe FCs), and then process each group separately, we could have a very good descriptive pitching metric.  And that would be DIRVA+.<br />
<br />
DIRVA without the plus is an acronym for Defense Independent RVA, and is the pitcher’s RVA for the DIPS events, minus what an average pitcher would be expected to have totaled for DIPS events RVA for the same number of innings.  The plus in DIRVA+ is the run value of the non-DIPS events, minus his team’s average run value for those events for the year.  This basically subtracts out the value that the defense adds and compares the run value the pitcher adds to what an average pitcher on his team adds.  <br />
<br />
Is this a perfect measure of a pitcher’s control over his hit balls?  No, for several reasons.  The team’s pitching staff as a whole may be better at preventing runs on hit balls than the league average.  But DIPS theory says that no individual pitcher has much control over his hit balls, so in the aggregate a staff of pitchers should have a variance in this ability even closer to zero.<br />
<br />
A more problematic concern is that a particular pitcher’s hit-ball location distribution may vary from the staff average, and either have more balls hit to the better defenders or the poorer defenders.  This problem is correctable, but only with accurate hit-ball location data and much programming.  I opted for the less accurate but simpler method that could be used for datasets without any hit ball location data.<br />
<br />
So what does DIRVA+ do that other pitching metrics don’t?  Other metrics use Linear Weights to calculate the run values of events.  Using RVA has the advantage of including the sequencing of events.  If a pitcher pitches better or worse with men on base, DIRVA+ will show it where other metrics would not.  If a pitcher has a high LOB or induces more-than-average DPs or lower-than-average XBHs, DIRVA+ will also show it.<br />
<br />
Of course, there is a lot of luck in a single year’s LOB, or RISP average, or DP rate, or XBHs. But this is a descriptive stat, so including the luck is OK.  DIRVA+ also includes a measure of whether a pitcher is better than average at preventing runs on balls in play independent of his team’s defense.  Some believe this is luck, others believe this is a pitcher skill. DIRVA+ doesn’t care because it is a descriptive statistic and both luck and skill count.<br />
<br />
Looking at Matthew’s lists above, what doesn’t DIRVA+ do?  It’s a pitching stat, so it leaves pitcher defensive value to the defensive metrics.  Wild Pitches, pickoffs, pickoff errors, and balks are included.  Steals and passed balls are not, because the catcher has a significant portion of responsibility for them.  DIRVA+ is not adjusted for parks since it is a descriptive stat.  The starting pitcher’s RVA is completely independent of his bullpen support.  The numbers that I will be presenting don’t adjust for the quality of the hitters.<br />
<br />
On the whole DIRVA+ satisfies most of Matthew’s requirements, but not all of them.  The only thing left to do is show you who the stat thinks were the best pitchers in 2009.<br />
<br />
And the results are:<br />
<pre>Top Ten DIRVA+ Starting Pitchers 2009
Pitcher                   Innings   Exp Runs   DIRVA Runs  Hit Ball Runs  DIRVA+
1. Zack Greinke              229       15.7        -41.6         -6.6     -63.9
2. Roy Halladay              239       16.4        -18.1        -11.0     -45.5
3. Adam Wainwright           233       16.0        -20.9         -7.9     -44.8
4. Tim Lincecum              225       15.5        -30.6          2.8     -43.3
5. Jair Jurrjens             215       14.8         -2.6        -25.8     -43.3
6. Chris Carpenter           192       13.2        -18.1        -10.5     -41.8
7. Javier Vazquez            219       15.1        -22.0         -3.9     -41.0
8. Dan Haren                 229       15.7        -11.2        -14.0     -40.9
9. Wandy Rodriguez           205       14.1         -2.5        -16.3     -32.9
10. Ubaldo Jimenez           218       15.0         -8.9         -8.5     -32.4</pre>Looks about right.  Just to confuse you, I am sticking to the convention that I used in my BZM defensive metric, and I am showing runs saved by any defensive team member as a negative number.  So Greinke’s -63.9 runs DIRVA+ total is really, really good and not really, really bad.<br />
<br />
These are actual runs saved over the course of the season, and they can be converted to wins at the same approximate 10-runs-per-win rate used for offensive players.  The innings numbers shown in the table are innings as a starting pitcher, with the fractional portion lopped off.  Exp Runs is the number of DIRVA runs a league-average pitcher would have for the same number of innings. DIRVA runs and Hit Ball runs are calculated by the methodology that I have described above, and DIRVA+ is just the sum of DIRVA runs and Hit Ball runs minus the Exp Runs.<br />
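<br />
The arithmetic behind the DIRVA+ column can be checked with a few lines of Python.  This is only a sketch: the function names are hypothetical, and flipping the sign in the wins conversion (so that runs saved become positive wins) is my assumption about how the 10-runs-per-win rate would be applied.<br />

```python
RUNS_PER_WIN = 10.0  # approximate rate mentioned in the text


def dirva_plus(dirva_runs, hit_ball_runs, exp_runs):
    """DIRVA+ = DIRVA runs + Hit Ball runs - Exp Runs (negative = runs saved)."""
    return dirva_runs + hit_ball_runs - exp_runs


def wins_saved(dirva_plus_runs):
    """Convert runs to wins, flipping the sign so runs saved count as positive wins."""
    return -dirva_plus_runs / RUNS_PER_WIN


# Greinke's 2009 row from the starters table: -41.6 DIRVA runs,
# -6.6 Hit Ball runs, 15.7 Exp Runs.
greinke = dirva_plus(-41.6, -6.6, 15.7)
greinke_wins = wins_saved(greinke)
```

Rounded to one decimal, greinke reproduces the -63.9 shown in the table, which is a little more than six wins at 10 runs per win.<br />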
<br />
Look at the DIRVA+ run values for No. 3 Adam Wainwright down through No. 8 Dan Haren.  No wonder the NL Cy Young was so controversial.  That’s a pretty tight grouping of pitchers.  I am sure the four runs separating No. 3 from No. 8 are within the margin of error for this metric.<br />
<br />
The other interesting aspect of the chart for me was the variety of run values for hit-ball runs.  Remember these have already been adjusted for the quality of the defense on the pitcher’s team, so what remains should be mostly luck according to DIPS theory.   If so, they should regress back toward zero with multi-year sample sizes.  You’ll have to wait for Part 2 to find out if they do.<br />
<br />
I’ll leave you with a chart of the 10 best relievers by DIRVA+.  WPA totals are another powerful way to judge relievers because the leverage of the gamestate is built into WPA, so I am not sure how much using DIRVA+ for relievers adds, but it can’t hurt to compare the two metrics.<br />
<pre>Top 10 DIRVA+ Relievers 2009
Pitcher                   Innings   Exp Runs  DIRVA Runs Hit Ball Runs  DIRVA+    LI
1. Michael Wuertz             78        5.4       -16.8         -6.6    -28.8    1.25
2. Joe Nathan                 68        4.7       -12.0        -10.9    -27.6    1.86
3. Andrew Bailey              83        5.7        -9.0        -12.4    -27.1    1.41
4. Jonathan Papelbon          68        4.7       -13.9         -5.8    -24.4    2.17
5. Mariano Rivera             66        4.6        -8.7         -9.0    -22.3    1.72
6. Kiko Calero                60        4.1        -9.3         -8.6    -22.0    0.93
7. Jeremy Affeldt             62        4.3        -5.6         -9.4    -19.3    1.46
8. Ryan Franklin              61        4.2        -7.8         -7.1    -19.1    1.87
9. Mike Gonzalez              74        5.1       -10.5         -3.4    -19.0    1.70
10. Phil Hughes               51        3.5       -10.0         -5.5    -19.0    1.39
11. Jonathan Broxton          76        5.2       -16.6          3.0    -18.8    1.82</pre>I added Jonathan Broxton because of his very high DIRVA runs and his below average hit ball runs.]]>

</description>
      <dc:creator>Peter Jensen</dc:creator>
      <dc:date>2010-03-08T08:34:15+00:00</dc:date>

    </item>

    <item>
      <title>Using HITf/x to measure skill</title>
       
<link>http://www.hardballtimes.com/main/article/using&#45;hitf&#45;x&#45;to&#45;measure&#45;skill/</link>
<guid>http://www.hardballtimes.com/main/article/using-hitf-x-to-measure-skill/#When:05:01:15</guid>       
<description><![CDATA[Ever watch a ballgame and see three fielders converge on a pop fly before it ends up dropping for a base hit?  Did you think that batter didn’t deserve a hit?  Or perhaps the second baseman dove to the shortstop side of second base to catch a screaming line drive and your first thought was "that hitter was robbed."  Well, HITf/x was designed for you, because we can now have measures of a hitter’s or a pitcher’s ability based not on the vagaries of the plays that the fielders did or did not make, but on the quality of the batter’s hit ball.<br />
<br />
<h3 class="article_title">What is HITf/x and how can it help?</h3><br />
HITf/x uses the same camera-based technology and video footage that Sportvision uses with PITCHf/x to give accurate pitch speed and flight path information for MLB’s Gameday. As a result, the system is able to provide the same accuracy for the hit ball speed and the initial parameters of the ball’s flight path (the vertical and horizontal angles of the ball as it leaves the bat), as PITCHf/x provides for a pitched ball.<br />
<br />
I know that some of you are disappointed that HITf/x cannot also tell us accurately where the ball eventually lands and how long it takes to get there, but that would have required additional cameras to cover the entire field.  Although coverage like that eventually will happen, the additional information that HITf/x can give us without additional cameras is very useful.<br />
<br />
Another benefit is that HITf/x can be calculated from the existing video already captured for PITCHf/x analysis over the last two and a half years, so we will have a usable database of information much more quickly.  For now, we have HITf/x data only for most of the games in April of 2009, but that is enough to demonstrate its power.<br />
<br />
The act of batting is a contest between the batter and the pitcher.  The batter wants the outcome of the plate appearance to improve his team's chances of winning and the pitcher wants the opposite.  In most cases, producing the most wins for the batter's team means maximizing the number of runs his team will score during the inning.  This is what Runs Created, Linear Weights and the other advanced batting metrics estimate.<br />
<br />
Until now, almost all of these metrics have used the run value of the event outcome&mdash;out, single, double, triple, home run, walk, etc.&mdash;to determine a hitter’s offensive contribution.  But, as we have long known and as I demonstrated above, the event outcome is not always the best measure of a player’s skill.  The data from HITf/x, including speed off the bat (SOB), vertical angle (VA), and horizontal angle (HA), give us a better method of describing the skill component of the hit ball outcomes of a batter’s plate appearance than event outcomes.  Harry Pavlidis recently took <a href="http://www.hardballtimes.com/main/article/an-early-look-at-hitf-x/" target="new">an early look at some of these parameters</a>.<br />
<br />
<h3 class="article_title">The formula</h3><br />
The methodology for a skill-based batting metric is relatively simple:  Use the usual linear weight values for the non-hit ball events&mdash;strikeouts, non-intentional walks, intentional walks and hit-by-pitches&mdash;but substitute the average outcome of a hit ball described by its SOB, VA and HA for all hit ball events.  I call this metric SDBR, Skill Dependent Batting Runs.  The formula is:<br />
<br />
	SDBR = K_LW + NIBB_LW + IBB_LW + HBP_LW + HIT_BALL_FX_LW<br />
<br />
For the period 2005-2008, K_LW = -.29, NIBB_LW = .32, HBP_LW = .34.  The value that has usually been given for an intentional walk is .17 runs.  Here I have calculated the IBB value by a different method (see <a href="http://www.hardballtimes.com/main/blog_article/valuing-the-intentional-walk/" target="new">Valuing the Intentional Walk</a>) to give a value of .09 runs that more accurately reflects the average number of runs that will score after an intentional walk.  <br />
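<br />
Here is a minimal sketch of the SDBR formula in Python, using the per-event values quoted above.  It assumes that each term in the formula is the event count multiplied by its run value, and that HIT_BALL_FX_LW arrives as an already-summed run total for the batter's hit balls; the function name and the sample counts are made up for illustration.<br />

```python
# Linear-weight run values per event for 2005-2008, as quoted in the text.
K_LW = -0.29
NIBB_LW = 0.32
HBP_LW = 0.34
IBB_LW = 0.09  # the author's alternative value; the usual figure is 0.17


def sdbr(strikeouts, nibb, ibb, hbp, hit_ball_fx_runs):
    """Skill Dependent Batting Runs, assuming each term is count times weight.

    hit_ball_fx_runs is the summed bin-average run value of the batter's
    hit balls (the HIT_BALL_FX_LW term).
    """
    return (strikeouts * K_LW + nibb * NIBB_LW + ibb * IBB_LW
            + hbp * HBP_LW + hit_ball_fx_runs)


# Made-up month for a hypothetical batter: 20 K, 10 NIBB, 1 IBB, 1 HBP,
# and +3.0 runs from hit balls.
example = sdbr(20, 10, 1, 1, 3.0)
```

Because the same events are charged to the pitcher, the same function could be reused for a pitcher's totals.<br />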
<br />
The HIT_BALL_FX_LW was calculated by dividing the 14,625 non-bunt hit balls in the HITf/x Database into 198 different bins based on each hit ball’s SOB, VA, and HA.<br />
For speed off the bat, I defined the bins as 5 mph increments from 80 mph to 100 mph plus a bin for all balls hit less than 80 mph and another for all balls hit above 100 mph.<br />
For vertical angles I used 5 degree increments from -5 degrees to 40 degrees plus a bin for less than -5 and another for more than 40.<br />
For horizontal angles I used three bins: Pulled, Center and Opposite.<br />
These are obviously arbitrary decisions.  There will always be a tradeoff between  having too many bins that may be measuring random variations rather than real data differences, and too few bins that miss statistically meaningful differences.  Probably the most controversial decision I made is using only three bins for the horizontal angle.  This may underestimate the ability of some batters to control the horizontal angle of their hit balls on the pulled side because certain batters may be able to direct their hit balls into gaps.  When we have more HITf/x data and if further research proves this to be true, then it may be necessary to incorporate more horizontal angle bins.<br />
<br />
We also may decide eventually to create separate bins for left-handed and right-handed batters.  I opted for a more conservative approach at this time.  More HITf/x data also will stabilize the run values for each bin, which are calculated by averaging the linear weight run value of outs, double plays, reached-on-errors, fielder's choices, infield singles, outfield singles, doubles, triples and home runs that occurred in each bin.  The complete HIT_BALL_FX_LW table is included in a spreadsheet you can download at the end of the article.  There are only 197 bins because one bin had no hit balls.<br />
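<br />
The binning and averaging described above can be sketched in Python as follows.  This is an illustration under stated assumptions, not the author's Access queries: the text does not give the horizontal-angle cutoffs, so the field ("pull", "center" or "opposite") is passed in as a label, and a value that falls exactly on a bin edge is assigned to the higher bin.  All names are hypothetical.<br />

```python
from bisect import bisect_right

SPEED_EDGES = [80, 85, 90, 95, 100]                    # mph -> 6 speed bins
ANGLE_EDGES = [-5, 0, 5, 10, 15, 20, 25, 30, 35, 40]   # degrees -> 11 vertical bins
FIELDS = ["pull", "center", "opposite"]                # 3 horizontal bins


def bin_key(sob, va, field):
    """Return (speed bin, vertical bin, field) for one hit ball.

    Assumption: a value exactly on an edge goes to the higher bin.
    """
    if field not in FIELDS:
        raise ValueError("unknown field: " + field)
    return (bisect_right(SPEED_EDGES, sob), bisect_right(ANGLE_EDGES, va), field)


def bin_run_values(hit_balls):
    """Average the linear-weight value of the outcomes in each bin.

    hit_balls is an iterable of (sob, va, field, lw_value) tuples.
    """
    totals = {}
    for sob, va, field, lw in hit_balls:
        key = bin_key(sob, va, field)
        total, count = totals.get(key, (0.0, 0))
        totals[key] = (total + lw, count + 1)
    return {key: total / count for key, (total, count) in totals.items()}


# 6 speed bins x 11 vertical bins x 3 fields.
N_BINS = (len(SPEED_EDGES) + 1) * (len(ANGLE_EDGES) + 1) * len(FIELDS)

# Two made-up hit balls that land in the same bin.
values = bin_run_values([(92, 12, "pull", 0.5), (93, 13, "pull", -0.25)])
```

N_BINS works out to the 198 possible bins mentioned in the text; a bin with no hit balls simply never appears in the returned dictionary, which is how one empty bin would leave 197.<br />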
<br />
<h3 class="article_title">The results</h3><br />
Here's an example of what you can see in the data.  This is the average Linear Weight Value of all hit balls based on the speed of the ball and its horizontal angle off the bat.<br />
<pre> Speed      Pull     Center   Opposite   TOT
  <80      -0.14     -0.14     -0.10    -0.13
80 - 85    -0.05     -0.12     -0.09    -0.09
85 - 90    -0.03     -0.11     -0.07    -0.07
90 - 95     0.14     -0.03      0.00     0.04
95 - 100    0.27      0.11      0.18     0.18
  >100      0.53      0.44      0.36     0.47
  TOT       0.08      0.02     -0.02     0.03</pre>As you can see (look in the total column and row), the value of a hit ball goes up as the speed off the bat increases.  Also, pulled hits have more value than hits to the opposite field for all but the most slowly hit balls.<br />
<br />
Over the course of two or three years, a batter’s SDBR will be very close to his traditional linear weight runs.  The reason is that all those "robbed" base hits and the "gimme" base hits will cancel each other out in the larger sample size.  The advantage of SDBR is that it should stabilize over a much smaller sample size than LW-based runs&mdash;possibly in as few as 200 to 300 plate appearances.  We won’t know for sure until we have longer runs of HITf/x data, but if SDBR does stabilize more quickly, then it will provide a much more accurate basis for aging studies and player projections, and it will identify actual changes in a player’s skill level more quickly.<br />
<br />
Another advantage is that the formula for calculating Skill-Dependent Pitching Runs (SDPR) is exactly the same as for SDBR, at least for starting pitchers.  The reason is given in the first sentence of the third paragraph: "The act of batting is a contest between the batter and the pitcher."  When we define the result of that contest in a way that excludes the fielders’ contributions, as SDBR and SDPR do, then the runs that the batter receives when he wins the "contest" are exactly the same as the runs the pitcher loses, and vice versa.  <br />
<br />
You probably recognize the similarity of the SDPR/SDBR formula to the more advanced formulas  for pitching value that have been derived from <a href="http://www.baseballprospectus.com/article.php?articleid=878" target="new">Voros McCracken’s DIPS theory</a>.  SDPR values a pitcher’s actual strikeouts, non-intentional walks, intentional walks, and HBPs just like DIPS, FIP, tRA, xFIP, and LIPS do.<br />
<br />
The difference between SDPR and those formulas is in how SDPR values hit balls in play and HRs.  SDPR values both by their linear weights determined by their initial HITf/x parameters.  The other formulas use various methods based on event outcomes.   The question of whether to give any predictive value to a pitcher’s balls in play has been controversial since Voros introduced the DIPS concept.  When more data become available through HITf/x, SDPR should be able to provide a definitive answer to the controversy.<br />
<br />
For relief pitchers, SDPR provides a good basis for projected value because it accurately defines a pitcher's skill in runs.  However, to project his overall future value to his team, it is necessary to adjust his SDPR to account for the leverage of the situations in which he will be used.  This can be done by multiplying his SDPR by the average leverage value of the role in which he will be used using <a href="http://www.hardballtimes.com/main/article/crucial-situations/" target="new">Tom Tango’s Leverage Index</a>.<br />
<br />
In closing, here are a few lists made possible by the HITf/x data.  These were April's "luckiest" batters according to the HITf/x data.  Note that the list includes some of the best groundball hitters in the majors.<br />
<pre>First   Last       Diff
Carl    Crawford    5.9
Akinori Iwamura     5.8
Kevin   Youkilis    5.8
Adam    Jones       5.7
Denard  Span        5.7
Nyjer   Morgan      5.6
Chris   Getz        5.1
Jason   Kubel       4.8
Chase   Utley       4.6
Brad    Hawpe       4.5</pre>And here are the "unluckiest" major league batters.<br />
<pre>First    Last       Diff
Brian    Giles      -9.7
J.J.     Hardy      -6.9
Carlos   Guillen    -6.4
Nelson   Cruz       -5.8
Grady    Sizemore   -5.6
Brandon  Phillips   -5.4
Adrian   Beltre     -4.7
Yunel    Escobar    -4.7
Randy    Winn       -4.7
Russell  Martin     -4.2</pre>You knew Brian Giles wasn't <b>that</b> bad, right?  Now, onto the pitchers.  First up, the "luckiest" hurlers:<br />
<pre>First    Last            Diff
Tim      Wakefield       -8.9
Glen     Perkins         -7.6
Ross     Ohlendorf       -7.1
John     Maine           -6.0
Jair     Jurrjens        -5.9
Joba     Chamberlain     -5.6
Kevin    Millwood        -5.6
James    Shields         -5.6
Koji     Uehara          -5.6
Zack     Greinke         -5.3</pre>You can see DIPS theory at work here, as successful major league knuckleballers tend to have more favorable outcomes on batted balls.  Finally, the "unlucky" ones.<br />
<pre>First    Last            Diff
Vicente  Padilla          5.4
Brett    Myers            4.8
Jon      Lester           4.6
Mark     Hendrickson      4.3
Tim      Lincecum         4.0
Scott    Olsen            4.0
Aaron    Cook             3.9
Andy     Sonnanstine      3.9
Kevin    Slowey           3.8
Ian      Snell            3.2</pre>I wouldn't feel right if Ian Snell weren't on this list.]]>

</description>
      <dc:creator>Peter Jensen</dc:creator>
      <dc:date>2009-06-30T05:01:15+00:00</dc:date>

    </item>

    <item>
      <title>Valuing the intentional walk</title>
       
<link>http://www.hardballtimes.com/main/blog_article/valuing&#45;the&#45;intentional&#45;walk/</link>

<guid>http://www.hardballtimes.com/main/blog_article/valuing-the-intentional-walk/#When:23:37:15</guid>
       
<description><![CDATA[]]>

</description>
      <dc:creator>Peter Jensen</dc:creator>
      <dc:date>2009-06-28T23:37:15+00:00</dc:date>

    </item>

    <item>
      <title>Using Gameday to build a fielding metric (Part 3)</title>
       
<link>http://www.hardballtimes.com/main/article/using&#45;gameday&#45;to&#45;build&#45;a&#45;fielding&#45;metric&#45;part&#45;3/</link>
<guid>http://www.hardballtimes.com/main/article/using-gameday-to-build-a-fielding-metric-part-3/#When:05:01:15</guid>       
<description><![CDATA[I spent a <a href="http://www.hardballtimes.com/main/article_preview/using-gameday-to-build-a-fielding-metric-part-1/" target="new">couple</a> of <a href="http://www.hardballtimes.com/main/article_preview/using-gameday-to-build-a-fielding-metric-part-2/" target="new">articles</a> detailing a methodology for turning publicly available Gameday data into an advanced fielding metric. Now for the results.<br />
<br />
Here is <a href="http://www.hardballtimes.com/images/uploads/BZMs_2008.xls" target="new">a link to download an Excel spreadsheet with 2008 BZMs</a>.  The 2008 BZM All Stars by position (based on Linear Weight Runs Saved per 150 games) were:<br />
<br />
1B: Lance Berkman (-29)<br />
2B: Adam Kennedy (-19)<br />
SS: Mike Aviles (-23)<br />
3B: Scott Rolen (-28)<br />
LF: Jay Payton (-31)<br />
CF: Jody Gerut (-36)<br />
RF: Franklin Gutierrez (-41)<br />
<br />
The spreadsheet speaks for itself.  But the numbers may not always speak clearly, so I’ll be happy to answer questions about them.   There are some surprise rankings, but they're mostly what you would expect.   I have made every effort to check for mistakes in calculations, but that doesn’t mean that mistakes haven’t occurred.  If you spot anything that looks suspicious, please let me know.  I’ll welcome the fresh eyes.  There are 249 Access queries just for the fielding metric portion alone, so there was plenty of opportunity to make stupid mistakes. <br />
<br />
Here are some advance answers to some general questions that I am sure will come up.<br />
<br />
<b>Are the Gameday data as good as STATS or BIS data?</b>  No.  I have already discussed the problem of Gameday giving the location of where the ball is picked up instead of where the ball lands on base hits.  But in addition some fielding locations are missing and others seem to be clearly in error.  The Gameday data had to be normalized because of differences in the hit recording diagrams and the normalization process will create errors.  <br />
<br />
<b>Won’t this lack of quality in the Gameday data make the BZM numbers much less accurate than UZR or Plus/Minus?</b>  Even though the raw Gameday data are less accurate than BIS or STATS raw data, they are still very close.  The omissions and errors that I mentioned above are present in only about 1 percent of the raw data and are fewer in the most recent years.<br />
<br />
All three data sources depend on human observation and no one knows which of the human observers has been most accurate in the data collection.  The variation caused by errors in the Gameday data appears to be very small compared to the variation caused by the human observation process.  And the final differences in the processed numbers of BZM, UZR  and Plus/Minus are much more a result of the analysis within the metrics themselves than in the data they use.<br />
<br />
<b>Why did you choose to use the single supersize zone instead of the smaller zones in UZR and Plus/Minus?</b>  One reason was the question of accuracy in the Gameday raw data.  A single supersize zone meant that any inaccuracies would affect the analysis only at the two edges of the zone instead of at all the edges of the smaller zones.  But even with guaranteed accurate raw data, a single super zone may still be the best method.<br />
<br />
You want your fielding metric to be measuring data that reflect an actual skill difference of the fielders and not a quirk in the distribution of hit balls to those fielders.  The smaller sample size of a single year’s data combined with the even smaller samples of the smaller zones makes it far more likely that you are measuring distribution quirks instead of skill.<br />
<br />
<b>Why didn’t you then choose to use no zones at all like TotalZone, SFR and OPA!?</b>  I considered it.  I had developed a whole-field fielding metric of my own several years ago.  The problem that I saw was in calculating the expected outs.  Having no zones meant that the areas of responsibility for adjacent fielders would overlap, making the calculation for expected outs for each fielder more complicated.  I thought the supersize zone system was the better solution.<br />
<br />
<b>How important is the location of where you set the boundaries for each supersize zone?</b>  I truly don’t know.  The main difference would be in the calculation for OOZ runs.  Currently, if a player makes a play out of zone he gets the out value for making that play. On the other hand, he doesn’t receive the run value for saving a hit on that play as he does if the ball had been in his zone.  I chose to do this because I felt there was no assurance that the ball would have been a hit rather than have been caught by the adjacent fielder.<br />
<br />
There are certainly many cases in which it would have been a hit and the fielder making the OOZ play is being cheated out of runs that he deserves.  But there are certainly cases where the adjacent fielder also could have made the play, so that giving the full hit value for an OOZ play made would be overly rewarding the player.  There may be a compromise solution that would treat each position’s OOZ plays in an individual manner. Or changing the superzone boundaries a bit may yield better results.<br />
<br />
<b>Why did you feel that it was better to use your method of park adjusting for outfielders than the traditional method of calculating park factors?</b>  I tried the traditional method and found that the year-to-year variation of the results was more than I thought was correct, even using multiple years.  Some of that variation was because the traditional method is affected by changes in the other parks.  Other parks don’t affect the park adjustment calculated by my method.<br />
<br />
My park adjustment is biased by the quality of the fielders who actually played in that park.  I don’t adjust for that and BZM would be more accurate if I did, but the calculations to do so seemed too involved for a minimal gain.  The biggest problem would be in the new Washington Park where there is only one year of data on which to park adjust.<br />
<br />
<b>Why did you choose 1/12 as the weighting for home field stats in the park adjustment?</b>  Things would be a lot easier if there were still balanced schedules and no interleague play.  I would then use 1/(x-1), where x is the number of teams in the league, and every team would be equally represented in each park for each separate league.  I chose 1/12 for the home team because I wanted the home team to be represented at an amount that was closer to that of the teams in its own division than 1/13 or 1/15 would have been.  I experimented with other fractions in my earlier whole-field metric and found that the adjustment wasn’t particularly sensitive to the actual fraction that was chosen.<br />
<br />
<b>You say that you used batter handedness as a factor in your calculations, but I don’t see it in the spreadsheet.  Where is it?</b>  It is one step back in the process.  The rankings are the final step where the numbers from left-handed and right-handed batters have been combined.  Although there is a significant difference in the rates at which balls hit to the opposite field are fielded for an out, the differences in how each fielder performed were minimal, and I didn’t find them interesting enough to report. <br />
<br />
<b>Why didn’t you calculate arm ratings for outfielders like UZR and Plus/Minus do?</b>  I am not confident in the methods that I have seen used to calculate arm ratings.  There are certainly differences in the abilities of outfielders to throw a ball strongly and accurately, but in the calculations that I have seen those differences in skill are overwhelmed by the variation created by other factors outside of the fielder’s control.<br />
<br />
For now the difference between RVA/150 and LW/150 when calculated over multiple years reflects an outfielder's arm skill and his strategy in adjusting to different base-out situations.  For that reason I would use the multiple year RVA/150 for projecting a player’s future true ability when I had multiple years available.  Otherwise, I would stick with LW/150 and assume that arm ratings are not that important.<br />
<br />
<b>Are you going to do BZMs for pitchers and catchers?</b>  Pitchers, yes, eventually.  Catchers, no. Most of a catcher's value is in how he handles his pitching staff, and very little of it comes on hit balls.  I haven’t tackled pitchers’ fielding yet because it is not usually calculated and I wanted to make the other numbers available for comparison purposes.<br />
<br />
<b>Are you going to evaluate bunts, popups and line drives for infielders and ground balls for outfielders?</b>  Bunts, yes, eventually, for third basemen, first basemen and pitchers.  Popups and line drives, probably no.  Ground balls for outfielders, I’m not sure.<br />
<br />
Covering bunts is a fielding skill and should be evaluated.  Covering popups is also a fielding skill.  Unfortunately, I am missing a vital piece of information to evaluate the skill properly.  I need to know how many foul popups don’t get caught.  Fielding infield line drives is mostly a matter of being in the right spot at the right time and of variations in hit ball distribution.  Covering outfield ground balls and preventing extra bases is a skill and should be evaluated, but Gameday doesn’t give enough information to do so.  With HITf/x it might be possible.<br />
<br />
Finally, thanks to Retrosheet for making this kind of analysis possible.  Some of the information used here was obtained free of charge from and is copyrighted by Retrosheet.  Interested parties may contact Retrosheet at <a href="http://www.retrosheet.org">http://www.retrosheet.org</a>.]]>

</description>
      <dc:creator>Peter Jensen</dc:creator>
      <dc:date>2009-03-17T05:01:15+00:00</dc:date>

    </item>

    <item>
      <title>Using Gameday to build a fielding metric (Part 2)</title>
       
<link>http://www.hardballtimes.com/main/article/using&#45;gameday&#45;to&#45;build&#45;a&#45;fielding&#45;metric&#45;part&#45;2/</link>
<guid>http://www.hardballtimes.com/main/article/using-gameday-to-build-a-fielding-metric-part-2/#When:05:20:15</guid>       
<description><![CDATA[If you read <a href="http://www.hardballtimes.com/main/article/using-gameday-to-build-a-fielding-metric-part-1/" target="new">Part 1 of this series</a>, and you are still interested, you may be wondering whether the MLB hit location data is worth all the trouble.  Is the data accurate enough to be useful in any serious analysis?  It does suffer from a serious flaw.<br />
<br />
Since it was created to fill the specific need of providing information for <a href="http://www.mlb.com/mlb/gameday/" target="new">Gameday’s</a> Player Hit Charts, a conscious decision was made to record where a hit ball was ultimately fielded rather than where it first hit the ground.  So a line drive hit that barely clears the shortstop's glove and first hits the ground at 225 feet but rolls all the way to the wall will be recorded at 350 feet.<br />
<br />
At first this may seem to make the data unusable, but for a fielding metric it's not much of a detriment.  Assigning responsibility for hit balls to a specific fielder is more about having an accurate measure of angle than it is about having an accurate distance.  Every analyst, of course, wants it all: every scrap of data that can be collected as accurately as possible.  That’s why we often look starry-eyed toward the future when electronic collection will tell us where the ball and the fielders are to within a fraction of a foot for every millisecond of the game.  Oh, and we want it for free, of course.<br />
<br />
The Gameday data is free, and two other sources of hit location data, <a href="http://www.stats.com" target="new">STATS</a> and <a href="http://www.baseballinfosolutions.com" target="new">Baseball Info Solutions</a>, are not.  BIS is the source for John Dewan’s <a href="http://www.actasports.com/detail.html?id=990" target="new">plus/minus fielding metric</a> which many feel is one of the best fielding metrics available.<br />
<br />
Its main competitor is Mitchel Lichtman’s <a href="http://www.baseballthinkfactory.org/files/primate_studies/discussion/lichtman_2003-03-14_0/" target="new">Ultimate Zone Rating</a> (UZR), which has been calculated from both BIS and STATS data.  Using the same metric on the two different data sources had the unexpected result of giving <a href="http://www.insidethebook.com/ee/index.php/site/comments/uzr_on_fangraphs_using_bis_on_ichiro/" target="new">substantially different fielding values for identical players</a>.  This was seen as a setback for the confidence in defensive metrics in general, but for our purposes it’s more of an opportunity.  It’s harder to criticize a metric based on Gameday hit locations when the two for-pay sources have such a large margin of error.  <br />
<br />
I don’t have the entire BIS and STATS datasets, but I do have hit-ball data from both sources for over 500 outs from last year’s <a href="http://www.hardballtimes.com/thtstats/main/player/731/torii-hunter" class="player">Torii Hunter</a>/<a href="http://www.hardballtimes.com/thtstats/main/player/96/andruw-jones" class="player">Andruw Jones</a> project.  For that small subset the standard deviations of the differences between Gameday hit-ball angles and those from BIS and STATS were 3.1 and 3.4, respectively. The standard deviation between BIS and STATS was about 2.8.<br />
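<br />
For readers who want to run the same kind of comparison on their own paired data, here is a minimal sketch.  The angle values are made up for illustration, and using the sample standard deviation is an assumption; it is not necessarily the exact calculation behind the figures above.<br />
<br />
```python
import statistics

def angle_difference_spread(angles_a, angles_b):
    """Sample standard deviation of paired hit-ball angle differences."""
    differences = [a - b for a, b in zip(angles_a, angles_b)]
    return statistics.stdev(differences)

# Made-up angles for the same four hit balls as recorded by two sources.
gameday = [50.0, 72.5, 95.0, 110.0]
other_source = [48.0, 75.0, 94.0, 113.5]
print(round(angle_difference_spread(gameday, other_source), 2))
```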
<br />
I concluded that a UZR or plus/minus type system that divides the field into small sectors and compares a fielder’s performance in each sector with an average fielder would probably suffer problems from this amount of potential error.  The best use of the data would seem to be as a supplement to a whole-field type of fielding metric. <br />
<br />
Several whole-field fielding metrics have been published over the last year or so. <a href="http://www.hardballtimes.com/main/article/measuring-defense-for-players-back-to-1956/" target="new">Sean Smith’s TotalZone</a> was the first to be published, followed by <a href="http://www.baseballprospectus.com/article.php?articleid=7072" target="new">Dan Fox’s Simple Fielding Runs (SFR)</a>, and <a href="http://statspeak.net/2008/11/the-2008-opa-gold-and-lead-gloves.html" target="new">PizzaCutter’s OPA!</a>.  Each uses the Retrosheet play-by-play data to construct a system that has more accurate inputs (and hopefully more accurate results) than non-PBP metrics.  At the time they were conceived, their authors admitted that they could not hope to be as accurate as a system that included hit location data, but neither UZR nor Dewan’s plus/minus results were being made publicly available for the current year.<br />
<br />
Since then the restrictions on UZR have been lifted and UZR fielding values are available on <a href="http://www.fangraphs.com" target="new">Fangraphs</a> and plus/minus results are available on <a href="http://www.billjamesonline.net" target="new">Bill James Online</a>.  If UZR and plus/minus were in close agreement for players’ fielding values, there would be little value in constructing a new fielding metric based on Gameday data.  But they are not.  A Gameday-based fielding metric has just as much chance to be taken seriously on its own merits.<br />
<br />
Plus, the process of constructing a metric from Gameday data has much to offer in evaluating the relative value of zone-based versus whole-field metrics, and how fielding metrics should be constructed in the future when more accurate hit-ball data becomes available.<br />
<br />
Some people have said that it is logically impossible to construct a whole-field type fielding metric that will be as accurate as a zone-based system.  On its face the argument seems to have merit.  How can you gain accuracy by ignoring more detailed information?<br />
<br />
The answer is, you can’t.  You can’t if you are positive that the more detailed information is a true reflection of the skills you are trying to measure, and not the result of either measurement error or an erratic distribution caused by small sample size.<br />
<br />
If measurement error and small sample size are potential problems, and they are with zone data, then a possible solution is aggregating the data into larger samples.  Measurement errors tend to cancel out and distributions smooth in larger sample sizes.  This is not a new concept in sabermetrics.  Both our batting metrics and pitching metrics are based on aggregated data.<br />
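<br />
As a rough illustration of why aggregation helps, the sketch below computes the binomial standard error of an observed out rate for one small zone and for one large zone.  The 70 percent out rate and the chance counts are made-up values, not figures from the data.<br />
<br />
```python
import math

def out_rate_standard_error(true_rate: float, chances: int) -> float:
    """Binomial standard error of an observed out rate over `chances` balls."""
    return math.sqrt(true_rate * (1.0 - true_rate) / chances)

# Hypothetical values: a 70% out rate, 25 balls in a small zone, 400 in a big zone.
small_zone = out_rate_standard_error(0.70, 25)
big_zone = out_rate_standard_error(0.70, 400)
print(f"small zone: +/-{small_zone:.3f}, big zone: +/-{big_zone:.3f}")
```
<br />
With sixteen times as many chances, the sampling noise in the observed rate is cut to a quarter.<br />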
<br />
<h3 class="article_title">Details, details</h3><br />
Where to begin?  All fielding metrics share the same two basic concepts.  How many plays did a fielder make compared to how many plays he "should" have made? And how many runs did he save on his made plays compared to the number of runs cost on the plays he did not make?  Sounds simple, but the devil is in the details, and there are a lot of details.<br />
<br />
"Plays made" is relatively simple.  If a player fields a ball and causes either the batter to be out by catching the ball in the air or forcing him at first, or forces another runner, it always counts as a play made.  If the batter or runner would have been out but the fielder receiving the thrown ball makes an error, it counts as a play made.<br />
<br />
The only controversial decision on plays made is how to handle the fielder’s choice.  Most analysts classify a fielder’s choice as a play made when an out occurs on the play.  That is the method I have chosen.  But there are arguments for ignoring all fielder’s choices, or classifying them all as plays made.  An individual fielder usually averages fewer than two fielder's choices a year, so the practical consequences of choosing one method over another are minimal.<br />
<br />
Whole-field systems and Dewan’s plus/minus count all of a fielder’s plays made in one calculation.  Zone systems divide them into in-zone plays and out-of-zone (OOZ) plays.  I have chosen to divide the infield into four large zones, each the responsibility of a fielder other than the pitcher.  Likewise, the outfield is divided into three zones, one for each outfielder.  Plays made are divided into in-zone plays and out-of-zone plays.<br />
<br />
Deciding how to calculate how many plays a fielder should have made (Expected Outs) is one of the two crucial decisions for any fielding metric.  For zone-based systems, the typical method is to calculate the percentage of plays that are made by a league average fielder for a ball in that zone and multiply that percentage times the number of balls hit in that zone for each fielder.  This raw number can be further adjusted by park factors, a pitching factor, or a guesstimate of how hard the ball was hit.<br />
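<br />
The basic zone calculation can be sketched in a few lines.  The function names and the example numbers are hypothetical, and none of the optional park, pitching or hit-speed adjustments are included.<br />
<br />
```python
def expected_outs(league_out_rate: float, balls_in_zone: int) -> float:
    """Expected outs = league-average out rate in the zone times balls hit there."""
    return league_out_rate * balls_in_zone

def plus_minus(actual_outs: int, league_out_rate: float, balls_in_zone: int) -> float:
    """Plays made above (positive) or below (negative) the league-average fielder."""
    return actual_outs - expected_outs(league_out_rate, balls_in_zone)

# Hypothetical fielder: 400 balls in his zone, league out rate 0.80, 328 outs made.
print(plus_minus(328, 0.80, 400))
```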
<br />
The details of how each fielding metric handles these decisions are most often not available, and the differences between metrics can have large effects on the results.  Dewan’s original plus/minus did not adjust for park effects for outfielders.  He has since implemented a park effect by designating balls that hit the outfield wall above an outfielder’s reach as unfieldable.  I believe that UZR park-adjusts in a conventional manner for both infielders and outfielders.<br />
<br />
I have chosen to park-adjust for outfielders only, and to do it by a method used in some fashion by each of the whole-field metrics.  Instead of creating a park-zone fielding factor based on home and away splits, I use a park-average fielding percentage based on a multi-year average of all fielders who have played that outfield position in that park.  I have weighted the factor by using all the stats from visiting players and 1/12 the stats from home players; this way, a particularly good or bad home team fielder won’t overly influence the results.  Other whole-field systems have chosen different weighting methods.<br />
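<br />
Here is a minimal sketch of that weighting.  It assumes the park average is formed by pooling outs and chances, with the home players' totals scaled by 1/12; the pooling detail and the example totals are assumptions for illustration.<br />
<br />
```python
HOME_WEIGHT = 1 / 12  # weighting applied to home-player stats, as described above

def park_out_rate(visitor_outs, visitor_chances, home_outs, home_chances,
                  home_weight=HOME_WEIGHT):
    """Park-average out rate: all visitor stats plus a fraction of home stats."""
    outs = visitor_outs + home_weight * home_outs
    chances = visitor_chances + home_weight * home_chances
    return outs / chances

# Hypothetical multi-year totals for one outfield position in one park:
# visitors convert 900 of 1200 chances, a strong home fielder 960 of 1200.
print(round(park_out_rate(900, 1200, 960, 1200), 3))
```
<br />
In this made-up case the unweighted rate would be 1860/2400, so the 1/12 weighting pulls the park figure back toward what visiting fielders achieved.<br />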
<br />
I do not adjust for pitching quality or hit-ball speed.  But I do adjust for the handedness of the batter.<br />
<br />
I also make some adjustments for infielders that plus/minus or UZR might make (I'm not sure).  It seems obvious that if the ball is fielded by a player before it reaches a shortstop then it shouldn’t count as a chance for the shortstop.  So I remove from the total of hit balls in the shortstop’s zone all those balls fielded by a fielder in front of the shortstop.  I make the same adjustment for all infielders even though it mostly affects shortstops and second basemen.<br />
<br />
Whole-field metrics have an issue that zone systems don’t have: How to allocate the ground ball hits that go past an infielder or the air ball hits that fall between the outfielders.  Because they have no hit-ball location information, whole-field metrics don’t know who the closest fielder was to the hit.  The only solution is to use a fixed allocation based on league averages.  But a fixed allocation cannot take into account the relative fielding abilities of the two adjacent fielders.<br />
<br />
If the league average says that a shortstop usually is responsible for 60 percent of the ground ball hits that go between an average shortstop and third baseman, it causes significant inaccuracies to apply that percentage to a specific fielder when a shortstop is either much better or much worse than the third baseman next to him.  This is one of the main failings of whole-field metrics and is the reason they report results with less range between good and bad fielders at a position than do zone-based systems.<br />
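<br />
A toy version of the fixed allocation makes the problem visible.  The function name and the count of 50 hits are hypothetical; the 60 percent share is the example figure used above.<br />
<br />
```python
def allocate_hits(hits_between: int, shortstop_share: float = 0.60):
    """Split hits through the hole using a fixed league-average share."""
    to_shortstop = hits_between * shortstop_share
    to_third_base = hits_between - to_shortstop
    return to_shortstop, to_third_base

# The split is identical whether the shortstop is a great fielder or a poor one.
print(allocate_hits(50))
```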
<br />
<h3 class="article_title">Converting plays to runs</h3><br />
I end up with a plus/minus number of outs for each fielder’s one big zone and a separate plus/minus for the plays he makes outside of his zone.  I then have to convert the plus/minus scores into runs.  This conversion from plus/minus plays to runs is one of the two big sources of variation in the reported results of the different metrics.<br />
<br />
The simplest method (let's call it the "generic value") is to multiply plus/minus plays made by the average linear weight difference between an out and an average hit, usually estimated at about .8 for infielders and .9 for outfielders.  This method is simple, but there is no excuse for using it for all positions.  For shortstops and second basemen it gives a reasonable value because almost all hits on plays at those positions are singles.  But even for those two positions there are differences between fielders in the number of hits that are infield singles or singles to the outfield.<br />
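<br />
As a sketch, the generic value amounts to a single multiplication.  The dictionary and function names are hypothetical, and positive numbers here mean runs saved.<br />
<br />
```python
# Approximate run value of turning an average hit into an out, from the text above.
GENERIC_RUN_VALUE = {"infield": 0.8, "outfield": 0.9}

def generic_runs(plus_minus_plays: float, position_group: str) -> float:
    """Runs saved = plays made above average times the generic run value."""
    return plus_minus_plays * GENERIC_RUN_VALUE[position_group]

# Hypothetical fielders who each made 10 more plays than expected.
print(generic_runs(10, "infield"), generic_runs(10, "outfield"))
```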
<br />
There are also differences in the number of errors made.  Third basemen and first basemen have to adjust their on-field positioning due to the additional risk of allowing doubles if a hit ball gets by them on the foul line side.  For outfielders, the distribution of singles, doubles and triples allowed is a key driver of their total runs allowed. <br />
<br />
One possible solution is to track the plus/minus of infield singles, outfield singles, doubles, triples and errors just as you track plus/minus for plays made.  You can then apply the appropriate linear weight (LW) to each event and create a plus/minus run total.   Another solution is to measure the run value added (RVA) for each fielding event directly by calculating the increase of expected runs scored from the base-out state prior to the play to the base-out state after the play, adding any runs that actually scored on the play.<br />
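<br />
The two approaches can be sketched side by side.  The run-expectancy values, linear weights and plus/minus counts below are placeholders chosen for illustration, not the values used in this metric; negative totals mean runs saved.<br />
<br />
```python
# Hypothetical run-expectancy values keyed by (runners on base, outs).
RUN_EXPECTANCY = {
    ("empty", 0): 0.50,
    ("first", 0): 0.90,
    ("empty", 1): 0.27,
}

def run_value_added(before, after, runs_scored=0):
    """RVA = expected runs after the play - expected runs before + runs scored."""
    return RUN_EXPECTANCY[after] - RUN_EXPECTANCY[before] + runs_scored

def linear_weight_runs(plus_minus_by_event, weights):
    """LW runs = sum over events of (actual - expected count) * linear weight."""
    return sum(plus_minus_by_event[e] * weights[e] for e in plus_minus_by_event)

# A single that gets through with the bases empty and nobody out.
single_rva = run_value_added(("empty", 0), ("first", 0))
# The same ball turned into an out.
out_rva = run_value_added(("empty", 0), ("empty", 1))
print(f"hit: {single_rva:+.2f}, out: {out_rva:+.2f}")

# Hypothetical linear weights and plus/minus counts for one outfielder.
weights = {"single": 0.47, "extra_base": 0.85, "out": -0.27}
deltas = {"single": 2, "extra_base": -6, "out": 4}
print(f"LW runs vs. average: {linear_weight_runs(deltas, weights):+.2f}")
```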
<br />
The advantage of the linear weights (LW) method is that it avoids run variation due to uneven distributions of base-out states.  The advantage of the RVA method is that it includes run differences from double plays and arm ratings without having to create separate calculations.  It is also possible that fielders make strategic decisions on how they will play the ball for a given base-out state.  A practical consideration is that the RVA method is much easier to calculate.<br />
<br />
I haven’t completely decided where I stand on the issues presented by the LW and RVA systems, so I am going to present both results for this metric.  By now you are probably thoroughly confused.  The best way to understand the methodology is to see an example.<br />
<br />
Here is part of the 2008 center field line for <a href="http://www.hardballtimes.com/thtstats/main/player/589/carlos-beltran" class="player">Carlos Beltran</a>. An FB is a fly ball, an LD is a line drive; EXP stands for Expected and ACT stands for Actual.<br />
<br />
<table width="100" cellspacing="1" cellpadding="2" border="1"><tr><td align="center" colspan="2">CHANCES</td><td align="center" colspan="4">OUTS</td><td align="center" colspan="4">OOZ_OUTS</td><td align="center" colspan="4">SINGLES</td>	<td align="center" colspan="4">XBASES</td></tr><tr><td align="center">FB</td><td align="center">LD</td><td align="center" colspan="2">FB</td><td align="center" colspan="2">LD</td><td align="center" colspan="2">FB</td><td align="center" colspan="2">LD</td><td align="center" colspan="2">FB</td><td align="center" colspan="2">LD</td><td align="center" colspan="2">FB</td><td align="center" colspan="2">LD</td></tr><tr>	<td align="right">ACT</td><td align="right">ACT</td><td align="right">EXP</td><td align="right">ACT</td><td align="right">EXP</td><td align="right">ACT</td><td align="right">EXP</td><td align="right">ACT</td><td align="right">EXP</td><td align="right">ACT</td><td align="right">EXP</td><td align="right">ACT</td><td align="right">EXP</td><td align="right">ACT</td><td align="right">EXP</td><td align="right">ACT</td><td align="right">EXP</td><td align="right">ACT</td></tr><tr>	<td align="right">397</td><td align="right">248</td><td align="right">318</td><td align="right">328</td><td align="right">42</td><td align="right">44</td><td align="right">25</td><td align="right">42</td><td align="right">NA</td><td align="right">NA</td><td align="right">25</td><td align="right">26</td><td align="right">146</td><td align="right">157</td><td align="right">31</td><td align="right">24</td><td align="right">27</td><td align="right">16</td></tr></table><br />
<br />
So the first thing we see is that Carlos had a very good year.  He was a +12 for balls in his zone (subtracting the differences between Actual and Expected for both fly balls and line drives) and an astounding +17 on balls OOZ.  His zone of responsibility ran from 74.7 degrees to 102 degrees for left handers and 76.5 to 103.7 degrees for right handers.  It may seem counterintuitive to give the CF less than an angular third of the field but you have to remember that there is a lot more square footage to cover in that 27-degree slice.  I drew the boundaries where roughly 50% of the fly ball hits fell on each side of the line for the league as a whole.<br />
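<br />
The +12 and +17 can be reproduced directly from the OUTS and OOZ_OUTS columns of the table above.  The dictionary layout and function name are just one hypothetical way to hold the numbers.<br />
<br />
```python
# Expected (exp) and actual (act) outs read from the Beltran table above.
in_zone_outs = {"FB": {"exp": 318, "act": 328}, "LD": {"exp": 42, "act": 44}}
ooz_outs = {"FB": {"exp": 25, "act": 42}}  # LD OOZ outs are NA in the table

def plus_minus(lines):
    """Sum of actual minus expected outs across hit-ball types."""
    return sum(v["act"] - v["exp"] for v in lines.values())

print(plus_minus(in_zone_outs), plus_minus(ooz_outs))  # prints "12 17"
```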
<br />
One might wonder whether Beltran’s +17 OOZ was due to ball hogging.  I haven’t done a full study, but a preliminary look at the data seems to indicate that outfielders are pretty astute at judging the range ability of their fellow outfielders and positioning themselves so that they have about an equal chance of reaching hit balls between them.<br />
<br />
The second observation from this data is that there is not a lot of room to show ability by catching more than the average number of line drives.  But a speedy centerfielder can save a lot of runs by turning line drives into singles rather than doubles or triples.  Going by the XBASES columns above, Beltran has 11 fewer LD extra base hits than expected and seven fewer FB extra base hits.  Let’s look at how this all translates into runs, using our three types of run methodologies:<br />
<br />
<table width="100" cellspacing="2" cellpadding="2" border="2"><tr><td align="center" colspan="2">GENERIC_RUNS</td><td align="center" colspan="2">LW_RUNS</td><td align="center" colspan="2">RVA_RUNS</td><td>LW_RUNS/150</td><td>RVA_RUNS/150</td></tr><tr><td align="right">FB</td><td align="right">LD</td><td align="right">FB</td><td align="right">LD</td><td align="right">FB</td><td align="right">LD</td><td align="right">Total</td><td align="right">Total</td></tr><tr><td align="right">(-26)</td><td align="right">(-1.6)</td><td align="right">(-12.7)</td><td align="right">(-4.0)</td><td align="right">(-9.4)</td><td align="right">(-5.4)</td><td align="right">(-16.3)</td><td align="right">(-14.5)</td></tr></table><br />
<br />
Beltran by any measure had a very good year.  Oh, minus values are good and positive values bad.  Just the way I think about runs on defense.  Minus means a run saved.  I know it’s weird but get used to it because that’s what you are going to see in the leaderboards and spreadsheets in the next article.<br />
<br />
Anyway, linear weights has Beltran saving 16.7 runs in 2008.  Not quite as much as a generic approach at 27.6 and a little more than Run Value Added at 14.8.  Translated into a rate stat, linear weights is 16.3 runs saved per 150 games, where a game is defined as the average number of chances per game for a center fielder, 4.2.  Defining games by chances makes more sense than using innings, since what we really want to know is how many runs Beltran will save per opportunity.<br />
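<br />
The rate conversion can be sketched as follows.  It assumes that a player's chances are the sum of the FB and LD entries in the CHANCES columns of the table; under that assumption the arithmetic lands on the per-150 figures shown above.<br />
<br />
```python
CHANCES_PER_GAME_CF = 4.2  # average chances per game for a center fielder

def runs_per_150(runs: float, chances: int,
                 chances_per_game: float = CHANCES_PER_GAME_CF) -> float:
    """Scale a run total to 150 chance-defined games."""
    return runs / chances * chances_per_game * 150

beltran_chances = 397 + 248  # FB + LD chances from the table (an assumption)
print(round(runs_per_150(-16.7, beltran_chances), 1))  # LW runs, prints -16.3
print(round(runs_per_150(-14.8, beltran_chances), 1))  # RVA runs, prints -14.5
```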
<br />
Just for fun, let’s compare these results with UZR and Dewan's plus/minus.  UZR has Beltran saving seven runs in 2008.  Interestingly, UZR shows 377 expected outs and 418 putouts.  I have him at 385 expected outs with 414 plays made.  My slightly lower plays-made total is due to my ignoring line drive outs OOZ and also ignoring plays for which Gameday has no hit location data. How MGL interprets +41 plays made over expected as being worth only seven runs will have to be explained by him.<br />
<br />
Plus/Minus gives Beltran 404 expected outs and +14 plays made.   This is adjusted to +24 in the "enhanced" model and 14 runs saved (not including runs saved with his arm).<br />
<br />
Since the methodology is different for infielders, let’s look at one.  Here are <a href="http://www.hardballtimes.com/thtstats/main/player/5209/alex-gordon" class="player">Alex Gordon</a>’s 2008 numbers at third base: <br />
                                               <br />
<table width="100" cellspacing="1" cellpadding="2" BORDER="1"><tr><td ALIGN="CENTER">HB</td><td ALIGN="CENTER">PP</td><td ALIGN="CENTER">CHANCE</td><td colspan="2" ALIGN="CENTER">OUTS</td><td colspan="2" ALIGN="CENTER">OOZ</td><td colspan="2" ALIGN="CENTER">I_1B</td><td colspan="2" ALIGN="CENTER">SSS</td><td colspan="3" ALIGN="CENTER">O_1B</td><td colspan="2" ALIGN="CENTER">2B</td><td colspan="2" ALIGN="CENTER">ERR</td><td ALIGN="CENTER">DP_OPP</td><td colspan="2" ALIGN="CENTER">DP</td></tr><tr><td ALIGN="RIGHT">T</td><td ALIGN="RIGHT">T</td><td ALIGN="RIGHT">T</td><td ALIGN="RIGHT">E</td><td ALIGN="RIGHT">A</td><td ALIGN="RIGHT">E</td><td ALIGN="RIGHT">A</td><td ALIGN="RIGHT">E</td><td ALIGN="RIGHT">A</td><td ALIGN="RIGHT">E</td><td ALIGN="RIGHT">A</td><td ALIGN="RIGHT">E</td><td ALIGN="RIGHT">AC</td><td ALIGN="RIGHT">AV</td><td ALIGN="RIGHT">E</td><td ALIGN="RIGHT">A</td><td ALIGN="RIGHT">E</td><td ALIGN="RIGHT">A</td><td ALIGN="RIGHT">T</td><td ALIGN="RIGHT">E</td><td ALIGN="RIGHT">A</td></tr><tr><td ALIGN="RIGHT">310</td><td ALIGN="RIGHT">5</td><td ALIGN="RIGHT">305</td><td ALIGN="RIGHT">170</td><td ALIGN="RIGHT">174</td><td ALIGN="RIGHT">29</td><td ALIGN="RIGHT">31</td><td ALIGN="RIGHT">18</td><td ALIGN="RIGHT">27</td><td ALIGN="RIGHT">11</td><td ALIGN="RIGHT">6</td><td ALIGN="RIGHT">65</td><td ALIGN="RIGHT">62</td><td ALIGN="RIGHT">57</td><td ALIGN="RIGHT">17</td><td ALIGN="RIGHT">4</td><td ALIGN="RIGHT">11</td><td ALIGN="RIGHT">13</td><td ALIGN="RIGHT">45</td><td ALIGN="RIGHT">19</td><td ALIGN="RIGHT">16</td></tr></table><br />
<br />
HB is the number of ground balls hit in the third base zone (40 degrees to 65.6 degrees for right-handed batters, and to 64.7 degrees for left-handed batters).  PP is the number of plays made by the pitcher in that zone (five is a little above average).  Third basemen ranged from 0 PP for <a href="http://www.hardballtimes.com/thtstats/main/player/9368/evan-longoria" class="player">Evan Longoria</a> in slightly fewer games to 12 for <a href="http://www.hardballtimes.com/thtstats/main/player/719/casey-blake" class="player">Casey Blake</a> in fewer games but more HB.  OUTS are non-double play outs.<br />
<br />
Gordon does a little better than expected, with four more outs and two more outs OOZ.  For a third baseman, OOZ outs means that he was moving to his left into the shortstop's zone.  Combined with the higher-than-usual number of pitcher plays, this might be an indication that batters were hitting balls off his pitchers with less than usual speed off the bat.  Or he might be cheating a little towards short.  Or he might just be better at going to his left.<br />
<br />
I_1B are infield singles.  His much-higher-than-usual number could mean that he got to more hit balls, preventing them from going to the outfield.  Or it could mean that he was slower getting the ball out of his glove and/or not getting as much on the throw to first.  SSS is shortstop saves.  It is the number of balls in the 3B zone that get by the third baseman and are fielded by the SS for an out.  The lower than average number means he wasn’t getting much help from <a href="http://www.minorleaguesplits.com/cgi-bin/pl.cgi?pl=449107" class="player" target="new">Mike Aviles</a> and the other KC shortstops.<br />
<br />
O_1B is the number of outfield singles.  Sixty-five is the expected number, 62 was Gordon’s actual number and 57 the projected actual number if Gordon had had an average ability shortstop playing next to him.  2B is the number of ground ball doubles that were picked up in the 3B zone.  Apparently, Gordon wasn’t cheating toward the shortstop to get his OOZ outs.  Allowing only four actual doubles for the number of chances he had is extraordinary.  His differential of 13 was the best in the league.  Only <a href="http://www.hardballtimes.com/thtstats/main/player/1605/bill-hall" class="player">Bill Hall</a>, with 11, came close.  Finally, Gordon’s 16 DPs in 45 opportunities was three fewer than expected.<br />
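<br />
The differentials discussed above can be tabulated directly from the Gordon line.  The sketch below assumes that the CHANCE column is simply HB minus PP, which matches the table (310 - 5 = 305); the variable names are hypothetical.<br />
<br />
```python
# (expected, actual) pairs read from the Gordon table above.
gordon = {
    "OUTS": (170, 174),
    "OOZ": (29, 31),
    "I_1B": (18, 27),
    "SSS": (11, 6),
    "O_1B": (65, 62),
    "2B": (17, 4),
    "ERR": (11, 13),
    "DP": (19, 16),
}

hit_balls, pitcher_plays = 310, 5
chances = hit_balls - pitcher_plays  # matches the CHANCE column (305)

# Actual minus expected for every column.
differentials = {col: actual - expected for col, (expected, actual) in gordon.items()}
for col, diff in differentials.items():
    print(f"{col:5s} {diff:+d}")
```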
<br />
Gordon was good in 2008, and projects to be a better-than-average third baseman in the future.  However, all of his good stats and all of the areas where he underperformed could be partially explained if his pitchers were allowing hit balls that were hit with less speed than average.  With slower-hit balls he is able to field more of them, allowing fewer to get by him to either be fielded by the SS or turn into outfield singles or doubles.  But he also has less time to make the throw to first or to second for a double play so he appears to underperform in those areas.  We anxiously await HITf/x to provide more answers.<br />
<br />
With all of this Gordon managed a LW/150 of -7.8 (remember minus is good), 12th in the league behind <a href="http://www.hardballtimes.com/main/stats/players/index.php?lastName=rolen" class="player">Scott Rolen’s</a> leading -28.7.  But remember those five extra outs Gordon would have had if he had an average SS playing next to him.  Those plays are charged against him in his run total, so his projection would be even better.<br />
<br />
UZR has Gordon at a UZR/150 of 4.8 runs worse than average.  Dewan has him at -9 GB plays, worse than average, and a rank of 27th.  So there is plenty of disagreement here.  Both UZR and Dewan use pitcher quality and/or the speed of the ball to make adjustments to the raw data; I don’t do either.  Perhaps they have concluded that Gordon was getting really, really easy ground balls to field.  Time will tell.<br />
<br />
I hope you have seen the possibilities that having detailed fielding numbers can provide for analysis.  The only thing left for today is to give my fielding metric a name.  Since its main feature is the use of supersize zones I am going to call it “Big Zone Metric” or BZM.  Part 3, which will run next Tuesday, will take a look at the best BZMs for 2008 at each position.]]>

</description>
      <dc:creator>Peter Jensen</dc:creator>
      <dc:date>2009-03-12T05:20:15+00:00</dc:date>

    </item>

    <item>
      <title>Using Gameday to build a fielding metric (Part 1)</title>
       
<link>http://www.hardballtimes.com/main/article/using&#45;gameday&#45;to&#45;build&#45;a&#45;fielding&#45;metric&#45;part&#45;1/</link>
<guid>http://www.hardballtimes.com/main/article/using-gameday-to-build-a-fielding-metric-part-1/#When:05:01:15</guid>       
<description><![CDATA[Ever since the publication of Joseph Adler’s <a href="http://www.amazon.com/gp/product/0596009429?ie=UTF8&tag=thehartim-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=0596009429">Baseball Hacks</a> in 2006 made me aware of the existence of MLB Gameday’s hit location data, I have been excited about the possibility of using this data in baseball analysis.  <a href="http://www.retrosheet.org" target="new">Retrosheet</a> is a magnificent resource that remains the rock on which all amateur, but serious, baseball statistical research is based.  But one of the few things that Retrosheet lacks is reliable hit location data.<br />
<br />
Hit location information has been securely in the hands of the for-profit baseball data collectors, such as <a href="http://www.baseballinfosolutions.com" target="new">Baseball Info Solutions</a> and <a href="http://www.stats.com" target="new">STATS Inc.</a>  Getting access to it seemed beyond my means.  On the other hand, <a href="http://www.mlb.com/mlb/gameday/" target="new">MLB Gameday</a> seems like a realistic, viable, no-cost alternative.  But the process of getting good fielding data from Gameday has been much more difficult than I envisioned.<br />
<br />
MLB Gameday uses hit location data primarily to create the hit charts shown in the statistical information for batters.  It is collected by contract employees of MLB who sit in the press box with a laptop computer and enter the hit location with a cursor point directly onto an image of the park.  The field image is just 250 by 250 pixels.  The x and y pixels of the hit location are then stored in the Gameday XML files. Fractions of a pixel appear in the hit location files, apparently due to a translation from the coordinate input file to a slightly different output coordinate system.<br />
<br />
The information is used by Gameday for images and entertainment purposes and was never intended for analysis.  But the person inputting the data is instructed to make the information as accurate as possible, given the limitations of the system.  The hit location information also includes the identity of the hitter and pitcher, a brief description of the hit ball outcome (Single, Fly out or Error, for example), the inning, and notations for hit or out, and home or away batting team. <br />
<br />
To use the hit locations for fielding analysis, the raw data has to be downloaded and appended to an existing play-by-play database and the location in pixels converted to on-field X and Y coordinates in feet, and ultimately angle and distance format.  Although appending the data to Retrosheet poses its own set of problems, they are not insurmountable.  Today’s discussion is about converting the pixel information.<br />
<br />
There are two steps to the process: finding the exact location of home plate, and establishing a multiplier to convert pixels to feet.<br />
<br />
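The two steps above can be sketched in a few lines of Python.  This is only an illustration, not the code I actually used: the function name is hypothetical, it assumes image coordinates with y growing downward (consistent with home plate sitting near y = 196 in a 250-pixel-tall image), and the angle convention (0 degrees straight out toward center field, negative toward the left-field side) is an arbitrary choice.<br />
<br />
```python
import math


def pixels_to_field(x_px, y_px, home_x, home_y, multiplier):
    """Convert a Gameday pixel location to (distance_ft, angle_deg).

    Assumption: y grows downward in the image, so a ball hit toward
    center field has a smaller y value than home plate.
    """
    dx = x_px - home_x   # positive toward the right-field side
    dy = home_y - y_px   # positive moving away from home plate
    distance_ft = multiplier * math.hypot(dx, dy)
    # Angle measured from the home-plate-to-center-field line.
    angle_deg = math.degrees(math.atan2(dx, dy))
    return distance_ft, angle_deg


# Example using the 2005-2007 ANA factors (125.5, 196.4, 2.70):
# a point 100 pixels straight out from home plate is roughly 270 feet away.
distance, angle = pixels_to_field(125.5, 96.4, 125.5, 196.4, 2.70)
print(round(distance, 1), round(angle, 1))
```
<br />
With equal horizontal and vertical pixel offsets the same function returns an angle of about 45 degrees, which under this convention would be the right-field line.<br />
<br />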
When I first began this process for <a href="http://www.hardballtimes.com/main/article/is-seeing-believing/" target="new">an article on observational data</a> that I wrote last year for the Hardball Times, I had assumed that home plate location and distance multiplier would be the same for each park.  I was wrong.<br />
<br />
For Gameday’s purposes, the data only has to be keyed to their own park image.  Since they scale and locate the image to maximize the field area within the 250-by-250 pixel box, the home plate location and distance multipliers are different for each field; markedly different on a few fields, but not exactly the same on any of them.<br />
<br />
One way to resolve these differences would be to have pixel maps of all the fields, and to actually pinpoint the exact pixel locations of the back corner of home plate and the foul poles.  But when I explored this possibility with Cory Schwartz at MLB.com I discovered another difficulty.  Some of the maps were changed during the winter between the 2007 and 2008 seasons to eliminate the largest inconsistencies between the fields.  It was great that MLB was trying to improve the accuracy of its data gathering, but for anyone who wanted to use multiple-year hit location data, it meant that there were potentially 60 data collections that needed to be adjusted instead of 30.<br />
<br />
Rather than pursuing the graphical method of adjusting the data, I opted for an alternative method.  I began with the assumption that certain classes of hit balls would have similar distribution patterns in all the parks over the course of a season.  I could also impose some physical constraints on the data.  For instance, ground balls fielded for outs by the infield would have to be located between the foul lines.  So would almost all the home runs.  Line drives and most ground balls fielded by the pitcher would have to be fielded closer than 60 feet to home plate.<br />
<br />
I also used Greg Rybarczyk’s <a href="http://www.hittrackeronline.com" target="new">HitTracker</a> estimates of home run distance and angle to act as a reality check.  With this conceptual framework I set out to normalize the data between fields for the two sets of data: 2005-2007 and 2008. <br />
<br />
In the past I had used the solver function of Excel for similar problems, but the number of variables for this problem exceeded its capacity, so I proceeded by hand in my existing Access database.  Because of this, there is no guarantee that the numbers I ended up with are the very best possible.  And, of course, there is no good way of checking how accurate my initial assumption of uniform hit ball distribution was.<br />
<br />
But the results met most of my physical constraints.  I actually ended up normalizing using only the non-bunt ground ball outs to infielders.  I excluded outfield information because the different fence distances in different parks caused outfield caught-ball average locations to vary between the parks.  I did check the outfield caught-ball locations for each park, and they varied in a manner consistent with their outfield fence distances.<br />
<br />
The data normalized nicely.  The average angles of hit-ball outs for each park could be brought within 1.5 degrees of the league average at each position in almost all cases.  The ground ball infield distances were almost always within 2 feet of the league average.<br />
<br />
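I made these adjustments by hand, but one step of the process lends itself to a simple calculation.  The sketch below is a hypothetical illustration, not my actual procedure: it assumes home plate has already been located, and it uses the fact that every computed distance scales linearly with the multiplier.  The function name and the sample numbers are made up for the example.<br />
<br />
```python
from statistics import mean


def rescale_multiplier(multiplier, park_distances_ft, league_mean_ft):
    """Return a multiplier that moves a park's mean distance onto the league mean.

    With home plate fixed, each distance is multiplier * pixel_distance,
    so scaling the multiplier by league_mean / park_mean scales the
    park's average distance by the same ratio.
    """
    park_mean = mean(park_distances_ft)
    return multiplier * league_mean_ft / park_mean


# Hypothetical infield ground-out distances (feet) for one park,
# computed with a trial multiplier of 2.50.  Their mean is 125 feet.
park_outs = [118.0, 126.0, 131.0, 125.0]
new_multiplier = rescale_multiplier(2.50, park_outs, league_mean_ft=130.0)
print(round(new_multiplier, 2))
```
<br />
In practice the home plate location and the multiplier interact, so a step like this would have to be repeated after each home plate adjustment until both the angles and the distances settle.<br />
<br />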
There were, however, two problems that emerged during the normalization process.  I had assumed from my conversation with Cory Schwartz that the only change in data collection was the redrawing of the fields for 2008, so I normalized all the pre-2008 data together.  But when I double-checked the individual yearly totals, it was apparent that something was wrong with the data for Coors Field for 2007.  For some reason the home plate location was drastically off.  Consequently I re-normalized the Coors 2007 data and individual numbers are given for Coors 2007 in the table.<br />
<br />
The second problem involved the outfield distances.  They were very different between pre-2008 and 2008.  When the calculated data for infield ground outs was normalized to within a foot or so, the 2008 outfield out distances were consistently much longer than those for pre-2008.  This was true for each field in each park.<br />
<br />
The only explanation that I could determine was that when the fields were redrawn they were not drawn with a consistent scale between outfield and infield.  So a single distance factor multiplier that was correct for infield distances would be off when applied to the outfield.  This wasn't a problem for my use of the data in a fielding metric, but anyone attempting to use the outfield data for other purposes would have to establish separate multipliers for 2008 to normalize the outfield data to pre-2008.      <br />
 <br />
Given the inherent inaccuracies of human observation of hit-ball locations and the recently reported greater-than-expected differences between the hit ball locations reported by STATS and BIS, I believe the normalized MLB data to be competitive with them for some purposes.  The next article will discuss the limitations of the data and present a framework for using the data to construct a fielding metric.  Below are the home plate locations and distance multipliers that I am using for each field for both pre-2008 and 2008 hit location data. <br />
<br />
<h3 class="article_title">2005-2007 MLB HIT LOCATION FACTORS</h3><pre>TEAM              HOME-PLATE-X      HOME-PLATE-Y        DISTANCE-MULTIPLIER
ANA                    125.5             196.4                       2.70
ARI                    125.5             196.5                       2.55
ATL                    125.5             196.5                       2.55
BAL                    125.7             211.0                       2.52
BOS                    125.7             196.0                       2.75
CHA                    125.5             197.3                       2.70
CHN                    126.0             196.0                       2.73
CIN                    126.0             196.1                       2.74
CLE                    125.2             196.0                       2.75
COL2005-6              124.5             194.4                       2.77
COL2007                119.0             195.5                       2.62
DET                    125.9             198.7                       2.70
FLO                    125.8             197.0                       2.72
HOU                    125.2             196.2                       2.80
KCA                    125.5             197.3                       2.71
LAN                    125.8             195.8                       2.70
MIL                    126.4             194.9                       2.70
MIN                    125.1             196.2                       2.65
NYA                    125.7             195.2                       2.80
NYN                    124.6             195.4                       2.83
OAK                    126.1             197.2                       2.60
PHI                    126.4             198.8                       2.62
PIT                    125.2             197.5                       2.58
SDN                    125.4             197.3                       2.70
SEA                    125.3             197.2                       2.76
SFN                    125.7             195.0                       2.64
SLN                    126.0             197.6                       2.70
TBA                    125.4             198.0                       2.65
TEX                    126.5             195.4                       2.75
TOR                    126.2             197.0                       2.68
WAS                    126.8             197.5                       2.64</pre><br />
<br />
<h3 class="article_title">2008 MLB HIT LOCATION FACTORS</h3><pre>TEAM              HOME-PLATE-X      HOME-PLATE-Y          DISTANCE-MULTIPLIER
ANA                    125.5             198.8                       2.78
ARI                    125.1             201.5                       2.37
ATL                    126.8             201.3                       2.40
BAL                    125.9             201.5                       2.65
BOS                    124.6             200.4                       2.65
CHA                    125.0             200.2                       2.62
CHN                    125.4             201.1                       2.58
CIN                    126.3             200.8                       2.64
CLE                    125.5             202.4                       2.66
COL                    124.1             199.7                       2.71
DET                    125.5             201.0                       2.71
FLO                    124.5             200.1                       2.66
HOU                    125.2             201.6                       2.68
KCA                    124.6             195.1                       2.86
LAN                    125.7             199.1                       2.77
MIL                    125.1             198.1                       2.69
MIN                    125.2             197.7                       2.72
NYA                    125.7             197.4                       2.85
NYN                    125.3             197.1                       2.95
OAK                    125.5             200.4                       2.61
PHI                    125.5             200.5                       2.71
PIT                    125.3             202.3                       2.60
SDN                    126.2             199.4                       2.63
SEA                    125.8             199.8                       2.82
SFN                    125.8             197.9                       2.75
SLN                    125.7             195.4                       2.81
TBA                    123.5             199.4                       2.61
TEX                    125.5             199.8                       2.70
TOR                    126.7             197.0                       2.83
WAS                    125.1             200.5                       2.64</pre>Those of you who have explored the Gameday XML files know that hit locations are also given for the minor leagues.  Normalizing that data so that it could be incorporated into a play-by-play database for minor league players could potentially improve our projections of their future performances in the majors.  A process such as the one I used to normalize the major league data by park, using only infield ground ball outs, could certainly be used for minor league data as well.]]>

</description>
      <dc:creator>Peter Jensen</dc:creator>
      <dc:date>2009-03-10T05:01:15+00:00</dc:date>

    </item>

    <item>
      <title>Is seeing believing?</title>
       
<link>http://www.hardballtimes.com/main/article/is&#45;seeing&#45;believing/</link>
<guid>http://www.hardballtimes.com/main/article/is-seeing-believing/#When:04:02:15</guid>       
<description><![CDATA[A month ago, Greg Rybarczyk, the developer of the HitTracker website, published <a href="http://www.hardballtimes.com/main/article/seeing-is-believing/" target="new">Seeing is Believing</a> here at THT and made a convincing argument that the future of sabermetrics is in observational analysis.  His article contained his usual meticulous research and insightful observations.  I hope I am summarizing his argument correctly: that traditional stats fail to give a complete picture of what is happening on the field and need to be supplemented with additional data that can be gathered only by first-hand observation of the games.  His definition of observation included both human observation and observation through the innovative use of technology such as the SportVision PITCHf/x system.<br />
<br />
Two sentences in Greg’s article particularly intrigued me, as he uses them for the crux of his later arguments.  In the section called "Limitations of Existing Systems," Greg states: "The landing point of a fly ball can vary enormously due to the effects of wind, temperature, and altitude, so by itself, HITf/x will never be able to predict the landing point of a fly ball with any greater precision than we get today with the conceptual 'defensive zones.'"<br />
<br />
In the following sentence he says we'll have a similar problem providing useful information on ground balls.  My first thought on reading these sentences was "Do we have any idea what the current level of precision is on locating hit balls?"  It seems this issue has never been investigated.  So I investigated it.<br />
<br />
Two commercial concerns, <a href="http://www.baseballinfosolutions.com/" target="new">Baseball Info Solutions</a> (BIS) and <a href="http://www.stats.com/" target="new">STATS Inc.</a>, track hit balls as part of their voluminous baseball data-gathering on behalf of various clients.  STATS told me that their hit ball data is gathered by a contract employee in the press box, and cross checked with a second employee scoring the game from video.  Any discrepancies are resolved by going back to the video or adding data from other employees who were gathering the data from the game for special projects.  As far as I could determine, BIS has a similar system of cross checks to ensure the accuracy demanded by its paying clients.<br />
<br />
Rybarczyk tracks the landing location of all home runs at his <a href="http://www.hittrackeronline.com" target="new">HitTracker web site</a>.  Last year, he began tracking some in-play balls as well.  His database for the hit balls of <a href="http://www.hardballtimes.com/main/stats/players/index.php?lastName=hunter" class="player">Torii Hunter</a> and <a href="http://www.hardballtimes.com/main/stats/players/index.php?lastName=jones" class="player">Andruw Jones</a> was the basis for his article “Of Home Runs and Free Agents” in the <i>2008 Hardball Times Annual</i>.  He uses careful observation from commercial video and detailed models of each stadium to ensure the accuracy that is the basis of his Website’s reputation.<br />
<br />
These three data sources give the distance to the nearest foot.  Greg and BIS give the angle to the nearest degree.  STATS places each hit in a zone of approximately 4 degrees.  For computational purposes, I gave each hit in a zone the angular measurement of the center of the zone.  <br />
<br />
MLB’s Gameday also tracks hit balls for use in its Gameday graphics and hit ball charts.   It, too, uses a contract employee in the press box to gather the raw data.  But the end use of graphical representation for entertainment purposes doesn’t require the same system of cross checks for accuracy.  Cory Schwartz, MLB.com’s director of stats, says that the contract employee’s training includes methods and emphasis on obtaining the best data possible, but Schwartz realizes the limitations of having only a single uncorroborated source.<br />
<br />
The data is recorded in an XY coordinate system because it integrates with the graphics, but this is also a source of error during the translation to an angle-distance format based on feet and degrees.  For hit balls that are actual hits, the system records where a ball is picked up by a player rather than where it hits the ground.  Even though this differs from other hit ball location systems, this was a conscious decision on what would be the most accurate graphical representation of the play for Gameday viewers.  <br />
<br />
I already had the Gameday hit locations for the last three years integrated with my <a href="http://www.retrosheet.org" target="new">Retrosheet</a> database.  Greg provided his spreadsheet for the hit locations for Hunter and Jones.  STATS graciously provided the same data for research purposes.  I paid a nominal fee for the BIS data.  I now had four independent observation sources for the locations of the 947 hit balls of those two players.<br />
<br />
Well, not quite.  Because home runs are problematic and because of Gameday’s unique method of recording the locations of hits, I decided to limit the data to balls in the field of play that were fielded for outs.  This still left 568 data points that could be compared from each source.<br />
<br />
Let me emphasize that none of the analysis that follows will help us know the actual landing locations of any hit ball or which of the four sources has the most accurate data.  We will never know precise hit locations until we have chips in the ball or triangulated data from cameras covering the whole field.<br />
<br />
What this analysis can give us is an idea of the confidence we can have in human observation as a source for hit ball locations.  If independent trained observers are in close agreement, then we can have more confidence in their observations.  The greater the difference in their observations, the less confidence we have.  With that caveat, we have table 1.<br />
<pre>Table 1. Hit Ball Location Standard Deviation Between Sources
First Source    Second Source    Distance SD (feet)    Vector SD (degrees)
BIS             GREG             10.29                 2.22
BIS             STATS            10.97                 2.94
GREG            STATS            11.95                 2.67
MLB             BIS              12.72                 3.04
MLB             GREG             13.37                 3.12
MLB             STATS            14.10                 3.64</pre><br />
<br />
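For readers who want to run the same kind of comparison on their own data, here is a minimal sketch of one way to compute a between-source standard deviation.  It assumes the figure is the standard deviation of the ball-by-ball differences between two sources, which is one plausible reading of the table rather than a documented formula, and it uses the population form (statistics.pstdev) as an arbitrary choice.  The sample values are invented for illustration.<br />
<br />
```python
from statistics import pstdev


def difference_sd(source_a, source_b):
    """Standard deviation of paired differences between two observers.

    Both lists must describe the same hit balls in the same order.
    The same function works for distances (feet) or angles (degrees).
    """
    if len(source_a) != len(source_b):
        raise ValueError("sources must cover the same hit balls")
    diffs = [a - b for a, b in zip(source_a, source_b)]
    return pstdev(diffs)


# Invented distances (feet) for four hit balls as seen by two observers.
observer_1 = [300.0, 250.0, 410.0, 120.0]
observer_2 = [290.0, 262.0, 405.0, 131.0]
print(round(difference_sd(observer_1, observer_2), 2))
```
<br />
Because the mean difference is subtracted out, a constant offset between two observers does not inflate this number; only their ball-to-ball scatter does.<br />
<br />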
Depending on your expectations and point of view, these could be good numbers or bad numbers.  That the two observers who are closest in agreement can’t place the ball within 10 feet of each other more than 68 percent of the time may be a little discouraging to some.<br />
<br />
That BIS and STATS would disagree on what zone a ball is in more than 32 percent of the time might make some fielding analysts pause.  They will pause even longer when they investigate further and find out that the actual zone agreement is only 46.4 percent.  However, that Gameday is not further off the mark on this subset of data might inspire analysts to find new uses for its hit location data.<br />
<br />
Many of you may just find this data confusing.  What if we asked the question a different way?  What if we guessed the most likely spot for the actual hit ball location and asked, "What is the minimum in distance and degrees that has the two best observers within that error range at least 95 percent of the time?"  Isn’t that what we really want to know for the basis of a fielding metric?<br />
<br />
Let’s take the two observers in closest agreement, BIS and Greg, split the difference between them and call that the best guess of the actual hit location.  What is the minimum distance and degrees that will have 95 percent of both Greg’s and BIS’ observations included?  The answer is ±18 feet and ±4 degrees.  That’s a pretty big area.  It is two whole zones in width.<br />
<br />
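The calculation just described can be sketched as follows.  This is an illustrative reconstruction under stated assumptions, not the exact procedure behind the numbers above: the best guess for each ball is taken as the plain average of the observers' values, distances and angles are handled separately, and the function name and sample data are hypothetical.<br />
<br />
```python
import math


def coverage_half_width(observations, coverage=0.95):
    """Smallest +/- range around the best guess that holds `coverage` of observations.

    observations: one tuple per hit ball, holding one value per observer
    (all distances in feet, or all angles in degrees).
    """
    deviations = []
    for values in observations:
        best_guess = sum(values) / len(values)  # "split the difference"
        deviations.extend(abs(v - best_guess) for v in values)
    deviations.sort()
    # Number of observations that must fall inside the range.
    needed = math.ceil(coverage * len(deviations))
    return deviations[needed - 1]


# Invented distances (feet) for four hit balls from two observers.
balls = [(100.0, 110.0), (200.0, 204.0), (300.0, 300.0), (150.0, 170.0)]
print(coverage_half_width(balls))
```
<br />
The function accepts any number of observers per ball, so the same sketch can be rerun with a third value in each tuple.<br />
<br />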
Perhaps we can make an improvement by adding the third expert observer, STATS, and establishing the best guess of the actual hit location as the average of all three?  On The Book blog Tom Tango has suggested this “wisdom of the crowds” method of multiple observers recording the data.  Let’s try it and see what happens.  The answer is ±22 feet and ±6 degrees.  <br />
<br />
Why did the error get larger?  That’s not what the wisdom of the crowds would predict.  Actually, it’s entirely logical and points out the fallacy of the "wisdom of the crowds" theory.  Adding an additional observer to the best two will always make the error greater.  Adding an observer can only move the original average further away from one of the two "best" observers, increasing the total error.<br />
<br />
To make an improvement, the third observer’s data points would have to be closer to the original best guess of the actual hit location, but if they were, they would also be closer to one or the other of the original two "best" observers, which is a contradiction because we defined the original two "best" observers as the ones whose data was closest.  It doesn’t matter if you have three observers or 3,000, the composite data will never have any less error than that of the two closest.  Having many observers is only useful for finding those two best observers.  <br />
<br />
It turns out that ±18 feet and ±4 degrees is the best we can do for these four observers, and given the redundancy built into STATS and BIS, Greg’s thoroughness, and the high motivation for accuracy of all three sources, it probably is very close to the best we can expect for any human observers.  Whether HITf/x will be able to be more precise remains to be seen, since the system is not yet a reality.<br />
<br />
For those of you not familiar with the proposed HITf/x system, it would use the same hardware as the existing PITCHf/x system and basically the same software, but modified to track the outgoing hit ball instead of the incoming pitched ball.  It would provide a direct measure of the speed off the bat, the initial vertical angle of the hit ball, and the initial horizontal angle.  The landing location of the hit ball would have to be computed from these initial inputs since the existing cameras do not cover the entire field.<br />
<br />
It is the precision of this computed landing location that Greg was skeptical of in his article.  But since it is still in the conceptual stage and since SportVision has been unusually open to suggestions as to how it will be structured, we have an extraordinary opportunity to make HITf/x as good as it can be.  That means incorporating Greg’s suggestions for including wind, temperature and altitude factors.<br />
<br />
HITf/x also would have some additional benefits for fielding analysis.  Accurate speed off the bat data is useful for determining whether one pitcher’s hit balls are easier to field than another’s.  During this study I also found that there is not a consensus as to whether a hit ball is a fly, a liner or even a ground ball.  An objective definition of fly balls and line drives should be possible using the initial vertical angle and the speed off the bat as parameters.<br />
<br />
Given the lack of precision of our current data, HITf/x certainly deserves a chance.]]>

</description>
      <dc:creator>Peter Jensen</dc:creator>
      <dc:date>2008-04-01T04:02:15+00:00</dc:date>

    </item>


    </channel>
</rss>