During this year’s MIT Sloan Sports Business Conference, Rob Neyer told attendees that the evaluation of major league player hitting, pitching and fielding performance has been adequately addressed, and Bill James agreed with him. Rob could say this, and Bill could agree, because what had been the “holy grail” problem in player evaluation—fielding—has, at least in theory, a solution for contemporary players: the careful evaluation of multi-parameter zone data. What Rob didn’t mention was how useful the combination of different systems actually is.
When I say this mouthful, I’m referring to counts of all batted balls in play (BIP), whether they were caught and by whom, and precise or approximate measures of location (“zone” or “vector”), trajectory (grounder, line drive, fly ball, pop-up), speed, and other parameters designed to help measure how difficult it was to field each BIP.
Multi-parameter zone data (let’s just call it Zone Data) is proprietary, and designed to be sold to major league teams or used as the basis for consulting contracts with major league teams. Even if an ordinary fan were willing to buy it—at a cost of several thousand dollars or more—he would be prohibited from sharing any significant amount of it with others.
Since Zone Data cannot be independently audited, it is good that we now have two sources of Zone Data—Baseball Info Solutions (“BIS”) and STATS, Inc. (“STATS”), as well as several publicly reported ratings based on each data set.
When several ratings systems based on two different data sets agree on a fielder rating, we have much more reason to believe the rating. When they don’t, we have an opportunity to analyze how the differences come about, and perhaps learn how to use Zone Data better.
The four primary providers of ratings based on Zone Data are John Dewan (creator of the Plus/Minus system), Mitchel Lichtman (Ultimate Zone Rating, or UZR), David Pinto (Probabilistic Model of Range, or PMR) and Shane Jensen (Spatial Aggregate Fielding Evaluation, or SAFE). UZR is derived from STATS data; the other systems are based on BIS data.
What follows is a close analysis of results under all four rating systems in the outfield, along with results under Defensive Regression Analysis (DRA), the non-zone system I developed in 2003 using only traditional, publicly available pitching and fielding statistics.
What I found
There are larger differences between BIS and STATS in outfield rankings than in infield rankings. In particular, my study found that:
- Outfielder runs-saved/allowed ratings based on BIS data had a .60 to .70 correlation with ratings based on STATS data.
- Ratings based on BIS data had a .68 to .69 correlation with DRA.
- The correlations between ratings based on STATS and DRA were the weakest, ranging from .42 to .50 (see my note at the end of the article).
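For readers who want to run the same kind of comparison on their own numbers, here is a minimal sketch of the Pearson correlation between two sets of ratings for the same players. The function name and the ratings are made up for illustration; they are not BIS, STATS or DRA results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around each mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical runs-saved ratings for the same five outfielders
system_a = [12.0, -5.0, 3.0, -14.0, 8.0]
system_b = [9.0, -2.0, -1.0, -10.0, 15.0]
print(round(pearson_r(system_a, system_b), 2))
```

A value near 1 means the two systems rank and space the fielders almost identically; a value near 0 means they barely agree.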
What is the minimum standard for a satisfactory correlation between fielding systems? The ultimate answer is obviously, “It depends,” but here is a good starting point for the discussion:
“People talk about strong or weak correlations, but authors seem reluctant to clarify what they mean. I’m fearless, or foolish, so here (is) my rough (definition of) . . . strong correlation: r of 0.70 to 1.00…” Ken Ross, A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans 127 (Pearson Education 2004).
As will be shown below, a simplified, no-park-factors version of UZR (which uses STATS) achieves a .70 correlation with SAFE (which uses BIS data and does not include park factors).
Correlation measures the extent to which fielding systems agree on the relative value of fielders. The standard deviation in ratings measures the impact of good and bad fielding. All the systems other than SAFE show a standard deviation in runs-saved/allowed per 1,450 innings played of between 13 and 15, which is broadly consistent with studies published by Tom Tippett on his website. The standard deviation under SAFE is 10 runs-saved/allowed per 1,450 innings played, for reasons explained below.
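As a rough illustration of the common scale, the sketch below pro-rates each fielder's runs saved to 1,450 innings and then takes the standard deviation across fielders. It assumes simple linear pro-rating and a population standard deviation, which may not match exactly how each system's published figures were derived, and the input numbers are hypothetical.

```python
from statistics import pstdev

FULL_SEASON_INNINGS = 1450  # the common scale used in this article


def runs_per_1450(runs_saved, innings_played):
    """Pro-rate a runs-saved total to 1,450 innings (assumed linear scaling)."""
    if innings_played <= 0:
        raise ValueError("innings_played must be positive")
    return runs_saved * FULL_SEASON_INNINGS / innings_played


# Hypothetical (runs_saved, innings_played) pairs for four outfielders
raw = [(10.0, 1200.0), (-6.0, 900.0), (2.0, 1450.0), (-12.0, 1300.0)]
scaled = [runs_per_1450(runs, innings) for runs, innings in raw]

# Spread of the scaled ratings: how much fielding matters under this system
spread = pstdev(scaled)
print([round(s, 1) for s in scaled], round(spread, 1))
```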
I imagine most readers will be surprised that the match between the most carefully prepared BIS- and STATS-based ratings is not better. My guess is that the problem stems from the “stringers” who code the parameters for outfield BIP. These people must estimate distances (“Was that 270 feet or 300 feet?”) and angles (“Was that left-center or left-left-center?”), and the estimates are farther off when the ball is hit deeper in the field. In a nutshell, it can be hard to translate three-dimensional information (what happened on the baseball field) into a two-dimensional plot.
At the SABR convention, one presentation may have held the key to resolving this issue. Matt Thomas’ poster presentation featured a system that would use a simple matrix equation (based on projective geometry) to convert the (x, y) coordinates of where the ball appears to land or be caught on the two-dimensional television screen into exact (x1, y1) coordinates on the plane of the actual playing field extending out into space. I believe this approach could improve both STATS’ and BIS’ fielding statistics significantly.
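I do not have the details of Matt's system, but the general idea of a projective mapping between two planes can be sketched with a 3x3 matrix (a planar homography). The matrix values below are hypothetical; in practice such a matrix would be estimated from at least four reference points whose field positions are known, such as the bases or the foul poles.

```python
def apply_homography(H, x, y):
    """Map a screen point (x, y) to field-plane coordinates (x1, y1).

    H is a 3x3 matrix given as a list of rows. The point is treated as the
    homogeneous vector (x, y, 1), multiplied by H, then divided by the
    third component.
    """
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity (it lies on the horizon line)")
    return u / w, v / w


# Hypothetical matrix with a perspective term in the bottom row
H_example = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 1.0],
]
print(apply_homography(H_example, 2.0, 2.0))
```

The division by the third component is what lets a single matrix account for perspective, so that equal distances on the screen correspond to larger distances on the field the deeper the ball is hit.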
Until improvements are made in the collection of Zone Data, I believe it is probably best to look to a new rating system based on a 50/50 average of David Pinto’s PMR (again, from BIS) and Mitchel Lichtman’s UZR (STATS), where each is calculated using a simplified and consistent methodology. This average, which I’ll call “Simplified UZR-PMR Average” (“SUPA”, as in “Super” ratings based on Zone Data), yields ratings that have standard deviations consistent with Mitchel’s and Tom’s research, and appear to match well (and better than either rating alone) with detailed subjective evaluations reported in The Fielding Bible.
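The averaging step itself is simple. The sketch below assumes both inputs are already expressed in runs-saved/allowed per 1,450 innings and keeps only players rated by both systems; the player names and ratings are hypothetical.

```python
def supa(uzr_rating, pmr_rating):
    """50/50 average of a simplified UZR rating and a simplified PMR rating."""
    return 0.5 * uzr_rating + 0.5 * pmr_rating


# Hypothetical ratings, in runs saved per 1,450 innings
uzr = {"Player A": 10.0, "Player B": -8.0, "Player C": 3.0}
pmr = {"Player A": 4.0, "Player B": -2.0, "Player D": 6.0}

# Average only the players who appear in both data sets
supa_ratings = {
    name: supa(uzr[name], pmr[name])
    for name in uzr
    if name in pmr
}
print(supa_ratings)
```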
DRA has a .68 correlation and nearly exact standard deviation match with SUPA. SUPA and DRA outfielder ratings, along with relevant Fielding Bible commentary, will be presented in this article. For reasons detailed below, outfielder Plus/Minus ratings published in The Fielding Bible are incorrect. I believe that Shane Jensen’s SAFE system is being further refined, and it may well become the standard BIS-based rating system.
This part of the study can be reproduced and verified by other analysts within the licensing constraints for BIS and STATS data; that is, results (not underlying data) as reported by Shane Jensen, David Pinto and Mitchel Lichtman to me are disclosed below. (The only changes I make to the data elsewhere in the article are to put them on a common scale of runs-saved/allowed per 1,450 innings played; no data points are discarded.) In addition, the DRA formulas used to generate the outfielder ratings will also be disclosed tomorrow, so anyone with a spreadsheet program who is willing to copy and paste data from Retrosheet will be able to replicate the DRA numbers.
Results in the middle infield are not shown because they are uncontroversial. Plus/Minus (BIS), which is correctly calculated in the infield, and UZR (STATS) had a .81 correlation and very close standard deviation match. DRA had a .79 correlation and nearly identical standard deviation match with Plus/Minus. Once again, the match between DRA ratings and UZR was weaker (correlation about .75) than the match between DRA and the BIS-based rating (Plus/Minus). Nevertheless, I think all analysts can conclude that the problem of measuring the impact of middle infielders has been solved for all practical purposes.
Thus, outfield and middle infield DRA ratings have a combined correlation of .74 (and virtually exact standard deviation match) with Zone Data, if Plus/Minus is used in the infield and SUPA in the outfield.
At third base, the correlation between Plus/Minus and UZR was .65; between Plus/Minus and DRA only .50. Third base is difficult to evaluate, mainly because the “slice” of the playing field is so much narrower than for middle infielders and outfielders, so the sample of BIP for any major league third baseman over the course of a season is highly variable. (For you über-nerds, think of doing a Monte Carlo numerical integration of a very narrow interval of a function compared to an interval two or three times wider, and consider the relative rates of convergence.) Although I haven’t checked, I assume the same results apply at first base, for the same reason.
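The sampling point can be illustrated with a small simulation. It assumes, purely for simplicity, that each BIP independently lands in a fielder's slice with a fixed probability; the slice shares (10 percent and 30 percent) and the sample sizes are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def relative_se(p, n):
    """Theoretical relative standard error of a binomial count with n draws."""
    return sqrt((1 - p) / (n * p))


def simulate_counts(p, n, trials, rng):
    """Count how many of n random BIP land in the slice, repeated over trials."""
    return [sum(1 for _ in range(n) if rng.random() < p) for _ in range(trials)]


rng = random.Random(0)  # fixed seed so the run is repeatable
for label, share in (("narrow slice", 0.10), ("wide slice", 0.30)):
    counts = simulate_counts(share, 1000, 200, rng)
    observed = pstdev(counts) / mean(counts)
    print(label, round(observed, 3), round(relative_se(share, 1000), 3))
```

With the same number of BIP overall, the count in the narrow slice varies more in relative terms than the count in the wide slice, which is the sense in which a third baseman's seasonal sample is noisier.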
Another problem is that both Zone Data and DRA show smaller standard deviations in performance at third base, and so the inherent noise in fielding results may swamp the signal. It is my impression that major league teams aren’t as willing as they used to be to put a weak fielder at third base to get another bat in the lineup, so the true variance in fielding ability is smaller. Back in the ’70s and ’80s, for example, hitters such as Tony Perez, Bob Horner, Pedro Guerrero and Bobby Bonilla played third, making it easier for Mike Schmidt to dominate. I will be developing new DRA methodologies based on Retrosheet play-by-play data, and perhaps that will improve things at third and first.
Bottom line: If I were a general manager of a baseball team, I would purchase both BIS and STATS data, make sure that ratings under both systems were being calculated in at least approximately consistent ways, and take the average of the resulting ratings, particularly at the corners and in the outfield. DRA would serve as a very good “sanity check” for middle infielders and outfielders, and possibly could be adapted to evaluate minor league or foreign fielders.
References & Resources
It is unclear why I found such a low correlation between the STATS system and DRA. A similar test of STATS-based UZR and DRA outfielder ratings for 2001-03 resulted in an unedited correlation in the outfield of .71. A 1999-2001 test of DRA and UZR showed a .77 correlation (the write-up is a PDF file; see pages 34-36).