Extra plays made: Should we change our point of view?
by Max Marchi
August 05, 2011
Psst... Let me tell you this: You can't compare fielders' ratings with any of the existing defensive metrics.
How do you build a defensive metric?
Take a player, find out how many plays an average player would have made given the same chances he had, subtract those (expected) plays from the (actual) plays made by the guy you are evaluating, and you are done.
To my knowledge, that's how the best-known fielding metrics are built.
- Mitchell Lichtman's UZR: check.
- Sean Smith's TZR: check.
- Dan Fox's SFR: check.
- David Pinto's PMR: check.
- The Fielding Bible plus/minus: check.
(Note: Actually, most of the systems perform the extra step of converting plays made into runs saved.)
It seems a consensus has been reached on how a fielding evaluation system should look. Thus, we just have to wait for the data to get better and better for the defensive ratings to get more and more reliable.
The data are getting better. Baseball Info Solutions, with its crew of video scouts, yearly improves its database either by introducing more rigorous quality checks, or by adding new information, such as hang time, fielders' positioning, and so on. Meanwhile, Sportvision is trying to capture the position of the ball and the players at any moment during every ballgame, with its multiple camera system.
When presented with a sample of data from Sportvision FIELDf/x, Greg Rybarczyk immediately proposed a defensive metric based on those data. You can read about his True Defensive Range in The Hardball Times Baseball Annual 2011, in the article he co-authored with Kate McSurley ("An introduction to FIELDf/x"). However, he simply applied the basic steps outlined at the beginning of this article to what appears to be a very detailed and accurate database.
Indirect vs. direct standardization
Epidemiologists had been using the indirect standardization method for a long time when advanced baseball fielding metrics began making their appearances. Indirect standardization is a way of taking into account the characteristics of a population (its age structure, for example) when looking at the frequency of an event (let's say the mortality due to lung cancer).
It works like this: You take the mortality rates of a standard population (say the entire nation) and you apply the age structure of the population under scrutiny (say a county); this way you get the expected number of deaths, to be compared to the number of deaths that actually occurred.
Sounds familiar? Yeah, it's the same thing every major fielding metric does, except epidemiologists prefer dividing observed events by expected events, instead of subtracting them, as baseball analysts do.
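To make the recipe concrete, here is a minimal Python sketch of indirect standardization in its original epidemiological setting. Every figure in it (the age bands, the rates, the county population, the observed deaths) is made up for illustration; none comes from a real data set. The observed/expected ratio is what epidemiologists usually call the standardized mortality ratio.

```python
# Hypothetical figures, for illustration only.
# Mortality rates (deaths per person) in the standard population, by age band.
standard_rates = {"0-39": 0.0001, "40-64": 0.0010, "65+": 0.0050}
# Age structure of the population under scrutiny (say, a county).
county_population = {"0-39": 20_000, "40-64": 15_000, "65+": 5_000}
observed_deaths = 48

# Apply the standard rates to the county's own age structure.
expected_deaths = sum(
    county_population[age] * standard_rates[age] for age in county_population
)

ratio = observed_deaths / expected_deaths        # the epidemiologists' comparison
difference = observed_deaths - expected_deaths   # the baseball analysts' comparison

print(f"expected {expected_deaths:.1f}, ratio {ratio:.2f}, difference {difference:+.1f}")
```

Swap "age band" for "batted-ball bucket" and "deaths" for "plays made," and the `difference` line is the familiar plays-above-average figure.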
Epidemiologists also use an alternate method: direct standardization.
Applying direct standardization to fielding evaluations would consist of the following steps: Take the distribution of chances the average player faces, calculate the plays a given player would be expected to make on those chances based on his own success rates, and subtract the plays the average player makes on the same chances.
What's the difference between the two methods?
Indirect standardization (or what baseball analysts are currently using) assigns the same chances faced by the player under evaluation to the average fielder in order to obtain the expected plays. If we are measuring the defensive prowess of a shortstop who faced 150 grounders to his left, 250 straight at him, and 100 to his right, we would calculate the expected plays by assigning 150 grounders to the left of the average shortstop, 250 straight at him and 100 to his right.
Direct standardization assigns the same distribution of chances faced by the average player to the player under evaluation. Let's say the average shortstop has to deal with 200 grounders hit straight at him, 150 to his left and 150 to his right. Thus we would calculate the expected plays for our shortstop by assigning him 40 percent of balls straight at him and 30 percent both to his left and to his right.
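The two ways of assigning chances can be sketched in a few lines of Python. The counts are the ones from the two examples above; the function names are mine, and scaling the average distribution to the player's own total number of chances is an assumption I make for the sake of the illustration.

```python
# Chances from the two examples above.
player_chances = {"left": 150, "straight": 250, "right": 100}
average_chances = {"left": 150, "straight": 200, "right": 150}


def shares(chances):
    """Turn raw counts into the fraction of chances falling in each bucket."""
    total = sum(chances.values())
    return {bucket: count / total for bucket, count in chances.items()}


def indirect_chances(player, average):
    """Indirect: the average fielder is handed the player's own chances."""
    return dict(player)


def direct_chances(player, average):
    """Direct: the player is handed the average fielder's mix of chances.

    Assumption: the mix is scaled to the player's own total.
    """
    total = sum(player.values())
    return {bucket: total * share for bucket, share in shares(average).items()}


print(shares(average_chances))
print(indirect_chances(player_chances, average_chances))
print(direct_chances(player_chances, average_chances))
```

In the first case the league-average success rates get applied to the player's chances; in the second, the player's own success rates get applied to the league-average mix.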
Why have baseball analysts chosen the indirect standardization way en masse?
Pros and cons of the two methods
When you calculate the expected plays in the direct standardization method, you base your result on a very limited sample. In fact, in our example of a metric that simply divides the opportunities into three buckets (left, straight, right), we would use the percentage of balls the shortstop under scrutiny turns into outs for each bucket; thus the expected plays are based on a single-player sample.
On the other hand, indirect standardization uses information from every shortstop when calculating the expected plays. Epidemiology textbooks suggest using the indirect method for small populations, when the stratum-specific rates* are unreliable. (* In this case the stratum-specific rates would translate into out-conversion rates to the player's right, left and center.)
Thus we have gone the right way.
Except, epidemiology textbooks give the following warning: When you use the indirect method, you can compare a population to the standard population, but you can't compare two populations with each other. Translate it to baseball defense: You can compare a shortstop with the average shortstop, but you can't compare two shortstops with each other.
Wait! Am I saying that a plus-15 shortstop (a shortstop with 15 more plays made than expected) has not necessarily performed better than a plus-10 shortstop, even if they had exactly the same number of opportunities? (Please note the wording above: it's "performed better" rather than "is a better defensive player," because we don't want to enter the perilous terrain of trying to guess true talent on the basis of performance.)
Let's pretend we know exactly the true talent of two shortstops. Player A converts 90 percent of balls hit straight at him into outs, and 37 percent of both balls to his right and to his left. Player B's rates are 85 percent to the middle and 36 percent to both sides.
The average shortstop's rates are 80-35-35. Thus Player A is superior to Player B on every batted ball and both are above average.
Let's also assume they perform exactly according to their respective skills.
Player A faces 100 grounders straight at him, 100 to his right and 400 to his left. Do the math and you get 275 plays for him versus 255 for the average shortstop, a net of plus-20 plays.
Player B faces 400 grounders straight at him, 100 to his right and 100 to his left. With the necessary multiplications you get 412 successful plays for him, 390 for the average shortstop—a plus-22 for Player B.
Despite A being superior to B and both having faced 600 total chances, and having performed according to their skills, Player B is rated higher.
The only reason for this outcome is the different set of opportunities, something beyond the players' control.
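If you want to check the arithmetic, the sketch below reruns the example in Python. The success rates and chances are exactly the ones given above; the data layout and function names are just my choice for the illustration.

```python
# Success rates and chances from the example above.
AVERAGE_RATES = {"straight": 0.80, "right": 0.35, "left": 0.35}
PLAYERS = {
    "A": {
        "rates": {"straight": 0.90, "right": 0.37, "left": 0.37},
        "chances": {"straight": 100, "right": 100, "left": 400},
    },
    "B": {
        "rates": {"straight": 0.85, "right": 0.36, "left": 0.36},
        "chances": {"straight": 400, "right": 100, "left": 100},
    },
}


def plays(rates, chances):
    """Plays made when converting `chances` at the given per-bucket rates."""
    return sum(rates[bucket] * chances[bucket] for bucket in chances)


def indirect_rating(player):
    """Actual plays minus the average shortstop's plays on the SAME chances."""
    actual = plays(player["rates"], player["chances"])
    expected = plays(AVERAGE_RATES, player["chances"])
    return actual - expected


for name, player in PLAYERS.items():
    print(f"Player {name}: {round(indirect_rating(player)):+d}")
# prints:
# Player A: +20
# Player B: +22
```

Player A beats Player B in every single bucket of `rates`, yet he comes out two plays behind, purely because of what is in `chances`.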
Okay, I can hear you say: "If the pitchers playing with A allow an inordinate amount of balls to his left, he should position himself accordingly—that's part of his defensive duties as well!" Right. But you could substitute the left/middle/right buckets with something like hard/regular/soft hit or whatever you want (you can even play God and say easy/medium/difficult).
So, let me reiterate the issue. When using indirect standardization (i.e., when using any existing fielding metric), you are entitled to say that both Player A (+20 plays) and Player B (+22) performed better than the average shortstop, but there is no way you can infer Player B performed two plays better than Player A. (In fact, we saw that Player A actually performed better than Player B.)
What would happen with direct standardization?
The average shortstop, facing 600 balls evenly distributed among center, right and left, would record 300 successful plays. Player A, given the same distribution of 600 chances, would convert 328 of them, or plus-28 over the average shortstop. Player B, again with an equal set of opportunities, would record 314 outs, or plus-14.
With direct standardization, the real ranking emerges.
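The same check for direct standardization, again as a minimal Python sketch. The rates are those of the example; the standard set of chances is the evenly split 600 used in the paragraph above, and the names are mine.

```python
# Success rates from the example; 600 chances split evenly across buckets.
AVERAGE_RATES = {"straight": 0.80, "right": 0.35, "left": 0.35}
PLAYER_RATES = {
    "A": {"straight": 0.90, "right": 0.37, "left": 0.37},
    "B": {"straight": 0.85, "right": 0.36, "left": 0.36},
}
STANDARD_CHANCES = {"straight": 200, "right": 200, "left": 200}


def plays(rates, chances):
    """Plays made when converting `chances` at the given per-bucket rates."""
    return sum(rates[bucket] * chances[bucket] for bucket in chances)


def direct_rating(rates):
    """Player's plays on the STANDARD chances minus the average shortstop's."""
    return plays(rates, STANDARD_CHANCES) - plays(AVERAGE_RATES, STANDARD_CHANCES)


for name, rates in PLAYER_RATES.items():
    print(f"Player {name}: {round(direct_rating(rates)):+d}")
# prints:
# Player A: +28
# Player B: +14
```

Because every player is run through the same `STANDARD_CHANCES`, a player who is better in every bucket can never be rated lower.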
Should we make the switch?
We are in a conundrum. If we move to direct standardization, we need a reliable estimate of a single player's success rate on, for example, balls hit softly at an angle of 10-15 degrees. Chances are, for some buckets you have to rely on as few as one or two chances, or even none at all.
Actually this issue would be somewhat mitigated by smoothing techniques. In fact, the success rate of the above example bucket is surely correlated with the success rate on balls softly hit at either an angle of 5-10 degrees or 15-20 degrees, and also with the success rate on balls hit at an angle of 10-15 degrees with medium force. Nearly every opportunity faced by a player can contribute useful information for each considered bucket, with decreasing weight as the opportunities become more and more different. (Ideally, we would treat the data as continuous rather than artificially splitting opportunities into buckets.)
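One possible smoothing technique, offered only as an illustration and not as the way any existing metric works, is a kernel-weighted average: every chance a player has faced contributes to the estimate at a given angle, with a weight that shrinks as the chance gets further away. The Python sketch below uses a Gaussian kernel on the angle alone; the function name, the 5-degree bandwidth and the sample chances are all hypothetical, and a real system would also have to weigh in how hard the ball was hit.

```python
import math


def smoothed_success_rate(chances, target_angle, bandwidth=5.0):
    """Kernel-weighted out rate at `target_angle` (degrees) for one fielder.

    chances: iterable of (angle_in_degrees, was_out) pairs.
    bandwidth: kernel width in degrees (hypothetical default); larger
        values borrow more information from distant chances.
    """
    weighted_outs = 0.0
    total_weight = 0.0
    for angle, was_out in chances:
        # Gaussian weight: 1 at the target angle, fading with distance.
        weight = math.exp(-0.5 * ((angle - target_angle) / bandwidth) ** 2)
        weighted_outs += weight * was_out
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no usable chances to estimate from")
    return weighted_outs / total_weight


# Made-up chances: (angle, whether the play was made).
sample = [(3.0, True), (7.0, True), (12.0, False),
          (14.0, True), (18.0, False), (26.0, False)]
estimate = smoothed_success_rate(sample, target_angle=12.5)
print(f"smoothed out rate at 12.5 degrees: {estimate:.3f}")
```

Only two of the sample chances fall inside the 10-15 degree bucket, but the estimate still draws on all six, which is exactly the point of smoothing.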
However, maintaining the indirect standardization method exposes us to the risk of improperly ranking players, as I have shown with an example.
You may have noticed that in order to get the paradoxical result, the players in the example face two completely different distributions of chances. When the distributions are similar (as should be the case for fielding chances), the rankings resulting from an indirect standardization would not be too far from reality. But if the players face similar distributions of opportunities, no standardization is needed; i.e., the success/opportunity ratio is sufficient, and the labor of classifying batted balls by angle, velocity and so on is unnecessary.
I believe fielding metrics should shift to the direct standardization method when data become more objective, detailed and unbiased. Until then, indirect standardization is an improvement over no standardization at all when players face different sets of opportunities (but that's when improper ranking might come out).
Thus, when you look at fielding leader boards, keep the following in mind. If a player has a positive rating he has performed better than average; if a player has a negative rating he has performed worse than average. But there's no way you can tell who has performed better between two better-than-average (or worse-than-average) players.
References and Resources
I conceived this article a long time ago, while dealing with standardization methods at work. This post by Tom Tango, which elicited a lot of comments, also spurred the work.
After creating a baseball rendition of The Beatles' Sgt. Pepper cover, Max began his baseball writing because he needed an excuse to show the picture. He wrote for an Italian audience for six years before making the jump to The Hardball Times. You can contact him by e-mail.