Psst… Let me tell you this: You can’t compare fielders’ ratings with any of the existing defensive metrics.

How do you build a defensive metric?

Take a player, find out how many plays an average player would have made given the same chances he had, subtract those (expected) plays from the (actual) plays made by the guy you are evaluating, and you are done.

To my knowledge that’s how the best-known fielding metrics are built.

Mitchell Lichtman’s UZR: check.

Sean Smith’s TZR: check.

Dan Fox’s SFR: check.

David Pinto’s PMR: check.

The Fielding Bible plus/minus: check.

(Note: Actually, most of the systems perform the extra step of converting plays made into runs saved.)

It seems a consensus has been reached on how a fielding evaluation system should look. Thus, we just have to wait for the data to get better and better for the defensive ratings to get more and more reliable.

The data are getting better. Baseball Info Solutions, with its crew of video scouts, yearly improves its database either by introducing more rigorous quality checks, or by adding new information, such as hang time, fielders’ positioning, and so on. Meanwhile, Sportvision is trying to capture the position of the ball and the players at any moment during every ballgame, with its multiple camera system.

When presented with a sample of data from Sportvision FIELDf/x, Greg Rybarczyk immediately proposed a defensive metric based on those data. You can read about his True Defensive Range in The Hardball Times Baseball Annual 2011, in the article he co-authored with Kate McSurley (*An introduction to FIELDf/x*). However, he simply applied the basic steps outlined at the beginning of this article to what appears to be a very detailed and accurate database.

### Indirect vs. direct standardization

Epidemiologists had been using the **indirect standardization** method for a long time when advanced baseball fielding metrics began making their appearances. Indirect standardization is a way of taking into account the characteristics of a population (its age structure, for example) when looking at the frequency of an event (let’s say the mortality due to lung cancer).

It works like this: You take the mortality rates of a standard population (say the entire nation) and you apply the age structure of the population under scrutiny (say a county); this way you get the expected number of deaths, to be compared to the number of deaths that actually occurred.

Sounds familiar? Yeah, it’s the same thing every major fielding metric does, except epidemiologists prefer dividing observed events by expected events, instead of subtracting them, as baseball analysts do.
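To make the epidemiological version concrete, here is a minimal sketch of the calculation. Every number in it (age bands, rates, head counts, observed deaths) is invented for illustration and is not real mortality data; the function name `expected_events` is also just my label.

```python
# Hypothetical illustration of indirect standardization.
# All figures are invented for the example; they are not real mortality data.

# Death rates per person in the standard population (say, the entire nation).
standard_rates = {"under_40": 0.001, "40_to_64": 0.005, "65_plus": 0.030}

# Age structure (head counts) of the population under scrutiny (say, a county).
county_population = {"under_40": 50_000, "40_to_64": 30_000, "65_plus": 20_000}

# Deaths that actually occurred in the county (made-up figure).
observed_deaths = 900


def expected_events(rates, structure):
    """Apply the standard rates to the studied population's own structure."""
    return sum(rates[stratum] * structure[stratum] for stratum in structure)


expected_deaths = expected_events(standard_rates, county_population)

# Epidemiologists usually divide observed by expected ...
ratio = observed_deaths / expected_deaths
# ... while baseball analysts subtract expected from observed.
difference = observed_deaths - expected_deaths

print(f"expected={expected_deaths:.1f} ratio={ratio:.3f} difference={difference:+.1f}")
```

Swap "age band" for "batted-ball bucket" and "deaths" for "plays made" and you have the skeleton of a plus/minus system.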

Epidemiologists also use an alternate method: **direct standardization**.

Applying direct standardization to fielding evaluations would consist of the following steps: Take the distribution of chances the average player faces, calculate the expected plays a given player would make given those chances, subtract the expected plays from the actual plays.

What’s the difference between the two methods?

Indirect standardization (or what baseball analysts are currently using) assigns the same chances faced by the player under evaluation to the average fielder in order to obtain the expected plays. If we are measuring the defensive prowess of a shortstop who faced 150 grounders to his left, 250 straight at him, and 100 to his right, we would calculate the expected plays by assigning 150 grounders to the left of the average shortstop, 250 straight at him and 100 to his right.

Direct standardization assigns the same distribution of chances faced by the average player to the player under evaluation. Let’s say the average shortstop has to deal with 200 grounders hit straight at him, 150 to his left and 150 to his right. Thus we would calculate the expected plays for our shortstop by assigning him 40 percent of balls straight at him and 30 percent both to his left and to his right.
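The only thing that changes between the two methods is the set of chances you multiply the rates by. A small sketch using the shortstop numbers above; the helper `prorate` is a name I made up, and scaling the average distribution to the player's own total of chances is the convention assumed here.

```python
# Chances faced by the shortstop under evaluation (numbers from the text).
player_chances = {"left": 150, "straight": 250, "right": 100}
# Chances faced by the average shortstop (numbers from the text).
average_chances = {"left": 150, "straight": 200, "right": 150}


def prorate(distribution, total):
    """Rescale a distribution of chances so that it sums to `total`."""
    scale = total / sum(distribution.values())
    return {bucket: count * scale for bucket, count in distribution.items()}


total = sum(player_chances.values())

# Indirect: the average fielder is handed the player's own chances.
indirect_weights = dict(player_chances)

# Direct: the player is handed the average fielder's mix of chances
# (30% left, 40% straight, 30% right), scaled to the balls he actually faced.
direct_weights = prorate(average_chances, total)

print("indirect:", indirect_weights)
print("direct:  ", direct_weights)
```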

Why have baseball analysts chosen the indirect standardization way en masse?

### Pros and cons of the two methods

When you calculate the expected plays in the direct standardization method, you base your result on a very limited sample. In fact, in our example of a metric which simply divides the opportunities in three buckets (left, straight, right), we would use the percentage of balls the shortstop under scrutiny turns into outs for each bucket; thus the expected plays are based on a single player sample.

On the other hand, indirect standardization uses information from every shortstop when calculating the expected plays. Epidemiology textbooks suggest using the indirect method for small populations, when the stratum-specific rates* are unreliable. (* In this case the stratum-specific rates would translate into out-conversion rates to the player’s right, left and center.)

Thus we have gone the right way.

Except, epidemiology textbooks give the following warning: When you use the indirect method, you can compare a population to the standard population, but you can’t compare two populations with each other. Translate it to baseball defense: You can compare a shortstop with the average shortstop, but you can’t compare two shortstops with each other.

Wait! Am I saying that a plus-15 shortstop (a shortstop with 15 more plays made than expected) has not necessarily **performed better** than a plus-10 shortstop, even if they had exactly the same amount of opportunities? (Please note the bold above: it’s “performed better” rather than “is a better defensive player,” because we don’t want to enter the perilous terrain of trying to guess true talent on the basis of performance.)

### An example

Let’s pretend we know exactly the true talent of two shortstops. Player A converts 90 percent of balls hit straight at him into outs, and 37 percent of both balls to his right and to his left. Player B’s rates are 85 percent to the middle and 36 percent to both sides.

The average shortstop’s are 80-35-35. Thus Player A is superior to Player B on every batted ball and both are above average.

Let’s also assume they perform exactly according to their respective skills.

Player A faces 100 grounders straight at him, 100 to his right and 400 to his left. Do the math and you get 275 plays for him versus 255 for the average shortstop, a net of plus-20 plays.

Player B faces 400 grounders straight at him, 100 to his right and 100 to his left. With the necessary multiplications you get 412 successful plays for him, 390 for the average shortstop—a plus-22 for Player B.

Despite A being superior to B and both having faced 600 total chances, and having performed according to their skills, Player B is rated higher.

The only reason for this outcome is the different set of opportunities, something beyond the players’ control.
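If you want to check the arithmetic, here is a short sketch of the indirect calculation for the two players. The rates and chances are the ones stated above; the function names are mine.

```python
# Out-conversion rates per bucket, as stated in the example.
RATES = {
    "A": {"straight": 0.90, "right": 0.37, "left": 0.37},
    "B": {"straight": 0.85, "right": 0.36, "left": 0.36},
    "average": {"straight": 0.80, "right": 0.35, "left": 0.35},
}

# Grounders each player actually faced.
CHANCES = {
    "A": {"straight": 100, "right": 100, "left": 400},
    "B": {"straight": 400, "right": 100, "left": 100},
}


def plays(rates, chances):
    """Plays made when `rates` are applied to `chances`, bucket by bucket."""
    return sum(rates[bucket] * chances[bucket] for bucket in chances)


def indirect_plus_minus(player):
    """Actual plays minus the plays an average fielder makes on the same chances."""
    actual = plays(RATES[player], CHANCES[player])
    expected = plays(RATES["average"], CHANCES[player])
    return actual - expected


for name in ("A", "B"):
    print(name, round(indirect_plus_minus(name), 1))
```

Player B comes out ahead even though his rate is lower in every bucket, because 400 of his 600 chances fall in the easy bucket.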

Okay, I can hear you say: “If the pitchers playing with A allow an inordinate amount of balls to his left, he should position himself accordingly—that’s part of his defensive duties as well!” Right. But you could substitute the left/middle/right buckets with something like hard/regular/soft hit or whatever you want (you can even play God and say easy/medium/difficult).

So, let me reiterate the issue. When using indirect standardization (i.e., when using whatever existing fielding metric), you are entitled to say that both Player A (+20 plays) and Player B (+22) performed better than the average shortstop, but there is no way you can infer Player B performed two plays better than Player A. (In fact, we saw that Player A actually performed better than Player B).

What would happen with direct standardization?

The average shortstop, facing 600 balls evenly distributed among center, right and left, would record 300 successful plays. Player A, given the same distribution of 600 chances, would convert 328 of them, or plus-28 over the average shortstop. Player B, again with an equal set of opportunities, would record 314 outs, or plus-14.

With the direct standardization, the real ranking emerges.
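The same check for the direct method, as a sketch. The rates are those of the example; the evenly split standard distribution of 600 chances is the one used in the paragraph above, and the names are mine.

```python
# Out-conversion rates as (straight, right, left), from the example.
rates = {
    "A": (0.90, 0.37, 0.37),
    "B": (0.85, 0.36, 0.36),
    "average": (0.80, 0.35, 0.35),
}

# Standard distribution: 600 chances split evenly among the three buckets.
standard_chances = (200, 200, 200)


def plays_on_standard(player):
    """Plays the player would record if handed the standard set of chances."""
    return sum(rate * n for rate, n in zip(rates[player], standard_chances))


def direct_plus_minus(player):
    """Player's plays minus the average fielder's plays, both on standard chances."""
    return plays_on_standard(player) - plays_on_standard("average")


for name in ("A", "B"):
    print(name, round(direct_plus_minus(name), 1))
```

Because both players are judged on an identical set of chances, the player with the better rates in every bucket can no longer finish behind.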

### Should we make the switch?

We are in a conundrum. If we move to direct standardization, we need a reliable estimate of a single player’s success rate on, for example, balls hit softly at an angle of 10-15 degrees. Chances are, for some buckets you have to rely on as few as one or two chances, or even no chances at all.

Actually this issue would be somewhat mitigated by smoothing techniques. In fact, the success rate of the above example bucket is surely correlated with the success rate on balls softly hit at either an angle of 5-10 degrees or 15-20 degrees, and also with the success rate on balls hit at an angle of 10-15 degrees with medium force. Nearly every opportunity faced by a player can contribute useful information for each considered bucket, with decreasing weight as the opportunities become more and more different. (Ideally we would treat the data as continuous, rather than artificially splitting opportunities into buckets.)

However, maintaining the indirect standardization method exposes us to the risk of improperly ranking players, as I have shown with an example.

You may have noticed that in order to get the paradoxical result, the players in the example face two completely different distributions of chances. When the distributions are similar (as should be the case for fielding chances) the rankings resulting from an indirect standardization would not be too far from reality. But, if the players face similar distributions of opportunities no standardization is needed; i.e. the success/opportunity ratio is sufficient, and the labor of classifying batted balls by angle, velocity and so on, is unnecessary.

I believe fielding metrics should shift to the direct standardization method when data become more objective, detailed and unbiased. Until then the indirect standardization is an improvement over no standardization at all when players face different sets of opportunities (but that’s when improper ranking might come out).

Thus, when you look at fielding leader boards, keep the following in mind. If a player has a positive rating he has performed better than average; if a player has a negative rating he has performed worse than average. But there’s no way you can tell who has performed better between two better-than-average (or worse-than-average) players.

**References & Resources**

I conceived this article a long time ago, while dealing with standardization methods at work. This post by Tom Tango, which elicited a lot of comments, also spurred the work.

Sean Smith said...

“Does this have implications for comparing any non-fielding metrics?”

It certainly could. Player A has a 900 OPS vs lefties, and 700 vs righties. Player B is 875/675. Player A is the better hitter against either type of pitcher, right?

But if he plays full time, he’ll see righties about 70% of the time. If player B is platooned, he might see lefties 60%, righties 40%.

Player B will show you a 795 OPS, and player A only 760. People not paying attention will start a “free player B!” campaign. If successful, they will be very disappointed when he’s exposed in regular play.
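The arithmetic in this comment can be checked with a short sketch. It assumes, as the comment implicitly does, that overall OPS is simply the exposure-weighted average of the two split OPS figures, which is only an approximation; `blended_ops` is a made-up helper name.

```python
def blended_ops(ops_vs_lefties, ops_vs_righties, share_vs_lefties):
    """Exposure-weighted OPS (approximation: treats OPS as linear in exposure)."""
    share_vs_righties = 1 - share_vs_lefties
    return ops_vs_lefties * share_vs_lefties + ops_vs_righties * share_vs_righties


# Player A plays full time: about 70% of his appearances come against righties.
player_a = blended_ops(900, 700, share_vs_lefties=0.30)

# Player B is platooned: 60% lefties, 40% righties.
player_b = blended_ops(875, 675, share_vs_lefties=0.60)

# Player B "exposed in regular play": same mix of pitchers as Player A.
player_b_full_time = blended_ops(875, 675, share_vs_lefties=0.30)

print(round(player_a), round(player_b), round(player_b_full_time))
```

Given the same mix of pitchers, Player B drops below Player A, which is the direct-standardization view of the two hitters.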

Sean Smith said...

Come to think of it, this problem is much more likely to show up in hitting or pitching stats, as there is active selection of players for the roles they are best suited for.

If a fielder is relatively better going to his left than to his right, there is no way to leverage that.

Mike Fast said...

Sean,

“If a fielder is relatively better going to his left than to his right, there is no way to leverage that.”

Not even, for example, by playing him off the line at first base and putting Mark Ellis on his right?

Dave Studeman said...

Or Chase Utley?

BenJ said...

Max,

Great point, essentially highlighting Simpson’s Paradox in baseball. I’ve been bothered by the platoon splits issue for a while (as Sean mentions above), and I’m curious as to which projection systems might already take this into account.

It also carries over to pitchers, particularly relievers changing roles in the bullpen.

I’ve also done some (unpublished) work on this effect in defense, as you discuss here. It really depends what you’re trying to do with defensive metrics. Are you trying to value a player’s past contributions, or predict his potential value in a different setting going forward?

The anecdotal example I always think of is Nate McLouth, back in his 2008 Gold Glove season in Pittsburgh in which he rated dead last in our (BIS) Defensive Runs Saved. He played (and still plays) notoriously shallow, but he was behind a terrible pitching staff that gave up a lot of deep fly balls. Because McLouth played so shallow, these turned into doubles and triples. Behind Atlanta’s very good pitching staff, however, McLouth hasn’t rated nearly as poorly because fewer doubles and triples have gone over his head.

Greg Rybarczyk said...

Interesting stuff, Max. I think you make a good point.

I’ll also agree with Ben that it depends on what you’re trying to do, so perhaps the best thing would be to have the ability to look at the metrics both ways, and use whichever one is most appropriate. Indirect maintains the connection to the actual population of plays that the defender faced, and so might be better when trying to describe the actual performance and impact of that player. Direct might be better for comparisons…

Joe Arthur said...

Shane Jensen’s SAFE might employ the “direct standardization” approach.

One point you make is worth highlighting: “if the players face similar distributions of opportunities no standardization is needed; i.e. the success/opportunity ratio is sufficient, and the labor of classifying batted balls by angle, velocity and so on, is unnecessary.” Similar distributions would also obviate concerns about differences in positioning.

A question worth further research is how much variation in opportunity there is when fielders have a full season’s worth of opportunities.

jmr said...

I like this article. I’m confused by the example and I’m worried I’m missing the difference between indirect and direct.

First it says pretend we know the true talent level of the players. Then it presents a distribution of events that lead to the performance of B scoring better than the performance of A. Later it presents a different distribution of events that lead to A scoring better than B, and apparently we would choose that 2nd distribution because it represents the average over all players in the population.

First, couldn’t the average distribution still lead to the result that B > A? If it can’t happen, then why not? Or would that only happen if we were incorrectly interpreting the true talent levels and in fact B > A after all?

Finally, is the average distribution actually appropriate to use? The author mentioned positioning, which I think would be a big problem with this. Say B knows his range is much better to the right and cheats left, reducing his left chances and increasing his right. If you use an average distribution then that is canceled out, unless the buckets are based on distance from the fielder’s starting position. If it was based on distance from starting position that would be comparing apples and oranges to some extent.

Like I said, I think I’m just confused. I do see how you can’t compare B and A using the current methods, which was a surprise to me. Does this have implications for comparing any non-fielding metrics?

Tom M. Tango said...

There’s no question that the huge drawback in the direct method is prorating the observed success rate to a sample size that is greater than the number of observations.

You have a 100% success rate (1 for 1) in a zone where the average SS has 50 plays, and now you are going to prorate him to 50 for 50.

And if he was 0 for 1, then you are prorating him to 0 for 50.

The reality is that this data point is really useless to you. You apply Bayes, and the 1 for 1 probably gives you 28 for 50, with 1 SD = 15, and the 0 for 1 gives you 22 for 50, with 1 SD = 15.
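A rough sketch of the kind of shrinkage described here, using a Beta prior on the out rate. The comment does not say which prior is meant; the Beta(3.5, 3.5) below is purely an assumption, centered on 50% and chosen only because it lands near the rough figures quoted. The sketch covers the point estimate only, not the standard deviation.

```python
def shrunk_rate(successes, chances, prior_successes=3.5, prior_failures=3.5):
    """Posterior mean out rate under a Beta prior (prior values are assumed)."""
    return (successes + prior_successes) / (
        chances + prior_successes + prior_failures
    )


ZONE_SIZE = 50  # plays the average SS has in this zone, per the comment

# Naive pro-rating of the observed rate.
naive_one_for_one = (1 / 1) * ZONE_SIZE   # 50 for 50
naive_zero_for_one = (0 / 1) * ZONE_SIZE  # 0 for 50

# Pro-rating of the shrunk rate instead.
bayes_one_for_one = shrunk_rate(1, 1) * ZONE_SIZE
bayes_zero_for_one = shrunk_rate(0, 1) * ZONE_SIZE

print(naive_one_for_one, naive_zero_for_one)
print(round(bayes_one_for_one, 1), round(bayes_zero_for_one, 1))
```

With a single observation the estimate barely moves away from the prior mean, which is the sense in which the data point is nearly useless.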

Obviously, the larger the bins, and the more data, the less this problem exists. But the less this problem exists, then the less need to decide between indirect and direct.

You see this problem especially with WOWY, when you look at Cal Ripken.

Mike Humphreys said...

Shane Jensen’s SAFE accomplishes the goal of direct standardization and he uses smoothing.

On an even more basic level, a great shortstop behind a ground ball staff will be overrated under the indirect method. In particular, great shortstops who were such weak hitters they probably were allowed to play only for ground ball pitchers (Rey Sanchez) are overrated.

The problem with direct standardization is that it is more complicated to compute. It would be interesting to find out how much practical impact direct-v-indirect makes per season and over a career.

Mike said...

One thing I have always wondered with fielding stats is how defensive shifts work in the data. If a 3rd baseman is playing shortstop and gets a ball near 2b, can this give him a boost, since it is so far out of normal range? This is probably not a big deal in most cases, but in some divisions, like the AL East where every power-hitting lefty gets a shift, it could make players look better defensively. I admit I am naive on this, and it may already be resolved in the stats, just wondering if they are and how.

Brian Cartwright said...

A correction to my previous comment – by using 1. runner held at first (outs irrelevant) and 2. dp situation (outs considered) those 4 combinations cover all of the 24 base/out situations, so I end up with 8 bins instead of 48.

It’s been a while since I looked at Jensen’s SAFE. I recall that he uses smoothing, but thought that had been of the expected, not the observed values.

Tango mentions applying Bayes to each observed bin, which is basically regressing. A problem this brings up is that if each bin is regressed, then the bins summed, you end up regressing the final value as many times as you have bins.

With WOWY, I have used a method which again is de facto regression, that whenever the “without” portion comes back empty, I can replace it with league average rates.

Brian Cartwright said...

I’m scribbling notes on paper so I get an understanding of this…Max or anybody, please correct me if I am wrong here

Indirect = sum((obs_rate-exp_rate)*obs_opp)

Direct = sum((obs_rate-exp_rate)*exp_opp)

The difference is which amount of opportunities the difference in rates is multiplied by – the observed distribution or the expected.

One of the problems with having many bins is that some will have very few opportunities. The observed rate in some bins can be very unreliable (example 1 success in 1 chance). With the indirect method, that’s not a problem as the large difference in rates (observed minus expected) is attenuated by weighting it with the small number of opportunities it was based on. With the direct method, a rate based on a small n is multiplied by a possibly much larger expected n, which is usually a no-no, because it magnifies the unreliability of the small sample.

Measuring defense in real life, with a decent sample size the spread of opportunities between bins should not be as extreme as in the example here. As Sean noted above, it’s much more of a problem with platoon splits as there is a selection bias – a manager can choose to avoid some bins, but not so much on defense.

Another method is to avoid defining a multitude of bins. Find a balance between how detailed a breakdown is necessary to identify true differences in the expected outcomes of the opportunities presented each fielder, without getting so many small bins that biases and distortions start to appear.

For example, one part of the Oliver Fielding Runs published here at the Hardball Times Forecasts is measuring the range of infielders – the rate at which ground ball hits to the outfield are allowed by each infielder. Infielders are positioned differently based on the bases occupied and the number of outs. That, along with the handedness of the batter, can create large differences in expected hit rates. A first thought might be to bin the ground balls by each combination of outs and bases occupied (RE24). However, instead of using 48 bins (8 base states * 3 out states * 2 bat hands) I saw that the main factors for infield positioning were 1. Is the runner being held on 1b? and 2. Is it a dp situation? A true or false on each of those gives 4 combinations instead of the 8 base states, cutting the number of bins in half while retaining much of the information as to expected hit rates.

As for Nate McLouth, if he knew his pitchers on the Pirates allowed a lot of deep fly balls, why play so shallow?

Max Marchi said...

“There’s no question that the huge drawback in the direct method is prorating the observed success rate to a sample size that is greater than the number of observations.”

I’m not sure I have made the following clear:

If the average shortstop faces a 100-200-100 (L-C-R) distribution of batted balls and the shortstop under scrutiny faces a 20-10-10 distribution, we assign him a 10-20-10 distribution (to match his actual total of 40 opportunities), not a 100-200-100.

So, it’s not very likely he will have one chance pro-rated to 50.

(So Brian, yes, the formulas are correct, provided that exp_opp in Direct is pro-rated to the total number of batted balls faced by the player).

Yes, the great drawback of direct is that you put your money on several rates based on a small number of chances.

On the other hand, with the indirect you put your money on a single distribution of chances.

tangotiger said...

Max, if you have a small number of bins, then you won’t get the effect I’m describing. But if you have a larger number of smaller bins, then you could. Suppose a SS faces 300 balls in play, at this frequency, with the pro-rated league average in parens:

| Bin | OurSS (league prorated) |
|-----|-------------------------|
| A | 50 (20) |
| B | 80 (30) |
| C | 100 (100) |
| D | 40 (70) |
| E | 30 (80) |

If he happens to be great in Bin A, say he’s at 100% out rate, he’s only going to get n=20 for that, dropping his 50-50 down to 20-20.

If he happens to be terrible in Bin E, say he’s at 50% out rate, his 15-30 will get bumped up to 40-80.

The problem is that all these things are observations. And you cannot simply increase his observed rate up to the number of trials you need.
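The rescaling described in this comment can be sketched as follows. The bin counts are the ones listed above; the 100% and 50% out rates in bins A and E are the hypothetical values from the comment, and the names `stretch` and `rescale` are mine.

```python
# Balls in play per bin: what OurSS actually faced vs. the league mix
# pro-rated to his 300 balls in play (figures from the comment).
observed = {"A": 50, "B": 80, "C": 100, "D": 40, "E": 30}
league_prorated = {"A": 20, "B": 30, "C": 100, "D": 70, "E": 80}

# How much each bin's observed sample is stretched (>1) or shrunk (<1).
stretch = {b: league_prorated[b] / observed[b] for b in observed}

# Bins where the direct method credits more trials than were observed.
extrapolated = [b for b, factor in stretch.items() if factor > 1]


def rescale(outs, chances, target_chances):
    """Pro-rate an observed out rate to a different number of chances."""
    return outs / chances * target_chances


bin_a_outs = rescale(50, 50, league_prorated["A"])  # 50-50 becomes 20-20
bin_e_outs = rescale(15, 30, league_prorated["E"])  # 15-30 becomes 40-80

print(extrapolated, bin_a_outs, bin_e_outs)
```

Bins D and E are the ones where the observed rate is extended past the number of trials actually seen.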

The indirect method is simply answering a specific question: how many more plays did OurSS make than an average SS would have made? It’s a specific question with a specific answer.

The direct method is saying: IF OurSS had a league average frequency distribution of plays, how many outs WOULD he have made, had he continued to play as he did? But, this presupposes that his observed rate is actually some sort of true rate, that it comes with no random variation.

So, I reject the direct method out of hand. Not unless you use Bayes. And if you are going to use Bayes, then you have to include an uncertainty level in your estimate of the “extra” plays that you are giving him to fill up the bins where he is short.