There are a number of metrics that attempt to combine a player’s offensive and defensive value into one, all-around “uberstat.” At the risk of leaving out one or more worthy candidates, I’ll mention a few of the more prominent ones, just to give a survey of the field.
I am not trying to argue for one system in particular, or to get too far into the weeds on any one metric. Instead, what I want to do is go over the theory and principles behind a total player evaluation system, and present an overview of the substantial areas where such systems can disagree with each other.
I don’t claim to be nonpartisan – I’ll tell you upfront that I’m partial to Tango’s WAR system. I have strong opinions on most of these points, and if I weren’t confident in those opinions I probably shouldn’t write articles like this. I will endeavor to be fair, but fairness is not the same as being noncommittal. You’ve been forewarned.
In later articles, we’ll get down to some of the gritty technical details. For the time being, consider this a statement of principles as much as anything else. We’ll set some definitions, try to find some common agreement on some basic concepts, and lay out the underpinnings for what’s to follow – unless you lay the groundwork first, you run the risk of math for math’s sake, something that provides heat but no light. We’re looking for illumination.
Definition of value
One thing that all of our uberstat metrics have in common is that they attempt to measure a player’s value, generally in runs or wins, to his team. Thanks largely to the need for daily newspapers to justify paying their baseball writers in November, every year we are, ahem, treated to numbingly trivial arguments about the meaning of the word value. So before we begin, I should clarify what value means for the purposes of this discussion:
A player’s value is his contributions to his team based upon his on-field performance (hitting, running, fielding and pitching) in a neutral context.
I am not trying to claim that this is the only definition. I am not even trying to claim that this is the best definition – that is determined by the specific question you are trying to answer. I am simply laying out the definition of value that most total player metrics are intended and suited to answer.
Yes, this definition ignores things like leadership and character – and it would be horrifying to watch a statistical measure of performance try to capture those things! That is not to say that these things don’t matter, simply that they’re not readily quantifiable.
It’s important to note that we’re interested in a neutral context, and to explore what that means. First, we want to measure a player’s performance independent of his teammates’ – a player is no better on a good team and no worse on a bad team. (If you care to argue that a player’s contributions are more valuable if they’re in support of a pennant or a playoff spot, that’s an entirely separate question.)
We also want to isolate a player from his environment. A poor pitcher is not suddenly a better pitcher if he pitches in Petco, and a poor hitter is not a better hitter if he hits in Coors Field. It is true that their raw stats – ERA, RBI, OPS, and so forth – will look better, but that doesn’t make them any more valuable, because their opponents benefit as well. A run is simply more valuable in Petco and less valuable in Coors.
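To make that last point concrete, here’s a rough sketch using the Pythagorean win expectation (with the simple exponent of two). The run totals are illustrative, not actual park figures; the point is only that an extra run buys more wins in a low-scoring environment than in a high-scoring one:

```python
def pythag_wins(runs_scored, runs_allowed, games=162):
    """Pythagorean win expectation, using the simple exponent of 2."""
    rs2, ra2 = runs_scored ** 2, runs_allowed ** 2
    return games * rs2 / (rs2 + ra2)

def wins_per_extra_run(season_runs):
    """Marginal wins from scoring one more run, for a .500 team
    in a given run environment (season_runs scored and allowed)."""
    return pythag_wins(season_runs + 1, season_runs) - pythag_wins(season_runs, season_runs)

# Illustrative environments: a Petco-like season vs. a Coors-like one
print(wins_per_extra_run(650))  # low-scoring: each run is worth more wins
print(wins_per_extra_run(850))  # high-scoring: each run is worth fewer wins
```

The exact numbers depend on the exponent and the run totals you pick, but the direction of the effect does not.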
A note on accuracy, bias and sample size
Because baseball is a team sport, it’s not always obvious how to split credit between players (although the official scorers do try so hard). In order to do so, sabermetricians have to build models of how a baseball team works – how an offense scores runs, how a defense prevents them. We use these models to try to isolate an individual player’s performance. Alfred Korzybski, a Polish scientist and philosopher, once remarked, "A map is not the territory." By the same token, the models that we construct of baseball are simply that, models. This does not make them useless or pointless, as some would have you believe. But it does mean that when using them, we need to bear in mind their limitations. It helps to keep in mind some of the ways in which a model can be limited:
- The data itself. Sometimes there are simply mistakes – transcription errors and the like. Some things are based on borderline judgments – is that a hit or an error? A fly ball or a line drive? A ball or a strike?
- There can also be important information left out of the data that has to be inferred, or simply ignored – where was the shortstop positioned? Did the coach have the hit-and-run on?
- Constructing a model without an understanding of the basic principles involved – it’s inherent to the nature of baseball that a double is more valuable than a sacrifice fly, but a linear regression is going to have a real hard time figuring that one out.
- Factors that the model ignores – opponent quality, platoon advantage, and so on.
- Failing to account for subtle differences between players – a park will have a different impact on the home run rates of Barry Bonds and Juan Pierre, for instance.
A common way to test a model is to look at its accuracy – how well do its results match up with the observed reality? There are several ways to test accuracy; commonly you can look at how consistent a measure is year-to-year for players, or you can look at how well the model fits a team’s runs scored/allowed.
The most common measure of accuracy among sabermetricians is correlation – a measure of how closely two variables are related. Correlation works best for two measures that use different units. For measures that use the same units – runs, for instance – you are better off using a measure of average error, such as mean absolute error or root mean square error.
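As a minimal sketch of those two error measures – the team run totals here are made up for illustration:

```python
import math

# Hypothetical team runs: model estimates vs. actuals (illustrative numbers)
estimated = [712, 689, 745, 801, 668]
actual    = [698, 702, 760, 793, 655]

errors = [e - a for e, a in zip(estimated, actual)]

# Mean absolute error: average size of the miss, in runs
mae = sum(abs(x) for x in errors) / len(errors)

# Root mean square error: like MAE, but penalizes large misses more
rmse = math.sqrt(sum(x * x for x in errors) / len(errors))

print(f"MAE: {mae:.1f} runs, RMSE: {rmse:.1f} runs")
```

Because both are expressed in runs, they answer the question directly: on average, how far off is the model?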
Accuracy is desirable, but it does not come without its costs; typically, added accuracy comes at the cost of added complexity. For several reasons – whether it’s ease of explaining, or ease of implementing, or concerns of overadjusting – there may be a threshold of complexity one is willing to accept. This is fine, so long as you understand the tradeoff you’ve made – if you have two players that differ by only 2-3 runs on offense, for instance, it makes little sense to claim that one player is provably better than the other.
The other thing to note about accuracy is that the difference in accuracy typically washes out once you achieve a large enough sample size – in a single game it’s possible a player gets robbed of ball four by a bad call, but over a whole season those sorts of things tend to come out in the wash. This is not true if you have a biased measure. I’d like to make this clear – you can live with a less accurate measurement, so long as you understand the tradeoffs. This isn’t necessarily true of a biased model – if your model underestimates the value of a walk, then no matter how many games you have, you will underrate a hitter with a high walk rate.
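A toy simulation can illustrate the distinction. The run values here are invented for illustration: the "biased" model assigns the walk a lower value than its true worth, while the "noisy" model gets it right on average but with random measurement error. The noise averages out as the sample grows; the bias never does:

```python
import random

random.seed(42)

TRUE_WALK_VALUE = 0.33    # assumed true run value of a walk (illustrative)
BIASED_WALK_VALUE = 0.20  # a model that undervalues the walk

def season_error(n_walks, noise_sd, walk_value):
    """Average per-walk valuation error over a sample of n_walks,
    with Gaussian measurement noise added to each observation."""
    errors = [(walk_value + random.gauss(0, noise_sd)) - TRUE_WALK_VALUE
              for _ in range(n_walks)]
    return sum(errors) / len(errors)

for n in (50, 500, 5000):
    noisy = season_error(n, noise_sd=0.5, walk_value=TRUE_WALK_VALUE)
    biased = season_error(n, noise_sd=0.0, walk_value=BIASED_WALK_VALUE)
    print(f"n={n:5d}  noisy-but-unbiased error: {noisy:+.3f}  "
          f"biased error: {biased:+.3f}")
```

No matter how large n gets, the biased model misses by the same 0.13 runs per walk, so the high-walk hitter stays underrated forever.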
Value versus True-Talent Level
It should be noted that the sampling concerns above are only presented in the interest of accuracy in measuring performance itself, not in measuring the underlying ability of the player. It is possible for us to have a high level of confidence about measuring a player’s value without having a similar level of confidence in measuring his ability.
For instance, prior to the 2008 season, Ryan Ludwick had a career batting line of .251/.319/.446. Then, in 2008, he batted .299/.375/.591, well above what you’d expect from him. Given his career batting line, we have reason to suspect that he was an "overperformer" in 2008. But for the sake of constructing a player value stat, we really don’t care whether or not Ludwick is really as good as his 2008 batting line would suggest. Whether or not he lucked into some more base hits than he perhaps "should" have, those base hits contributed to real runs and wins for his team. And that’s what we’re trying to measure here.
Setting the baseline
A value metric absolutely has to have a baseline; you can end up with one by accident or on purpose, but you will have one. So given that, it makes sense to put some thought into what baseline you’re using and why. The most common baselines are:
- "Absolute value," or value above zero. Win Shares is the most popular absolute value metric – every win is accounted for in the system.
- Replacement level, commonly defined as value above "freely available talent," or those players who can be had by any club for the league minimum. (Note that I said any club. Evan Longoria may have been paid the league minimum, but the Rays weren’t ready to give him away. Remember that assets given away in trade are a cost to the team, even if they don’t go towards a player’s salary.)
These are not the only baselines available, of course – here at the Hardball Times we have Win Shares Above Bench, which compares a player to the production of the average bench player. You could also, for instance, compare a player to the production of the average starter. There are probably other permutations as well.
It should be noted that – for the most part – the choice of a baseline is a matter of presentation, rather than of actual fact, because the actual meaning of a stat requires you to know the amount of playing time as well. Once you combine the measure of value with the rate of playing time, you can easily convert between any baselines you like. So why should we care about baselines at all?
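As a sketch of that conversion – using an illustrative gap of 20 runs per 600 plate appearances between an average player and a replacement-level one, a figure each system derives for itself:

```python
# Illustrative gap between average and replacement; real systems derive this
RUNS_AVG_MINUS_REPL_PER_600PA = 20.0

def above_average_to_above_replacement(runs_above_average, pa):
    """Convert runs above average to runs above replacement.
    The conversion needs playing time: the same +5 runs above average
    is worth more over 600 PA than over 200 PA."""
    return runs_above_average + RUNS_AVG_MINUS_REPL_PER_600PA * pa / 600.0

# A +5-run hitter over a full season vs. a +5-run part-timer:
print(above_average_to_above_replacement(5, 600))  # 25.0
print(above_average_to_above_replacement(5, 200))  # ~11.7
```

The two hitters look identical against the average baseline, but the full-timer provided those above-average runs over far more playing time that would otherwise have gone to a lesser player – which is exactly the information the baseline-plus-playing-time combination captures.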
Baselines matter because we aren’t interested in a player by himself – there is no such thing as a baseball player in isolation. We want to know about a player’s contributions in the context of a team – the marginal contribution of that player relative to who else could have been playing instead. If you’ve ever spent a lot of time on baseball forums and message boards, you’ll often see an idea prefixed with, “What could it hurt to do this?” It’s the opportunity cost – playing time is a (mostly) fixed commodity, and playing time given to one player cannot be given to another.
Viewed that way, then, the “absolute zero” baseline measures a player against the notion of simply playing nobody, or failing that, playing somebody like, well, you and me. (Unless you, the reader, happen to be a professional-caliber baseball player.) To be frank, I’m uncertain that this is very useful – a nonpitcher who bats .151 in a full season has done more at the plate than I would have, but it’s a stretch to say that he contributed any value to his team in doing so.
On the other hand, comparing a player to the average annoys some people, because they point out – correctly – that a below-average player can still have some value to his team. The baseline isn’t saying he doesn’t, of course – he simply has less value than the average baseball player.
The common compromise is the replacement level – it’s not as high as the average baseline, so that half of all players don’t end up in the negatives, but it’s not so low as to become meaningless like the absolute baseline. You can look at replacement level as being in the spirit of the Mendoza Line – it’s the point at which the opportunity cost of even having that player on the roster is greater than the value he could possibly provide to the team. If a player is below the replacement level for too long, he’ll be out of a job, because the team will pick up some minor leaguer off the scrapheap that can probably do a better job.
It should be noted that replacement level is the most difficult to pin down, as the definition of replacement is open to interpretation. The key thing to note here is that you cannot necessarily compare figures between systems that both claim to use the replacement level baseline. This doesn’t make the replacement baseline meaningless – it’s an abstraction, to be sure, but a useful one nonetheless.
You’ve obviously made some sort of mistake – Player X’s value is all wrong!
Baseball fans can be very protective of their home players from time to time – or they can be very, very cruel. This has much to do with how good this player is (or at least, is thought to be) and how well the team has played recently. Fans also like to think that their long, patient devotion to their ballclub has been rewarded with an incalculable amount of knowledge about the players they follow, not readily accessible to outsiders.
Some of them are rather, ahem, aggressive in reminding you of this.
I’m a firm believer in testing, and showing proof. If anyone shows you a model for player performance, he should be showing you the evidence that his model works. If he isn’t, you should be very careful in accepting his conclusions – even if you otherwise find him trustworthy.
But simply pointing to a single player’s value and claiming, “That can’t be right” – why can’t it? If there’s absolutely no way that the model can be correct and you wrong, why are you bothering to look at the model’s conclusions in the first place? If our perceptions were perfect, we wouldn’t need to build these models in the first place.
So be skeptical, and ask questions, but please – come prepared to engage the evidence, not to simply dismiss it when it doesn’t match your preconceptions. If you are convinced the model is wrong, then say so – but proffer a reason as to why the model may be inaccurate (or biased). And ask yourself why you’re so certain that it’s the model that’s wrong, not you.
In Part II we’ll examine how to appropriately evaluate a player’s on-field contributions, and in Part III we’ll examine how a player’s value translates into his salary.
References & Resources
Tom Tango explains WAR.
For further reading on pretty much anything, Patriot’s website is probably the single best resource I’ve ever encountered.