Last week, we left off with a bit of a cliffhanger. Hopefully we can resolve those issues now.
But first, I want to clarify the reason for doing this. There are a lot of different rate metrics of offense available out there, such as EqA, wOBA, and OPS+. (There are also quite a few measures of runs above average and replacement in the world.) The idea is to figure out which of them best measures a player’s contribution on offense. Now, onwards.
What’s wrong with RMSE?
I don’t mean to pick on root mean squared error—I actually like RMSE, in situations that call for it. I don’t mean to pick on correlation or mean absolute error either. They’re valuable tools in the toolbox that we use to evaluate our results. But each tool has its limits, and we need to be careful about not using a tool past those limits.
Let’s check on an updated version of the chart from last week. There have been a few changes: We’re now looking at all games (rather than half-innings) from 1954 to 2008, plus the 1953 NL. (This is commonly referred to as the “Retro-era”.) I have tweaked the way a few of the run estimators are applied, based upon feedback and some additional bug-checking I’ve done. The biggest change is in the way I adjusted the run results. The formula to convert rates to runs is typically something along the lines of:
(2*(Rate/LgRate) - 1) * PA * R_PA
Where Rate is the stat under consideration, LgRate is the league average of that stat, and R_PA is the average runs per plate appearance. In other words, the formula first expresses the player’s production relative to average and then converts it to total runs.
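As a minimal sketch, the conversion can be written as a small Python function. The function name and the sample inputs (a .330 league rate, 650 PA, 0.12 runs per PA) are hypothetical and only there to show the arithmetic.

```python
def rate_to_runs(rate, lg_rate, pa, r_pa):
    """Apply (2 * (Rate / LgRate) - 1) * PA * R_PA."""
    relative = 2 * (rate / lg_rate) - 1   # production relative to average
    return relative * pa * r_pa           # scaled to total runs


# Hypothetical inputs, not taken from the study:
average_hitter = rate_to_runs(0.330, 0.330, 650, 0.12)
better_hitter = rate_to_runs(0.363, 0.330, 650, 0.12)
print(round(average_hitter, 1), round(better_hitter, 1))
```

A hitter exactly at the league rate gets PA * R_PA runs, and each 1 percent above the league rate adds 2 percent to that total.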
Scaling by plate appearances makes sense for a metric evaluating a position player, because whenever a player avoids making an out, he contributes additional runs to his team by creating more plate appearances. At the team level, though (and this is true whether you look at games, innings or seasons), a team is limited to a set number of outs regardless, and so outs are the correct measure of playing time. So, for these purposes, we will be looking at runs per out, not runs per plate appearance, from now on. And, the table:
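|      | RC   | BsR  | EqR  | OPS  | OPS+ | GPA  | TA   | wOBA | Reg  | House |
|------|------|------|------|------|------|------|------|------|------|-------|
| R    | 0.86 | 0.87 | 0.85 | 0.84 | 0.84 | 0.84 | 0.85 | 0.87 | 0.86 | 0.87  |
| MAE  | 1.86 | 1.78 | 1.90 | 1.93 | 1.92 | 1.91 | 1.94 | 1.79 | 1.82 | 1.79  |
| RMSE | 2.37 | 2.25 | 2.46 | 2.46 | 2.45 | 2.46 | 2.49 | 2.30 | 2.30 | 2.27  |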
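For readers who want to run this kind of test themselves, here is a rough sketch of the three measures in the table (correlation, mean absolute error and root mean squared error) in plain Python. The five games of actual and estimated runs are made up for illustration; they are not from the study.

```python
import math


def pearson_r(actual, predicted):
    """Correlation coefficient between actual and estimated runs."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: like MAE, but big misses count for more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up runs per game, for illustration only.
actual_runs = [3, 5, 0, 7, 4]
estimated_runs = [4.0, 4.0, 2.0, 5.0, 4.0]
print(pearson_r(actual_runs, estimated_runs),
      mae(actual_runs, estimated_runs),
      rmse(actual_runs, estimated_runs))
```

Because RMSE squares each miss before averaging, it is never smaller than MAE on the same data, which is why the RMSE row sits above the MAE row for every estimator.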

Again, we see little to differentiate one run estimator from the next. Going by MAE, we see a difference of only .16 runs per game between the best and the worst run estimators. This difference tends to become even smaller at the team-season level, because the spread of run environments becomes smaller. (BaseRuns, the undisputed king of the inning-level test, appears far more ordinary at the game level for that very same reason: as the variation in runs scored becomes smaller, the opportunities for a run estimator to be more accurate grow smaller.)
Is there anything to be gained by such small gains in RMSE? Or is it fair to say that any measure of offense is good enough for the job, so long as it’s reasonably well designed?
In short, no.
If we look at how these metrics are designed, we can see a very real difference of opinion between them on important matters. A metric like Total Average, for instance, considers the walk to be as valuable as a single. Any of our metrics based upon OBP and SLG, on the other hand, treat the walk as only about half as valuable as the single. Is this a big deal?
When we aggregate at the team level, this usually isn’t a big deal at all. Let’s consider the walk, including (for right now) the hit by pitch and the intentional walk. In 2008, for instance, if we look at the Red Sox, who walked the most, and the Royals, who walked the least, we only find a difference of about 26 walks per 650 plate appearances. So long as the overall construction of the metric is generally sound, this is not going to significantly impact the overall RMSE or correlation coefficient at the team level.
But what about individual players? Looking at qualified starters in 2008, we have Jack Cust as the absolute walkingest player there was and Yuniesky Betancourt as the least. Over the course of 650 plate appearances, there’s a difference of 104 walks between the two players.
The problem with our tests is that we are validating at the team level and then applying these values to individual players. But there is simply not enough variation between teams at the seasonal, game or inning level to truly test the differences between hitters.
Two of every sort
What we cannot do is simply test every run estimator against individual players, as we would like to do. After all, if we could figure out how many runs a player contributes without run estimators, we wouldn’t need to conduct any of this testing. So we need to come at the problem from a different angle. One idea is to look at how each run estimator values each event, relative to how much each event is worth in runs.
So how can we estimate how well each run estimator handles, say, the walk? One way is by using matched pairs. What we do is look at a pair of games which had exactly the same number of all events (such as singles, doubles and home runs) except for one game having an additional walk. Then we look at the average number of additional runs that score in our “plus one” games, as well as the number of additional runs that our estimators say should have scored. This allows us to measure the accuracy of all of our run estimators at the individual event level.
(Why games instead of innings, as we were working with last week? It turns out that run scoring tends to cluster: you cannot score half runs or quarter runs, so in order to make everything balance out the majority of innings are scoreless, roughly three-quarters in fact. When you do matched pairs, you end up with nearly 99 percent scoreless innings. Matching at the game level provides a more natural distribution of run scoring.)
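The matched-pairs idea can be sketched roughly as follows. This is not the code behind the study. It assumes each game is a dict of event counts plus runs scored ("R"), matches only on the six events listed in `EVENTS`, and weights each comparison by the smaller of the two group sizes; all of those choices are assumptions made for illustration.

```python
from collections import defaultdict

# Events used for matching (an assumption; a fuller study might match on more).
EVENTS = ("1B", "2B", "3B", "HR", "BB", "HBP")


def plus_one_value(games, event):
    """Average extra runs in games with exactly one more `event`
    than an otherwise identical game. Returns None if no pairs exist."""
    # Group runs by the counts of every other event, then by the count of `event`.
    buckets = defaultdict(lambda: defaultdict(list))
    for g in games:
        key = tuple(g[e] for e in EVENTS if e != event)
        buckets[key][g[event]].append(g["R"])

    total_diff = 0.0
    total_weight = 0
    for by_count in buckets.values():
        for n, base_runs in by_count.items():
            plus_runs = by_count.get(n + 1)   # the "plus one" games, if any
            if not plus_runs:
                continue
            weight = min(len(base_runs), len(plus_runs))
            diff = sum(plus_runs) / len(plus_runs) - sum(base_runs) / len(base_runs)
            total_diff += weight * diff
            total_weight += weight
    return total_diff / total_weight if total_weight else None


# Toy games, made up for illustration.
games = [
    {"1B": 5, "2B": 1, "3B": 0, "HR": 1, "BB": 2, "HBP": 0, "R": 3},
    {"1B": 5, "2B": 1, "3B": 0, "HR": 1, "BB": 3, "HBP": 0, "R": 4},
    {"1B": 5, "2B": 1, "3B": 0, "HR": 1, "BB": 3, "HBP": 0, "R": 3},
    {"1B": 8, "2B": 2, "3B": 0, "HR": 0, "BB": 1, "HBP": 0, "R": 5},
]
print(plus_one_value(games, "BB"))
```

Storing an estimator's output in place of "R" and running the same function would give that estimator's implied value for the event, which is the comparison the results table makes.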
And now, for the results of that trial. “Num” indicates how many of each event there were over the entirety of the Retro-era; Similarity is a measure of how close each run estimator is to the observed runs, weighted by the number of events in the sample. A smaller similarity score is better.
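| Event      | Num     | R    | EqR  | OPS  | OPS+ | GPA  | TA   | BsR  | RC   | wOBA | Reg  | House |
|------------|---------|------|------|------|------|------|------|------|------|------|------|-------|
| 1B         | 1322772 | 0.46 | 0.45 | 0.50 | 0.52 | 0.50 | 0.30 | 0.42 | 0.50 | 0.46 | 0.53 | 0.46  |
| 2B         | 331713  | 0.72 | 0.76 | 0.86 | 0.86 | 0.76 | 0.59 | 0.73 | 0.79 | 0.75 | 0.61 | 0.73  |
| 3B         | 47125   | 0.86 | 1.10 | 1.25 | 1.22 | 1.03 | 0.90 | 1.03 | 1.11 | 1.04 | 1.23 | 1.04  |
| HR         | 190386  | 1.36 | 1.42 | 1.63 | 1.57 | 1.31 | 1.19 | 1.44 | 1.37 | 1.39 | 1.45 | 1.39  |
| BB         | 642069  | 0.32 | 0.29 | 0.25 | 0.28 | 0.32 | 0.30 | 0.28 | 0.22 | 0.31 | 0.34 | 0.31  |
| HBP        | 52857   | 0.28 | 0.26 | 0.21 | 0.25 | 0.30 | 0.29 | 0.28 | 0.22 | 0.31 | 0.31 | 0.30  |
| Similarity |         |      | 0.04 | 0.11 | 0.10 | 0.04 | 0.13 | 0.05 | 0.07 | 0.03 | 0.09 | 0.03  |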
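One plausible reading of the Similarity row is a count-weighted root mean square of the gaps between an estimator's event values and the observed values. The sketch below uses the Num, R, wOBA and TA columns from the table; the function itself is an assumption about the calculation, not a confirmed formula, though it does land on the published wOBA and TA figures when rounded to two decimals.

```python
import math

# Values copied from the Num, R, wOBA and TA columns of the table.
NUM = {"1B": 1322772, "2B": 331713, "3B": 47125,
       "HR": 190386, "BB": 642069, "HBP": 52857}
OBSERVED = {"1B": 0.46, "2B": 0.72, "3B": 0.86, "HR": 1.36, "BB": 0.32, "HBP": 0.28}
WOBA = {"1B": 0.46, "2B": 0.75, "3B": 1.04, "HR": 1.39, "BB": 0.31, "HBP": 0.31}
TA = {"1B": 0.30, "2B": 0.59, "3B": 0.90, "HR": 1.19, "BB": 0.30, "HBP": 0.29}


def similarity(estimated, observed, counts):
    """Count-weighted root mean square gap (assumed form of the calculation)."""
    total = sum(counts.values())
    squared = sum(counts[e] * (estimated[e] - observed[e]) ** 2 for e in counts)
    return math.sqrt(squared / total)


print(round(similarity(WOBA, OBSERVED, NUM), 2))
print(round(similarity(TA, OBSERVED, NUM), 2))
```

Weighting by Num is what keeps the triple, with its large gaps but small sample, from dominating the score.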

Based upon our matched-pairs sample, a walk was worth an additional .32 runs on average during the Retro-era. Our robust linear weights measures, like wOBA and the “house” weights, match up very well here. Something like RC or OPS fares much more poorly, with values of .22 and .25 for the walk respectively. Those measures are going to underrate our high-walk players and overrate our low-walk players.
Something like TA, on the other hand, measures the walk about right, but vastly underrates the single. That means that it will underrate high-average, low-slugging players like Ichiro and overrate low-average, high-slugging players like Adam Dunn.
Going by similarity, we can see that measures like wOBA and the house weights are very close to the observed values; metrics like BsR, EqR and GPA also do very well here. (And bear in mind that BsR is not tuned to the environment, which could improve the accuracy here.) In the middle of the pack are things like Basic RC and the regression-based linear weights. The worst contestants are the OPS-derived measures and TA.
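To put a rough size on the walk bias, here is a back-of-the-envelope sketch. It multiplies the walk gaps quoted earlier (104 walks per 650 PA between Cust and Betancourt, 26 between the Red Sox and Royals) by the difference between an estimator's walk value and the observed .32. The function name is hypothetical, and the calculation ignores every other event.

```python
OBSERVED_BB = 0.32   # observed value of the walk from the matched-pairs table


def walk_bias_runs(walk_gap, estimator_bb, observed_bb=OBSERVED_BB):
    """Runs by which an estimator misjudges a gap of `walk_gap` walks."""
    return walk_gap * (estimator_bb - observed_bb)


# RC values the walk at .22 in the table.
player_bias = walk_bias_runs(104, 0.22)   # Cust vs. Betancourt, per 650 PA
team_bias = walk_bias_runs(26, 0.22)      # Red Sox vs. Royals, per 650 PA
print(round(player_bias, 1), round(team_bias, 1))
```

Under those assumptions RC shortchanges the gap between the two players by roughly 10 runs per 650 PA, against fewer than 3 runs for the two teams, which is why a team-level test barely notices the problem.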
Some reservations
This test isn’t perfect. You’ll note that until now I haven’t addressed the issue of the triple. Three things could be true here:
 Every serious run estimator ever devised overweights the triple significantly.
 This is a sampling error caused by the low number of triples in general.
 There is a selective sampling problem with the triple in doing a matched-pair study.
My personal feeling is that the problem is number three, but I have no evidence for this.
Selective sampling is certainly a problem for the stolen base and the caught stealing, however. Teams tend to attempt steals much more frequently in close games, which unduly biases the sample. This is why those terms are excluded from the study.
I don’t think these problems invalidate the study as a whole, but that’s only my opinion, and as with medicine you should probably seek a second one.
What’s the point?
There is nothing in the world requiring that you be entirely accurate, or even as accurate as is reasonably possible. If all you’ve ever used and all you ever want to use is OPS, I can’t stop you. And I can’t make sites that publish flawed run estimators correct their mistakes.
But if we want to be correct, or at least as correct as possible, we need to consider the potential biases of a run estimator, not only its accuracy on the whole. And after all, it is very easy to figure out the very easy cases. We don’t need a run estimator to tell us that Albert Pujols is good at hitting baseballs or that Michael Bourn isn’t. It’s the more difficult cases where we need run estimators. And those are the players most likely to test the biases of our estimates.
And if you are planning on testing your latest and greatest run estimation formula, please, spare us the same tired R and RMSE tests where only your formula gets the benefit of being tuned to the environment at hand. Thanks.
References & Resources
For more reading on the difference between runs per PA and runs per out, read this.
The similarity measures in the article were inspired by PECOTA’s sim scores, and are calculated by using a derivation of the Pythagorean Theorem (the one developed by Pythagoras, not the one developed by Bill James).
While Equivalent Runs grades out very well here, there are some concerns about the way Equivalent Average is figured, totally apart from its accuracy as a run estimator.
Some notes on the modifications made since the last batch of tests. Since I changed the dataset in use, I ran another regression to estimate the regression weights, which are now:
0.53*1B + 0.61*2B + 1.23*3B + 1.46*HR + 0.34*BB + 0.31*HBP - 0.11*IBB + 0.18*SB - 0.05*CS - 0.10*Outs
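Applying weights like these is just a weighted sum over a batting line. In the sketch below, the negative signs on IBB and Outs are an assumption (the signs did not survive in the formula as printed), and the function name and sample batting line are hypothetical.

```python
# Regression weights from the formula above. The signs on IBB and Outs
# are assumed to be negative.
REG_WEIGHTS = {
    "1B": 0.53, "2B": 0.61, "3B": 1.23, "HR": 1.46,
    "BB": 0.34, "HBP": 0.31, "IBB": -0.11,
    "SB": 0.18, "CS": -0.05, "Outs": -0.10,
}


def linear_weights_runs(line, weights=REG_WEIGHTS):
    """Sum of weight * count over every event; missing events count as zero."""
    return sum(w * line.get(event, 0) for event, w in weights.items())


# A made-up batting line, for illustration only.
sample_line = {"1B": 100, "2B": 30, "3B": 3, "HR": 20,
               "BB": 60, "HBP": 5, "IBB": 5, "SB": 10, "CS": 4, "Outs": 400}
print(round(linear_weights_runs(sample_line), 1))
```

Since BB already includes the intentional walk, the assumed negative IBB term works as a discount: an intentional walk nets out to 0.34 - 0.11 = 0.23 runs under these weights.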
wOBA is not converted to runs the same way as the other rates; it is in fact not converted to runs at all. Instead, I took the base weights used to generate wOBA (based upon work by Tom Tango) and applied them to the data, the same as the House weights. To clarify something: wOBA could be computed using the house or the regression weights as easily as the weights used; this is intended to measure the accuracy of the values provided by a specific implementation (the one in use at Fangraphs), rather than the concept of wOBA in particular.
Some great help was provided in this thread at Tango’s blog. Special thanks to Tango, Patriot and terpsfan. Terpsfan corrected a key error in figuring the value for CS in the updated House weights.