So let’s do some testing. How accurate are these things, anyway?
Ability vs. Value
I have written at length before on the basic principles of ability (or true-talent level) versus value. There’s just one point I want to come back to and reemphasize.
People tend to lean on defense-independent estimates of pitching performance because they better predict future performance. (And, strictly speaking, they do.) This leads to a lot of confusion about the issue, with some arguing that if we want to look at past performance, we should ignore defense-independent measures and look at actual results.
This is wrong for the same reason that we look at a pitcher’s ERA instead of his win-loss record. A team does not score the same number of runs every game; thus it is possible for different pitchers, even different pitchers on the same team, to have vastly different levels of run support. This is not a function of pitching, and the credit or blame for it should not rightly be assigned to the pitcher.
It is the same with defensive support. Two pitchers, even two pitchers on the same team, cannot be presumed to have the same quality of support from their defense. Defense-independent pitching statistics seek to give us a way to compare pitchers with different defensive support fairly.
But for a value measure, we do not care if a result came from luck or skill. We attribute defensive performance to the defense, not because the pitcher has no control over it, but because someone else does have control over it.
Home runs, on the other hand, are not under the purview of the defense (except in a few very rare cases). Thus, for a value metric, it is appropriate to credit a pitcher for the precise number of home runs allowed, not an estimate thereof.
A look at the contestants
This is not meant to be an exhaustive survey of the entrants. I have picked three stats that are readily available to the public, that are relatively easy to compute, and whose methods of computation have been made public. All are linear estimates, and their accuracy could be improved by creating dynamic versions of them.
- FIP, or Fielding Independent Pitching, was created by Tom Tango. It’s a simple estimate of how well a pitcher has pitched, given his walk, strikeout and home run rates. It follows the formula (HR*13+(BB+HBP-IBB)*3-K*2)/IP, with the addition of a constant, generally 3.2. For the purposes of this test, the constant has been figured separately for each season.
- xFIP is FIP with a pitcher’s home runs replaced by his fly balls allowed times the average number of home runs per fly ball (.13 in the sample I used for the test). It was created by our own Dave Studeman. Again, constants were computed separately for each season.
- tRA, created by Matthew Carruth and Graham MacAree, is a stat similar to FIP, with two key changes: it accounts for the type of batted ball allowed, and it estimates a pitcher’s outs as well as his runs allowed from the data at hand. What is in use here is not tRA as originally calculated but my reimplementation of it. Values were computed for each season, and tRA was then scaled to fit on the scale of ERA instead of runs per game.
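As a sketch, the FIP and xFIP formulas above can be written out directly. The constants here are the generic values mentioned in the text (3.2, and a .13 HR/FB rate); the test itself fit them separately for each season.

```python
# Sketch of the FIP and xFIP formulas described above. The default
# constants are the generic values from the text, not the per-season
# values actually used in the test.

def fip(hr, bb, hbp, ibb, k, ip, constant=3.2):
    """Fielding Independent Pitching: (HR*13 + (BB+HBP-IBB)*3 - K*2)/IP + C."""
    return (13 * hr + 3 * (bb + hbp - ibb) - 2 * k) / ip + constant

def xfip(fb, bb, hbp, ibb, k, ip, hr_per_fb=0.13, constant=3.2):
    """xFIP: FIP with actual home runs replaced by fly balls * league HR/FB."""
    return fip(fb * hr_per_fb, bb, hbp, ibb, k, ip, constant)
```

For example, a pitcher with 20 home runs, 50 walks, 5 hit batters, 5 intentional walks and 180 strikeouts in 200 innings comes out to a FIP of 3.45 under the generic constant.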
This is an ability test, not a value test. (Which is why we can include xFIP in the judging.) It’s similar to split-half reliability, but instead of using correlation I’m using root mean square error. That tells us the typical magnitude of the error between the two samples.
In other words, I split each pitcher season from 2003 through 2008 into two sets of games: those pitched on even-numbered days and those on odd-numbered days. I then tested how well performance in the even-numbered sample predicted performance in the odd-numbered sample. The results:
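A minimal sketch of that test: pair each pitcher-season’s even-day estimate with his odd-day ERA and compute an innings-weighted RMSE (the weighting is noted in the References). The data shape here is an assumption for illustration, not the actual Retrosheet pipeline.

```python
import math

def weighted_rmse(rows):
    """Innings-weighted root mean square error between split halves.

    rows: iterable of (estimate_even, era_odd, innings), one per
    pitcher-season. Assumed shape for illustration only.
    """
    num = sum(ip * (est - era) ** 2 for est, era, ip in rows)
    den = sum(ip for _, _, ip in rows)
    return math.sqrt(num / den)
```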
ERA predicts future ERA rather poorly, with a staggering RMSE of 2.32. In other words, a pitcher with an ERA of 4.00 in the even-numbered sample typically had an ERA ranging anywhere from 1.68 to 6.32. That tells us hardly anything at all. Our best estimator, xFIP, gives us a range of 2.22 to 5.78, much better but still not particularly helpful.
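The ranges quoted here are simply the first-half ERA plus or minus the RMSE; as a trivial sketch:

```python
def era_range(observed_era, rmse):
    """One-RMSE band around an ERA observed in the first split half."""
    return (observed_era - rmse, observed_era + rmse)
```

Applied to ERA’s RMSE of 2.32, an observed 4.00 gives the 1.68-to-6.32 band above; xFIP’s narrower band implies an RMSE of about 1.78.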
One reason none of these estimates comes very close is that they average only 30 innings per split half. As the number of innings pitched goes up, our RMSEs go down:
Please do not attend to any one number too closely; when we slice the data up like this, we introduce small errors due to sample size.
But note this: even with the smallest RMSE, a pitcher with 110 innings in each split half and an ERA of 4.00 in the first half typically has a range of 3.26 to 4.74 in the second. It’s intensely difficult to tell who the best and worst pitchers are, even with a seemingly large number of innings pitched.
The answer is to use more innings. Use as many innings as you can. And regress to the mean. This is what we have projection systems for, and at some point of complexity in our ERA estimators it’s better to admit what our purpose is and turn to a projection system instead.
And as the number of innings increases, the difference between our entrants shrinks rapidly. It seems that very quickly we run up against a point of diminishing returns from incorporating batted ball data into our estimates.
Baby out with the bathwater
One thing to note is that all of these are component ERAs, which means they calculate an estimated ERA (or RA) from a pitcher’s components. This, for one thing, strips out a lot of “luck,” if we want to define luck as timing. In other words, an inning that looks like:
Walk, Strikeout, Groundout, Home Run, Strikeout
has a very different result than
Home Run, Strikeout, Walk, Strikeout, Groundout
even though the pitcher had a roughly equivalent performance. That’s typically “luck” or “noise,” and over time it cancels out.
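To make that order-independence concrete, here is a toy component view that assigns each event a run value and simply sums them. The values are hypothetical round numbers for illustration, not the carefully fit linear weights any of these metrics actually uses.

```python
# Hypothetical per-event run values, for illustration only.
EVENT_VALUE = {"walk": 0.3, "strikeout": -0.1, "groundout": -0.1, "home run": 1.4}

def component_runs(events):
    """Sum of per-event run values; the ordering of events never matters."""
    return sum(EVENT_VALUE[e] for e in events)

inning_a = ["walk", "strikeout", "groundout", "home run", "strikeout"]
inning_b = ["home run", "strikeout", "walk", "strikeout", "groundout"]
```

Both innings score identically under this view, even though on the field the first sequence allows two runs and the second only one.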
But does it always? Perhaps not. There are talents, such as the ability to pitch well out of the stretch, or the ability to induce more grounders in double play situations, that may not be accounted for here. In those cases, we’re simply throwing the baby out with the bathwater: discarding talent along with noise.
And for a value metric, we don’t want to wipe out all the noise in a pitcher’s performance; we only want to neutralize the effect of his defense.
Is there a way for us to account for a pitcher’s defense, without resorting to component ERA? I think there is. Simple Zone Rating should suffice nicely, don’t you think?
See you next week.
References & Resources
RMSE was computed using a weighted average.
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org.