Standard deviation and ERA estimators
by Glenn DuPaul
February 06, 2013
Sabermetrics is my passion, but that does not mean it always has been.
One of the main reasons I became interested in studying baseball statistics was fielding independent pitching (FIP) and ERA estimators. Over the past few months, my interest in ERA estimators turned into an obsession. During that time, I developed predictive FIP (pFIP), an ERA estimator of my own.
In almost every test that I ran, pFIP came out ahead of other, more established ERA estimators (it had a higher correlation). This led me to believe that pFIP was the best ERA estimator currently available, but that in no way meant that the metric was without its own flaws.
Below I listed the standard deviation for pFIP, FIP, ERA and two other ERA estimators (SIERA and xFIP), for pitchers who threw at least 100 innings in a season for the years 2010-12:
Unsurprisingly, on a single-season basis, ERA has the widest distribution, while pFIP has the tightest.
Quite honestly, the fact that pFIP's standard deviation is significantly smaller than those of xFIP and SIERA (which are known for typically having small standard deviations) was a cause for concern.
In Colin Wyers' piece on SIERA and other ERA estimators this issue is discussed in great detail:
In a real sense, that’s what we do whenever we use a skill-based metric like xFIP or SIERA. We are using a proxy for regression to the mean that doesn’t explicitly account for the amount of playing time a pitcher has had. We are, in essence, trusting in the formula to do the right amount of regression for us. And like using fly balls to predict home runs, the regression to the mean we see is a side effect, not anything intentional.
Simply producing a lower standard deviation doesn't make a measure better at predicting future performance in any real sense; it simply makes it less able to measure the distance between good pitching and bad pitching. And having a lower RMSE based upon that lower standard deviation doesn't provide evidence that skill is being measured. In short, the gains claimed for SIERA are about as imaginary as they can get, and we feel quite comfortable in moving on.
My understanding of Colin's argument is that metrics like xFIP and SIERA "crudely" regress each pitcher to the mean, which would lead to a higher correlation (lower RMSE), but at the same time they may not be an accurate measure of a pitcher's true talent level.
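The effect Colin describes can be sketched with simulated numbers. This is a rough illustration rather than a test of any real estimator: the talent and noise spreads are made up, and `shrink` is a hypothetical helper that pulls every pitcher toward the mean by the same proportion. Real estimators regress their components unevenly, so their correlations can change as well.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def rmse(predicted, actual):
    """Root mean squared error of predictions against outcomes."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return (sum(squared) / len(squared)) ** 0.5


def shrink(values, keep):
    """Pull each value toward the sample mean, keeping `keep` of its deviation."""
    mean = statistics.fmean(values)
    return [mean + keep * (v - mean) for v in values]


# Made-up simulation: a fixed talent level plus independent noise each season.
rng = random.Random(0)
talent = [rng.gauss(4.00, 0.50) for _ in range(500)]
estimate = [t + rng.gauss(0.0, 0.60) for t in talent]  # this season's metric
next_era = [t + rng.gauss(0.0, 0.90) for t in talent]  # next season's ERA

# Same information as `estimate`, only compressed toward the mean.
shrunk = shrink(estimate, 0.4)

for label, metric in (("raw", estimate), ("shrunk", shrunk)):
    print(
        label,
        "SD:", round(statistics.pstdev(metric), 3),
        "RMSE:", round(rmse(metric, next_era), 3),
        "r:", round(pearson(metric, next_era), 3),
    )
```

In this setup the shrunk version has a smaller standard deviation and a lower RMSE, yet its correlation with next season's ERA is the same as the raw version's, because it ranks and spaces the pitchers identically. The lower error comes from the compression alone, not from any added knowledge about skill.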
It is evident that of the four ERA estimators discussed in this piece, pFIP has the largest regression to the mean for each individual. This fact brings me to a question whose answer I've found myself switching sides on countless times.
What is the point of an ERA estimator?
There are two answers to that question that could hold serious weight in an argument:
- To be the best at predicting (highest correlation with) future ERA
- To be the best representation of a pitcher's true talent level
It would be nice if an ERA estimator came along that could fulfill both of those requirements, but I would argue that that is not the case.
For the first (high correlation with future or next season ERA), the estimation should be seriously regressed to the mean. But when one is attempting to estimate a pitcher's true talent level, should that regression be as harsh? At lower innings pitched totals there should be some regression, but it should not be nearly as strong as when the goal is to simply predict future ERA.
My main issue with the true talent level idea for ERA estimators is how difficult it actually is to calculate that number. An ERA estimator that reflected a pitcher's skill should be able to account for all of the possible factors within the pitcher's control and correctly weed out all of the other factors around the pitcher. The problem is that this is nearly impossible to do.
In the extreme, relievers throw so few innings on a per-season basis that by the time they throw enough innings for us to get a fair idea of their ERA talent, years will have passed. And in all likelihood, their true talent will have changed. Even for starters who throw more innings, their true talent level is tough to decipher out of all the different factors that go into run prevention.
The consensus at this point is that the estimator with a standard deviation as wide as one's true talent ERA and a high correlation with future ERA is the best at measuring true talent. However, there are issues with this approach too.
Pinning down an exact number for the standard deviation of a pitcher's true talent ERA is difficult. This issue was raised in a FanGraphs Community Post by Steve Staude. He showed that from 200 to 1,000 innings pitched, the standard deviation of ERA falls from about 0.8 to 0.5 as the innings increase.
I think most would agree that true talent does not reveal itself at the 200-inning mark, but then where? 500? 750? 1,000?
Most pitchers never reach 1,000 career innings; many do not reach 500. It also takes most pitchers at least three seasons to reach 500 innings, and it seems reasonable that an individual's talent level could change significantly over the course of those seasons.
For a moment though, let's ignore that and look to Wyers' original piece to see that ERA true talent seems to be revealed somewhere between 400 and 500 innings. According to Staude's study, the standard deviation of ERA between 400 and 500 innings ranges from about 0.65 to 0.6; thus, it would make logical sense that an ERA estimator with a high correlation with future ERA and a standard deviation of around 0.6 or 0.65 would be the best true talent estimator.
Interestingly, if we look at the standard deviation that I found for FIP in this article, it falls right in line with that logic. FIP has a higher correlation (in small to medium samples) with future ERA than ERA and it has a standard deviation that is similar to "true talent" ERA. This assumption also falls in line with a trend we often see: A pitcher's career FIP lines up fairly closely to his career ERA.
The fact that most logic would lead one to conclude that FIP is the best true talent ERA estimator we have available fascinates me.
Why? Because the structure of FIP is in no way meant to predict future ERA.
FIP is commonly used in that fashion because it does a fairly good job of predicting future ERA, but that is not the statistic's purpose. FIP is meant to be a describer of a pitcher's performance that is scaled to look like ERA. It's best described as what a pitcher's ERA should have been. That type of description may make FIP sound similar to a true talent evaluator, but the statistic is not designed around, or meant to describe, future performance.
This idea brought about the birth of pFIP.
pFIP regresses the components of FIP (strikeouts, walks, home runs) to predict future performance rather than describe past performance. In plain English, that idea sounds great and, interestingly, the math works out, too.
FIP's more volatile components (walks and homers) receive a fair amount of regression, while strikeouts (the least volatile) receive little or no regression. These regressions result in a stronger correlation with next season ERA than simply using FIP.
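The general idea of regressing components by different amounts can be sketched as follows. This is not the pFIP equation, which is not reproduced in this piece. It uses the standard FIP weights (13 for home runs, 3 for walks, -2 for strikeouts) with hit batters left out for simplicity, and everything else is an assumption made up for illustration: the constant, the league-average rates and the share of each deviation that is kept.

```python
# Standard FIP weights, applied here to per-inning rates.
HR_WEIGHT, BB_WEIGHT, K_WEIGHT = 13.0, 3.0, -2.0
FIP_CONSTANT = 3.10  # assumed value; the real constant is recalculated each season

# Hypothetical league-average rates per inning pitched.
LEAGUE = {"k": 0.85, "bb": 0.33, "hr": 0.11}

# Hypothetical share of each deviation from league average that is kept:
# strikeouts are barely regressed, walks somewhat, home runs heavily.
KEEP = {"k": 0.9, "bb": 0.6, "hr": 0.3}


def fip_like(rates):
    """FIP-style value from per-inning strikeout, walk and home run rates."""
    return (
        HR_WEIGHT * rates["hr"]
        + BB_WEIGHT * rates["bb"]
        + K_WEIGHT * rates["k"]
        + FIP_CONSTANT
    )


def regress(rates, keep=KEEP, league=LEAGUE):
    """Pull each component toward the league rate by its own amount."""
    return {
        name: league[name] + keep[name] * (rates[name] - league[name])
        for name in rates
    }


# A made-up pitcher with strong peripherals across the board.
ace = {"k": 1.10, "bb": 0.22, "hr": 0.06}

print("league:   ", round(fip_like(LEAGUE), 3))
print("raw:      ", round(fip_like(ace), 3))
print("regressed:", round(fip_like(regress(ace)), 3))
```

With these made-up inputs the pitcher's raw figure is about 2.34, the regressed figure is about 2.98 and the league figure is 3.82. Most of the movement toward the mean comes from the home run rate, which is the component regressed most heavily here.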
But is pFIP really saying anything about a pitcher's true talent level?
I would argue that it may give one some indication of a pitcher's talent, but it is not a true talent evaluator. If you look at the pFIP equation, the standard deviation of pFIP or an individual's numerical pFIP, what the statistic actually does becomes very clear.
Essentially, pFIP starts each individual's ERA projection at the same point (the mean ERA) and then moves each number slightly away from the mean based on the player's individual peripherals. This strategy works great when your goal is to predict with the highest rate of success, but it does not give you a great idea of a pitcher's actual true ERA or skill.
Thus, when one decides to evaluate pFIP as a statistic one must return to our original question: What is the purpose of an ERA estimator?
pFIP is essentially useless if you'd like to evaluate a pitcher's talent level, but if your goal is to predict next season's ERA then pFIP will serve you well. However, if predicting future ERA is the only real purpose of pFIP, then is there any real reason for the statistic?
Projection systems are a very real thing, and their goal (at least from what I understand) is to do exactly what pFIP does: project future performance. I've shown before that pFIP is fairly comparable to projection systems when looking at overall correlation with next season runs (or ERA). Although I'm fairly certain that simple correlation with the next season is not the best way to test how well a projection system works, let's say for a moment that it is.
Is pFIP really better than or equivalent to a projection system?
The short answer is quite obviously no, but the evidence behind that assessment is fairly educational.
I'm not saying this is true, but consider a fantasy scenario where pFIP has exactly the same correlation with future ERA as an average projection system. How would we test which one was actually doing a better job? A good starting point would be to consider the standard deviations of the projected ERAs.
I looked at the standard deviations of ERA projections for three projection systems (Marcel, Bill James and ZIPS) for the years 2010-2012 for pitchers who were projected to have at least 100 innings in that season:
*ZIPS does not project playing time, so Marcel's playing time projections were used for the pitchers in the sample.
Under the assumption that pFIP and the projection systems have similar or equivalent correlations, it would seem that projection systems do a better job at really projecting future performance/skill as their distributions are wider.
This should not really be too surprising, as projection systems take a great deal more information into account than pFIP. Projection systems are also more useful because they project playing time and counting stats as well as the rate stats (like ERA, FIP, etc.).
This all brings me back again to my original question: What is the point of an ERA estimator? Or to be more clear, if we have projection systems, then what is the point of ERA estimators?
I can think of only two answers to that question.
The first is that some people don't trust or find utility in projection systems and thus find sanctuary in using much simpler ERA estimators, which are still fairly predictive and easier to understand. The second is that ERA estimators should be a reflection of a pitcher's true talent level, which, of course, is almost impossible to define.
References and Resources
All statistics come courtesy of FanGraphs.