Should we be using ERA estimators during the season?
by Glenn DuPaul, September 19, 2012
The 2012 season is quickly coming to a close. September is the time of year when baseball fans and writers are either looking forward to the playoffs or looking back on the season and wondering what could have been.
Before the season and during the season, the mindset of the baseball community is different. For example, a major talking point before any season is projections. The various systems (Oliver, PECOTA, Marcel, ZIPS, Steamer, etc.) release their projections and it leads to much excitement and discussion over who could collapse, break out, or hold serve in the coming season. In a recent post at the Book Blog, Tom Tango posed the question of whether this was a bad season for forecasters (projection systems).
It was an interesting question, something people begin thinking about this time of year, but it also piqued my interest in another question.
There's a group of statistics in the sabermetric community known as "ERA estimators." These statistics are based on outcomes that are more under a pitcher's control (strikeouts, walks, groundballs, home runs), typically known as peripherals. They attempt to forecast where a pitcher's ERA is going to move in the future.
The most common ERA estimators currently are fielding independent pitching (FIP), expected fielding independent pitching (xFIP), skill-interactive ERA (SIERA) and true ERA (tERA).
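For readers who have not seen one of these estimators written out, FIP is the simplest of the four. Below is a minimal sketch of its commonly published form. The function name, the sample inputs and the constant of 3.10 are placeholders for illustration; the real constant is recalculated for each league and season so that league FIP matches league ERA.

```python
def fip(hr: int, bb: int, hbp: int, k: int, ip: float, constant: float = 3.10) -> float:
    """Fielding independent pitching in its commonly published form.

    ip must be true innings (50 and one third is 50.333..., not the box-score "50.1").
    constant is a placeholder; the real value is league- and season-specific.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Made-up stat line for a hypothetical pitcher, for illustration only.
example_fip = fip(hr=10, bb=30, hbp=5, k=100, ip=100.0)
print(round(example_fip, 2))
```

Only home runs, walks, hit batters and strikeouts enter the formula, which is why it is called fielding independent. xFIP, SIERA and tERA build on the same idea with more inputs.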
How well do these estimators typically work?
- Matt Swartz has shown that SIERA is the best predictor of next season ERA
- Bill Petti showed that for pitchers who pitch in more hitter-friendly parks xFIP and FIP perform better than SIERA
- Colin Wyers showed that when you get out of the season-to-season comparison, which can have a great deal of random variation, ERA begins to perform the best at predicting itself
My goal is to not re-hash these studies, but instead to delve into what happened just this season.
The three studies I mentioned above looked at large(ish) sample sizes: year-to-year data or bigger. Typically, when we study baseball statistics we look for a large sample size. Because there is so much random variation and noise in baseball, it's tough to get a full picture of what truly happened when dealing with a smaller sample. In many instances, one season of data isn't a large enough sample for some statistics, which might sound crazy to some, but it's true.
The idea
Writers in the sabermetric community, myself included, talk about ERA estimators during the season fairly frequently. For example, this made-up quote would be a fairly common thing to read on sabermetric websites in the middle months of the season:
Pitcher X's ERA (2.50) is much lower than his xFIP (4.50). This result indicates that Pitcher X has probably been lucky, and his ERA will regress back closer to his xFIP as the season goes forward.
That idea is fairly commonly accepted. If a pitcher's xFIP, FIP and SIERA are significantly above or below his current ERA, then the assumption for most is that his ERA will move back either up or down toward those numbers.
The original goal of this post was to split the season in half, and look at how the ERA estimators have done in terms of predicting ERA and runs allowed per nine innings for the second half of the season. Essentially, I agreed with the commonly accepted idea that a pitcher's ERA estimators at the midpoint of the season were better indicators of where his ERA was trending than his actual ERA.
I thought that although half a season of baseball is a small (random) sample size, it is still valuable for teams to know what they should expect from their pitchers in the second half. This information would be useful in certain midseason decisions front offices have to make. Some examples would be:
- Sending players down to the minors
- Moving a pitcher out of the rotation
- Deciding if your team was a contender
- Deciding what players to trade away or which players to target in a trade.
A quick example of this idea comes from the comparison between the Angels acquiring Zack Greinke and the Rangers acquiring Ryan Dempster at this season's trading deadline.
At that time, Greinke had an ERA of 3.44, while Dempster's ERA was 2.55. But Greinke's xFIP was 2.82, while Dempster's was 3.73. Thus, many predicted positive regression for Greinke's ERA with the Angels, and negative regression for Dempster's ERA with the Rangers.
Please note that I understand we'd expect their ERAs to fluctuate somewhat anyway after the trade. Both players were changing leagues and ballparks, and would have different defenses playing behind them. At the same time, had both those pitchers stayed with their original ball clubs, the assumption that Greinke would have positive regression and Dempster would have negative regression would still likely have been the consensus.
The study
For this study I used July 1 as the cutoff point. Then I looked only at starting pitchers who had at least 50 innings pitched before July 1 and at least 45 innings pitched after that date.
I found the ERA, FIP, xFIP, SIERA and tERA for each qualifying pitcher from the beginning of the season to July 1, then regressed those numbers against their runs allowed (RA9) and ERA for the second half of the season (July 1-Sept. 16). I also added in an extremely simple baseline, strikeouts minus walks divided by innings pitched, or (K-BB)/IP, as another predictor. Interestingly, exactly 100 starters qualified for the sample.
Also, please note that although 50 and 45 respectively were the minimum number of innings, the average number of innings thrown before July 1 for the sample was 92 innings, and the average number thrown after July 1 was 81 innings. So a good portion of these numbers are based on close to 100 innings, which is still not a great sample, but at least feels a lot better than 45-50 innings.
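The sample filter and the baseline described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PitcherSplit` record, its field names and the three pitchers are made up, and the sketch assumes the list has already been limited to starting pitchers.

```python
from dataclasses import dataclass


@dataclass
class PitcherSplit:
    """Hypothetical record of one starter's split at the July 1 cutoff."""
    name: str
    first_half_ip: float    # true innings before July 1
    second_half_ip: float   # true innings after July 1
    first_half_k: int
    first_half_bb: int      # intentional walks left in, as in the article


MIN_FIRST_HALF_IP = 50.0
MIN_SECOND_HALF_IP = 45.0


def qualifies(p: PitcherSplit) -> bool:
    """Apply the article's innings minimums on both sides of the cutoff."""
    return p.first_half_ip >= MIN_FIRST_HALF_IP and p.second_half_ip >= MIN_SECOND_HALF_IP


def k_minus_bb_per_ip(p: PitcherSplit) -> float:
    """The simple baseline: (K - BB) / IP from the first half."""
    return (p.first_half_k - p.first_half_bb) / p.first_half_ip


# Made-up pitchers, for illustration only.
pitchers = [
    PitcherSplit("Pitcher A", 92.0, 81.0, 85, 25),
    PitcherSplit("Pitcher B", 48.0, 70.0, 40, 20),   # too few first-half innings
    PitcherSplit("Pitcher C", 60.0, 30.0, 55, 18),   # too few second-half innings
]

sample = [p for p in pitchers if qualifies(p)]
baseline = {p.name: k_minus_bb_per_ip(p) for p in sample}
print(baseline)
```

Note that the baseline is not on an ERA scale. It is a rate where higher is better, which is why its regression coefficient against ERA or RA9 comes out negative.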
The results
First, I ran a simple linear regression for each predictor against the pitcher's second-half runs allowed (RA9). In the table below, I list both the r-squared and the root mean squared error (RMSE) for each predictor in the sample.
For those who aren't statistically savvy, r-squared shows the percentage of variation in what we are trying to predict (RA9) that is explained by our predictor (ERA, xFIP, etc.). A higher r-squared shows a stronger relationship between the predictor and outcome.
The root mean squared error shows us how far, on average, our prediction is from the actual outcome, in runs per nine innings; thus, a lower number shows a stronger relationship.
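For readers who want to see how these two measures come out of a simple linear regression, here is a minimal sketch in plain Python. The five data points are invented and the function names are my own. The RMSE below divides by n, while some statistical packages divide by n minus 2 for a one-predictor model, so it may not match the reported figures exactly.

```python
import math


def fit_line(x: list[float], y: list[float]) -> tuple[float, float]:
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared_and_rmse(x: list[float], y: list[float]) -> tuple[float, float]:
    """Share of variance in y explained by x, and the typical size of a miss."""
    slope, intercept = fit_line(x, y)
    predictions = [intercept + slope * xi for xi in x]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(y))


# Invented first-half (K-BB)/IP and second-half RA9 values, for illustration only.
first_half_k_bb = [0.55, 0.40, 0.62, 0.30, 0.48]
second_half_ra9 = [3.9, 4.6, 3.2, 4.4, 4.8]

r2, rmse = r_squared_and_rmse(first_half_k_bb, second_half_ra9)
print(f"r-squared = {r2:.3f}, RMSE = {rmse:.3f}")
```

An r-squared near zero with an RMSE above one run, as in the tables that follow, means the predictor is telling us very little about the second half.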
Here are the RA9 single regression results:
| Predictor | R-Squared | RMSE |
|---|---|---|
| (K-BB)/IP | 9.14% | 1.207 |
| SIERA | 6.19% | 1.246 |
| xFIP | 4.65% | 1.267 |
| FIP | 2.92% | 1.290 |
| ERA | 1.86% | 1.304 |
| tERA | 0.43% | 1.343 |
RA9 is a better statistic than ERA, but, as I noted from the outset, these metrics are supposed to be ERA estimators, not RA9 estimators (for better or worse).
This is most likely why we see a near-zero r-squared for tERA, because it is scaled on purpose to predict ERA, instead of RA9.
So I ran simple linear regression for the predictors against ERA, as well:
| Predictor | R-Squared | RMSE |
|---|---|---|
| (K-BB)/IP | 8.84% | 1.092 |
| SIERA | 5.99% | 1.127 |
| xFIP | 4.48% | 1.145 |
| tERA | 3.04% | 1.162 |
| FIP | 2.42% | 1.170 |
| ERA | 1.45% | 1.185 |
These numbers jibe fairly well with the single-season results from the three studies I referred to at the outset of the article.
The most shocking result is that for both tests, the predictor with the highest r-squared and the lowest root mean squared error was the simple baseline of strikeouts minus walks divided by innings pitched.
In Swartz's study, the second-best predictor of Year 2 ERA, behind SIERA, was a statistic known as kwERA (strikeout-to-walk ERA), which uses only strikeouts and walks. I actually considered kwERA for my baseline, as it does a better job of weighting the value of strikeouts and walks, and is already on an ERA scale. But I wanted to keep my baseline as simple as possible, so I just used simple subtraction, and even left intentional walks in the data.
Interestingly, strikeouts minus walks still ended up being the best predictor.
Simply comparing six separate single-predictor regressions isn't as effective an analysis as running a multiple regression that includes all six predictors at the same time. So I ran a multiple regression with all six predictors thrown in.
The first table is the SPSS readout of coefficients for the RA9 test:
| RA9 predictor | B (unstandardized) | Std. Error (unstandardized) | Beta (standardized) | t-score | Sig. |
|---|---|---|---|---|---|
| (Constant) | 5.671 | 2.144 | | 2.645 | 0.01 |
| K-BB | -2.074 | 1.292 | -0.352 | -1.605 | 0.112 |
| ERA | 0.14 | 0.173 | 0.13 | 0.808 | 0.421 |
| FIP | -0.229 | 0.437 | -0.171 | -0.524 | 0.602 |
| xFIP | -0.018 | 1.104 | -0.009 | -0.016 | 0.987 |
| tERA | 0.074 | 0.341 | 0.057 | 0.215 | 0.829 |
| SIERA | -0.023 | 1.223 | -0.012 | -0.018 | 0.985 |
The second table is the SPSS readout of coefficients for the ERA test:
| ERA predictor | B (unstandardized) | Std. Error (unstandardized) | Beta (standardized) | t-score | Sig. |
|---|---|---|---|---|---|
| (Constant) | 5.388 | 2.108 | | 2.67 | 0.009 |
| K-BB | -2.015 | 1.216 | -0.363 | -1.657 | 0.101 |
| ERA | 0.156 | 0.163 | 0.153 | 0.856 | 0.342 |
| FIP | -0.332 | 0.412 | -0.263 | -0.807 | 0.422 |
| xFIP | 0.002 | 1.039 | 0.001 | 0.002 | 0.998 |
| tERA | 0.099 | 0.321 | 0.081 | 0.308 | 0.758 |
| SIERA | 0.004 | 1.151 | 0.002 | 0.003 | 0.997 |
The column we want to look at here is titled "Sig." This column gives the statistical significance of each predictor. For most tests, a predictor is considered statistically significant once the value goes below 0.05. As you can see from both of these results, none of the predictors is statistically significant; strikeouts minus walks comes the closest.
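For anyone curious where the t-score and "Sig." columns come from, the t-score is simply the coefficient B divided by its standard error. The sketch below recomputes it from the rounded RA9 table values. The p-value function uses a normal approximation that I chose to keep the example free of extra libraries; SPSS uses the t distribution, so its "Sig." values will be slightly larger.

```python
import math

# (B, Std. Error, reported t-score) copied from the RA9 coefficient table above.
ra9_rows = {
    "(Constant)": (5.671, 2.144, 2.645),
    "K-BB": (-2.074, 1.292, -1.605),
    "ERA": (0.14, 0.173, 0.808),
    "FIP": (-0.229, 0.437, -0.524),
    "xFIP": (-0.018, 1.104, -0.016),
    "tERA": (0.074, 0.341, 0.215),
    "SIERA": (-0.023, 1.223, -0.018),
}


def t_score(b: float, std_error: float) -> float:
    """A coefficient measured in units of its own standard error."""
    return b / std_error


def approx_two_sided_p(t: float) -> float:
    """Normal approximation to the two-sided p-value (an assumption, not SPSS's method)."""
    return math.erfc(abs(t) / math.sqrt(2))


for name, (b, se, reported_t) in ra9_rows.items():
    t = t_score(b, se)
    print(f"{name:<10} t = {t:7.3f} (table: {reported_t:7.3f})  approx p = {approx_two_sided_p(t):.3f}")
```

The recomputed t-scores differ from the table only by rounding. The large standard errors on xFIP and SIERA, each more than one run, are what push their t-scores toward zero.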
I found that putting all of the predictors together did not really improve the r-squared we found from just using K-BB/IP:
| | Multiple regression r^2 | K-BB/IP r^2 |
|---|---|---|
| RA9 | 10.10% | 9.1% |
| ERA | 10.40% | 8.8% |
I also found that K-BB/IP was a statistically significant predictor on its own, but when the other predictors were added it no longer was statistically significant. This is most likely due to a degrees of freedom issue (sample size of 100 with six predictors), but as I've already got into too much statistical jargon, I'll just leave that be.
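One hedged way to illustrate the cost of the extra predictors is adjusted r-squared, a standard correction that penalizes a model for each predictor it uses. I did not run this as part of the study; the sketch below just applies the textbook formula to the RA9 r-squared values reported above, with a sample of 100.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Textbook adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_PITCHERS = 100

# RA9 r-squared values from the tables above.
multiple_adj = adjusted_r_squared(0.1010, N_PITCHERS, n_predictors=6)
single_adj = adjusted_r_squared(0.0914, N_PITCHERS, n_predictors=1)

print(f"six predictors: {multiple_adj:.4f}")
print(f"(K-BB)/IP only: {single_adj:.4f}")
```

After the adjustment, the six-predictor model comes out below the single-predictor model, which fits the finding that adding the other estimators bought essentially nothing.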
Of the 100 pitchers in the sample, 13 changed teams at some point during this season. As I noted with the Greinke/Dempster comparison earlier, this could have an effect on the results. Future ERAs could fluctuate when a pitcher changes leagues, teams and home ballparks. So I checked to see how removing those pitchers would affect the results.
Below, I listed the r-squareds for the predictors for the 87 pitchers who have stayed with the same team all season:
| Predictor | ERA r^2 | RA9 r^2 |
|---|---|---|
| (K-BB)/IP | 13.01% | 12.65% |
| SIERA | 8.40% | 8.18% |
| xFIP | 6.87% | 6.72% |
| tERA | 7.28% | 0.75% |
| FIP | 5.17% | 5.48% |
| ERA | 1.70% | 2.30% |
Removing the 13 starters who changed teams improved the overall r-squareds slightly, but did not really change the two orders we saw with the original sample that included those starters.
Putting it all together
The number of tables and tests I just went through was probably exhausting, but I think it was pretty meaningful.
Most of these statistics become more meaningful as the sample size grows larger. You could classify all this information as simply small sample size noise. I'm looking at less than one season worth of data, for just 100 starters (or only 87 if you prefer those numbers). There's a lot to be said for that argument.
ERA and RA9 in general are subject to a good deal of random variation and noise. These predictors were regressed against a sample of ERAs and RA9s that came from between 50.1 and 102 innings pitched. I think there's a possibility that this analysis could be run again with the numbers from 2011, and we'd see a different predictor come out on top, solely because of that noise.
At the same time, I think these results should be taken as both a lesson and a cautionary tale. The ERA estimators that were tested (xFIP, FIP, SIERA and tERA) all did a better job of predicting future ERA than actual ERA, which was to be expected and is the normal assumption in the sabermetric community. But although they did better than ERA, simply subtracting walks from strikeouts did a better job of predicting ERAs for the second half than any of the advanced statistics.
I'm not trying to say that we should move away from FIP and other ERA estimators and simply use strikeouts and walks to attempt to predict how many runs a pitcher will give up in the future.
The highest r-squared (0.13055) I found came from K-BB/IP in the 87-pitcher sample. That number still tells us that more than 86 percent of the variation in second-half ERA was left unexplained by the predictor, which isn't very good at all.
Instead, my point is that maybe we shouldn't even be using the results of the first half to attempt to predict ERAs for the second half of the season.
For example, before July 1, Kyle Lohse's ERA was 2.82, but his xFIP was 4.19. The normal assumption would be that Lohse had been lucky and we should trust his xFIP and assume that his ERA would regress negatively in the second half.
His post-July 1 ERA is 2.81, essentially the same as it was during the first half. This is an extreme example, but I think it is something to learn from.
Maybe too often those in the sabermetric community simply assume that pitchers will regress toward their peripherals as the season goes on. But most of the time that regression doesn't have time to occur in just half of a season.
Those who have read about sabermetrics long enough are probably sick of the phrase small sample size (SSS!!!). But, I think people who write about sabermetrics still fall prey to small sample sizes. I did when I began the idea for this article. I simply assumed that the ERA estimators from the first half would have a pretty strong correlation to second half ERA and RA9 numbers, and I was ready to write about which had been doing the best job this season. Then I found the results and realized that none had been doing well. And not only that, but something as simple as subtracting walks from strikeouts did better.
Therein lies the rub. In small samples, baseball statistics are still very unpredictable, even when using the most "advanced metrics" that were created to predict them.
So, next June when a starting pitcher has an ERA over five, but a SIERA in the mid-threes, please be wary of assuming that his ERA will regress over the next three months of the season.
References and Resources
All statistics come courtesy of FanGraphs and are updated through Sunday, Sept. 16.
Glenn is an Economics major at Lehigh University. He works as a Research & Development intern for Baseball Info Solutions. He also writes about sabermetrics for Beyond the Box Score. You can follow him on Twitter @Glenn_DuPaul.
I love sabermetrics, and this article is great. I’m really glad you tackled this question, Glenn, and the results confirmed my suspicion, something I’ve been yelling about for a while. These are good estimators, but let’s remember what FIP stands for: Fielding Independent Pitching! It shouldn’t come as a surprise that it (and other estimators) perform poorly within a season, since most of the time the pitcher is pitching in front of the same defense. Obviously there are other factors at play as well, but too often, people have been using these tools to suggest future results without strong premises for doing so.