The 2012 season is quickly coming to a close. September is the time of year when baseball fans and writers are either looking forward to the playoffs or looking back on the season and wondering what could have been.

Before the season and during the season, the mindset of the baseball community is different. For example, a major talking point before any season is projections. The various systems (Oliver, PECOTA, Marcel, ZIPS, Steamer, etc.) release their projections and it leads to much excitement and discussion over who could collapse, break out, or hold serve in the coming season. In a recent post at the Book Blog, Tom Tango posed the question of whether this was a bad season for forecasters (projection systems).

It was an interesting question, something people begin thinking about this time of year, but it also piqued my interest in another question.

There’s a group of statistics in the sabermetric community known as “ERA estimators.” These statistics are based on outcomes that are more under a pitcher’s control (strikeouts, walks, groundballs, home runs), typically known as peripherals. They attempt to forecast where a pitcher’s ERA is going to move in the future.

The most common ERA estimators currently are fielding independent pitching (FIP), expected fielding independent pitching (xFIP), skill-interactive ERA (SIERA) and true ERA (tERA).

How well do these estimators typically work?

- Matt Swartz has shown that SIERA is the best predictor of next-season ERA.
- Bill Petti showed that for pitchers who pitch in more hitter-friendly parks, xFIP and FIP perform better than SIERA.
- Colin Wyers showed that when you get out of the season-to-season comparison, which can have a great deal of random variation, ERA begins to perform the best at predicting itself.

My goal is to not re-hash these studies, but instead to delve into what happened just this season.

The three studies I mentioned above looked at large(ish) sample sizes: year-to-year data or bigger. Typically, when we study baseball statistics we look for a large sample size. Because there is so much random variation and noise in baseball, it’s tough to get a full picture of what truly happened when dealing with a smaller sample. In many instances, one season of data isn’t a large enough sample for some statistics, which might sound crazy to some, but it’s actually true.

### The idea

Writers in the sabermetric community, myself included, talk about ERA estimators during the season fairly frequently. For example, this made-up quote would be a fairly common thing to read on sabermetric websites in the middle months of the season:

Pitcher X’s ERA (2.50) is much lower than his xFIP (4.50). This result indicates that Pitcher X has probably been lucky, and his ERA will regress back closer to his xFIP as the season goes forward.

That idea is fairly commonly accepted. If a pitcher’s xFIP, FIP and SIERA are significantly above or below his current ERA, then the assumption for most is that his ERA will move back either up or down toward those numbers.

The original goal of this post was to split the season in half and look at how the ERA estimators have done in terms of predicting ERA and runs allowed per nine innings for the second half of the season. Essentially, I agreed with the commonly accepted idea that a pitcher’s ERA estimators at the midpoint of the season were better indicators of where his ERA was trending than his actual ERA.

I thought that although half a season of baseball is a small (random) sample size, it is still valuable for teams to know what they should expect from their pitchers in the second half. This information would be useful in certain midseason decisions front offices have to make. Some examples would be:

- Sending players down to the minors
- Moving a pitcher out of the rotation
- Deciding if your team is a contender
- Deciding which players to trade away or which players to target in a trade

A quick example of this idea comes from the comparison between the Angels acquiring Zack Greinke and the Rangers acquiring Ryan Dempster at this season’s trading deadline.

At that time, Greinke had an ERA of 3.44, while Dempster’s ERA was 2.55. But Greinke’s xFIP was 2.82, while Dempster’s was 3.73. Thus, many predicted positive regression for Greinke’s ERA with the Angels, and negative regression for Dempster’s ERA with the Rangers.

Please note that I understand we’d expect their ERAs to fluctuate somewhat anyway after the trade. Both players were changing leagues and ballparks, and would have different defenses playing behind them. At the same time, had both those pitchers stayed with their original ball clubs, the assumption that Greinke would have positive regression and Dempster would have negative regression would still likely have been the consensus.
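The reasoning in this example boils down to one subtraction per pitcher. Here is a minimal Python sketch of it, using only the ERA and xFIP figures quoted above; the function name and the wording of the printed message are my own illustration, not part of any projection system.

```python
# Deadline-time figures quoted in the article.
pitchers = {
    "Zack Greinke": {"era": 3.44, "xfip": 2.82},
    "Ryan Dempster": {"era": 2.55, "xfip": 3.73},
}


def era_minus_xfip(era: float, xfip: float) -> float:
    """Gap between results and peripherals; positive means ERA sits above xFIP."""
    return round(era - xfip, 2)


gaps = {name: era_minus_xfip(p["era"], p["xfip"]) for name, p in pitchers.items()}

for name, gap in gaps.items():
    # Conventional reading: ERA is expected to drift toward xFIP.
    direction = "lower" if gap > 0 else "higher"
    print(f"{name}: ERA - xFIP = {gap:+.2f}, so the usual expectation is a {direction} ERA")
```

A positive gap corresponds to what the article calls positive regression (ERA expected to fall), and a negative gap to negative regression.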

### The study

For this study I used July 1 as the cutoff point. Then I looked only at starting pitchers who had at least 50 innings pitched before July 1 and at least 45 innings pitched after that date.

I found the ERA, FIP, xFIP, SIERA and tERA for each qualifying pitcher from the beginning of the season to July 1, then regressed those numbers against their runs allowed per nine innings (RA9) and ERA for the second half of the season (July 1-Sept. 16). I also added an extremely simple baseline as another predictor: strikeouts minus walks, divided by innings pitched, written (K-BB)/IP in the tables and K-BB/IP in the text. Interestingly, exactly 100 starters qualified for the sample.

Also, please note that although 50 and 45 respectively were the minimum number of innings, the average number of innings thrown before July 1 for the sample was 92 innings, and the average number thrown after July 1 was 81 innings. So a good portion of these numbers are based on close to 100 innings, which is still not a great sample, but at least feels a lot better than 45-50 innings.
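For readers who want to reproduce the sample selection, the innings cutoffs and the baseline predictor described above can be sketched in a few lines of Python. The record layout, field names and the three example rows are hypothetical (they are not real 2012 pitchers), and innings are assumed to be stored as true decimals rather than in 50.1-style notation.

```python
from dataclasses import dataclass


@dataclass
class SplitLine:
    """One starter's split-season line (all field names are hypothetical)."""
    name: str
    ip_before: float   # innings before July 1, as a true decimal
    ip_after: float    # innings from July 1 on
    strikeouts: int    # first-half strikeouts
    walks: int         # first-half walks, intentional walks included


MIN_IP_BEFORE = 50.0
MIN_IP_AFTER = 45.0


def qualifies(p: SplitLine) -> bool:
    """Apply the article's innings cutoffs on both sides of July 1."""
    return p.ip_before >= MIN_IP_BEFORE and p.ip_after >= MIN_IP_AFTER


def k_minus_bb_per_ip(p: SplitLine) -> float:
    """Baseline predictor: (strikeouts - walks) / innings pitched, first half only."""
    return (p.strikeouts - p.walks) / p.ip_before


# Made-up rows for illustration only.
pitchers = [
    SplitLine("Pitcher A", 92.0, 81.0, 80, 25),
    SplitLine("Pitcher B", 48.0, 70.0, 40, 20),   # too few innings before July 1
    SplitLine("Pitcher C", 60.0, 40.0, 55, 15),   # too few innings after July 1
]

sample = [p for p in pitchers if qualifies(p)]
for p in sample:
    print(p.name, round(k_minus_bb_per_ip(p), 3))
```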

### The results

First, I ran a simple linear regression for each predictor against the pitcher’s second-half runs allowed (RA9). In the tables below, I list both the r-squared and the root mean squared error (RMSE) for each predictor in the sample.

For those who aren’t statistically savvy, r-squared shows the percent of variation in what we are trying to predict (RA9) that is explained by our predictor (ERA, xFIP, etc.). A higher r-squared shows a stronger relationship between the predictor and outcome.

The root mean squared error shows roughly how far, on average, our prediction is from the actual outcome; thus, a lower number shows a stronger relationship.
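To make the two measures concrete, here is a small standard-library Python sketch of a one-predictor regression that returns both r-squared and RMSE. It is an illustration under assumptions, not the procedure actually used for this article: the input numbers are made up, and the RMSE here divides the squared residuals by n, whereas a statistics package may divide by n minus 2 instead.

```python
from math import sqrt


def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x; returns fit and error summaries."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)           # variation left unexplained
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)     # total variation in the outcome

    r_squared = 1 - ss_res / ss_tot
    rmse = sqrt(ss_res / n)                          # assumption: divide by n, not n - 2
    return slope, intercept, r_squared, rmse


# Made-up first-half (K-BB)/IP values and second-half RA9 values, for illustration only.
first_half_kbb = [0.20, 0.40, 0.50, 0.70, 0.90]
second_half_ra9 = [5.1, 4.0, 4.6, 3.2, 3.9]

slope, intercept, r2, rmse = simple_linear_regression(first_half_kbb, second_half_ra9)
print(f"r-squared = {r2:.3f}, RMSE = {rmse:.3f}")
```

An r-squared of 0.09 from a fit like this would mean that 9 percent of the variation in second-half RA9 is accounted for by the predictor, with the rest left unexplained.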

Here are the RA9 single regression results:

Predictor | R-Squared | RMSE |
---|---|---|
(K-BB)/IP | 9.14% | 1.207 |
SIERA | 6.19% | 1.246 |
xFIP | 4.65% | 1.267 |
FIP | 2.92% | 1.290 |
ERA | 1.86% | 1.304 |
tERA | 0.43% | 1.343 |

RA9 is a better statistic than ERA, but, as I noted from the outset, these metrics are supposed to be ERA estimators, not RA9 estimators (for better or worse).

This is most likely why we see a near-zero r-squared for tERA, because it is scaled on purpose to predict ERA, instead of RA9.

So I ran simple linear regression for the predictors against ERA, as well:

Predictor | R-Squared | RMSE |
---|---|---|
(K-BB)/IP | 8.84% | 1.092 |
SIERA | 5.99% | 1.127 |
xFIP | 4.48% | 1.145 |
tERA | 3.04% | 1.162 |
FIP | 2.42% | 1.170 |
ERA | 1.45% | 1.185 |

These numbers jibe fairly well with the single-season results from the three studies I referred to at the outset of the article.

The most shocking result is that for both tests, the predictor with the highest r-squared and the lowest RMSE was the simple baseline of strikeouts minus walks divided by innings pitched.

In Swartz’ study, the second-best predictor of Year 2 ERA, behind SIERA, was a statistic known as kwERA (strikeout-to-walk ERA), which uses only strikeouts and walks. I actually considered kwERA for my baseline, as it does a better job of actually weighting the value of strikeouts and walks, and is already on an ERA scale. But I wanted to keep my baseline as simple as possible, so I just used simple subtraction, and even left intentional walks in the data.

Interestingly, strikeouts minus walks still ended up being the best predictor.

Simply comparing six separate single-predictor regressions isn’t as effective an analysis as running a multiple regression that includes all six predictors at the same time. So I ran a multiple regression with all six predictors thrown in.

The first table is the SPSS readout of coefficients for the RA9 test:

RA9 predictor | B (unstandardized) | Std. Error (unstandardized) | Beta (standardized) | t-score | Sig. |
---|---|---|---|---|---|
(Constant) | 5.671 | 2.144 | | 2.645 | 0.01 |
K-BB | -2.074 | 1.292 | -0.352 | -1.605 | 0.112 |
ERA | 0.14 | 0.173 | 0.13 | 0.808 | 0.421 |
FIP | -0.229 | 0.437 | -0.171 | -0.524 | 0.602 |
xFIP | -0.018 | 1.104 | -0.009 | -0.016 | 0.987 |
tERA | 0.074 | 0.341 | 0.057 | 0.215 | 0.829 |
SIERA | -0.023 | 1.223 | -0.012 | -0.018 | 0.985 |

The second table is the SPSS readout of coefficients for the ERA test:

ERA predictor | B (unstandardized) | Std. Error (unstandardized) | Beta (standardized) | t-score | Sig. |
---|---|---|---|---|---|
(Constant) | 5.388 | 2.108 | | 2.67 | 0.009 |
K-BB | -2.015 | 1.216 | -0.363 | -1.657 | 0.101 |
ERA | 0.156 | 0.163 | 0.153 | 0.856 | 0.342 |
FIP | -0.332 | 0.412 | -0.263 | -0.807 | 0.422 |
xFIP | 0.002 | 1.039 | 0.001 | 0.002 | 0.998 |
tERA | 0.099 | 0.321 | 0.081 | 0.308 | 0.758 |
SIERA | 0.004 | 1.151 | 0.002 | 0.003 | 0.997 |

The column we want to look at here is titled “Sig.” This column tells the statistical significance of each predictor. For most tests, a predictor becomes statistically significant once the value goes below 0.05. As you can see from both of these results, none of the predictors are statistically significant; strikeouts minus walks comes the closest.
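The “Sig.” values come from the t-scores, and each t-score is just the coefficient B divided by its standard error. The short Python check below recomputes that ratio for the six predictor rows of the RA9 table; the variable names are mine, and small differences from the printed t-scores are expected because B and Std. Error are themselves rounded in the readout.

```python
# (B, Std. Error, reported t-score) for each predictor in the RA9 coefficient table.
ra9_rows = {
    "K-BB": (-2.074, 1.292, -1.605),
    "ERA": (0.14, 0.173, 0.808),
    "FIP": (-0.229, 0.437, -0.524),
    "xFIP": (-0.018, 1.104, -0.016),
    "tERA": (0.074, 0.341, 0.215),
    "SIERA": (-0.023, 1.223, -0.018),
}

# t-score = coefficient / standard error of the coefficient
recomputed = {name: b / se for name, (b, se, _) in ra9_rows.items()}

for name, (b, se, reported) in ra9_rows.items():
    print(f"{name:>5}: B/SE = {recomputed[name]:+.3f}  (table: {reported:+.3f})")
```

With a sample of this size, a t-score needs an absolute value of roughly 2 to reach the 0.05 level, and none of the six rows gets there.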

I found that putting all of the predictors together did not really improve the r-squared we found from just using K-BB/IP:

Outcome | Multiple regression r^2 | K-BB/IP r^2 |
---|---|---|
RA9 | 10.10% | 9.1% |
ERA | 10.40% | 8.8% |

I also found that K-BB/IP was a statistically significant predictor on its own, but when the other predictors were added it no longer was statistically significant. This is most likely due to a degrees of freedom issue (sample size of 100 with six predictors), but as I’ve already got into too much statistical jargon, I’ll just leave that be.
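One standard way to judge whether six predictors really beat one is the adjusted r-squared, which penalizes a model for every predictor it adds. The Python sketch below applies the textbook formula to the ERA r-squared values reported above; the adjusted figures it prints are my own arithmetic, not numbers taken from the SPSS output.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Textbook adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_PITCHERS = 100  # starters in the sample

# ERA test: all six predictors together vs. (K-BB)/IP on its own.
full_model = adjusted_r_squared(0.1040, N_PITCHERS, n_predictors=6)
kbb_only = adjusted_r_squared(0.0884, N_PITCHERS, n_predictors=1)

print(f"six predictors: adjusted r^2 = {full_model:.3f}")
print(f"(K-BB)/IP only: adjusted r^2 = {kbb_only:.3f}")
```

Once the penalty is applied, the six-predictor model comes out below the single-predictor baseline, which fits the observation that throwing everything in did not really improve on K-BB/IP.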

Of the 100 pitchers in the sample, 13 changed teams at some point during this season. As I noted with the Greinke/Dempster comparison earlier, this could have an effect on the results. Future ERAs could fluctuate when a pitcher changes leagues, teams and home ballparks. So I checked to see how removing those pitchers would affect the results.

Below, I listed the r-squareds for the predictors for the 87 pitchers who have stayed with the same team all season:

Predictor | ERA r^2 | RA9 r^2 |
---|---|---|
(K-BB)/IP | 13.01% | 12.65% |
SIERA | 8.40% | 8.18% |
xFIP | 6.87% | 6.72% |
tERA | 7.28% | 0.75% |
FIP | 5.17% | 5.48% |
ERA | 1.70% | 2.30% |

Removing the 13 starters who changed teams improved the overall r-squareds slightly, but did not really change the two orders we saw with the original sample that included those starters.

### Putting it all together

The number of tables and tests I just went through was probably exhausting, but I think it was pretty meaningful.

Most of these statistics become more meaningful as the sample size grows larger. You could classify all this information as simply small sample size noise. I’m looking at less than one season’s worth of data, for just 100 starters (or only 87 if you prefer those numbers). There’s a lot to be said for that argument.

ERA and RA9 in general are subject to a good deal of random variation and noise. These predictors were regressed against a sample of ERAs and RA9s that came from between 50.1 and 102 innings pitched. I think there’s a possibility that this analysis could be run again with the numbers from 2011, and we’d see a different predictor come out on top, solely because of that noise.

At the same time, I think these results should be taken as both a lesson and a cautionary tale. The ERA estimators that were tested (xFIP, FIP, SIERA and tERA) all did a better job of predicting future ERA than actual ERA did, which was to be expected and is the normal assumption in the sabermetric community. But although they did better than ERA, simply subtracting walks from strikeouts did a better job of predicting ERAs for the second half than any of the advanced statistics.

I’m not trying to say that we should move away from FIP and other ERA estimators and simply use strikeouts and walks to attempt to predict how many runs a pitcher will give up in the future.

The highest r-squared (0.13055) I found came from K-BB/IP in the 87-pitcher sample. That number tells us that more than 86 percent of the variation in second-half ERA was left unexplained by the predictor, which isn’t very good at all.

Instead, my point is that maybe we shouldn’t even be using the results of the first half to attempt to predict ERAs for the second half of the season.

For example, before July 1, Kyle Lohse’s ERA was 2.82, but his xFIP was 4.19. The normal assumption would be that Lohse had been lucky, and that we should trust his xFIP and assume that his ERA would regress negatively in the second half.

His post-July 1 ERA is 2.81, essentially the same as it was during the first half. This is an extreme example, but I think it is something to learn from.

Maybe too often those in the sabermetric community simply assume that pitchers will regress toward their peripherals as the season goes on. But most of the time that regression doesn’t have time to occur in just half of a season.

Those who have read about sabermetrics long enough are probably sick of the phrase small sample size (SSS!!!). But, I think people who write about sabermetrics still fall prey to small sample sizes. I did when I began the idea for this article. I simply assumed that the ERA estimators from the first half would have a pretty strong correlation to second half ERA and RA9 numbers, and I was ready to write about which had been doing the best job this season. Then I found the results and realized that none had been doing well. And not only that, but something as simple as subtracting walks from strikeouts did better.

Therein lies the rub. In small samples, baseball statistics are still very unpredictable, even when using the most “advanced metrics” that were created to predict them.

So, next June when a starting pitcher has an ERA over five, but a SIERA in the mid-threes, please be wary of assuming that his ERA will regress over the next three months of the season.

**References & Resources**

All statistics come courtesy of FanGraphs and are updated through Sunday, Sept. 16.

aweb said...

Throwing all of the very strongly related variables into a single regression of course kills the significance. It’s not a degrees of freedom thing, it’s a multicollinearity thing. If you want to stick with simple regression, you simply can’t feed the variables in like that.

Also, the Adjusted R-Sq result is more meaningful in cases like this to test whether the model has improved (even a random noise variable will marginally increase R-squared most of the time.)

Glenn DuPaul said...

@aweb

I agree. I was going to get into how all of the main predictors had strikeouts and walks as main components, which killed the significance. Using six predictors in a sample of 100 isn’t good form either, though.

To note with the adjusted R^2, when all six are thrown in, the adjusted R^2 was .046 vs. the .104 for the ERA test. So the model really didn’t improve with all of the predictors in, if you consider the adjusted number to be more meaningful

Mark said...

Glenn,

Nice article and even nicer idea to study this! I like your statement “But most of the time that regression doesn’t have time to occur in just half of a season,” as it leads to another interesting idea: studying the likelihood of all starting pitchers with large gaps between ERA and FIP by July 1 continuing to have good or bad seasons.

Another similar idea about ERA and FIP I was working through relates to career ERA vs career FIP.

Jered Weaver (for example) has pitched 120+ innings for the past 7 years and has had a lower ERA than his FIP (and xFIP) 6 out of 7 times. Are we to say that he has been lucky 6 out of his 7 years as a MLB pitcher (spanning 1300+ innings), or that we should have expected his ERA to regress towards .4 less than his FIP? Career ERA is 3.24 vs career FIP 3.65.

Kyle Lohse (your example) has pitched 90 + innings for the past 12 years and has had a lower ERA than his FIP just 5 times – not surprisingly his career ERA is higher than his career FIP. Does this mean he is “less” lucky than a guy like Weaver or should we have expected his ERA to regress toward his actual FIP from July 1st forward? Career ERA is 4.44 vs career FIP 4.34

obsessivegiantscompulsive said...

I’ve only skimmed part of the article, but I can say that this is a great article and I’ve been noticing your name in articles I read and enjoy, thanks and keep up the good work!

What I use to see where a pitcher might be headed in the second half is BABIP, and regression to their career mean. Nothing definitive because I don’t have the smarts to do that type of deep analysis, but I think that is valid enough.

I find it interesting that you found K-BB/IP to be a better estimator. Shandler’s Baseball Forecaster books turned me on to using K/BB ratio for rating pitchers and I have found that to be good. That is similar to K-BB, so I wonder how that would work as a predictor.

Again, great job, looking forward to seeing your next article.

obsessivegiantscompulsive said...

Matt Cain has been the poster boy for pitchers whose ERA is much better than their FIPs and xFIPs, as his situation has been discussed numerous times on Fangraphs previously. 7 seasons of 190+ IP, ERA better than FIP 6 of 7, 7 of 7 for xFIP. Career 3.30 ERA, 3.65 FIP, 4.21 xFIP, 3.66 tERA, 4.13 SIERA.

This is because he is one of those rare pitchers whose BABIP is statistically significantly under the .300 mean most pitchers regress to, .265 BABIP for his career, and even though in his career, he has rarely had a season above .270 BABIP, ZIPS predicts that he will end up over .270 BABIP despite his BABIP being only .265 right now with only 3 starts left.

And that’s a problem with all advanced sabermetric metrics that I’ve seen, they are all derived from DIPS, which is great, but then there is a whole class of pitchers who pitch great that is denigrated as likely to regress to the mean for the 6-7 seasons it takes a starting pitcher to compile enough IP to say that he is statistically significantly below the .300 BABIP mean.

Glenn DuPaul said...

@Mark

My opinion is that Lohse has been luckier than Weaver, because Weaver is a guy who has shown that he can outpitch his FIP consistently, while Lohse has not. So we’d expect Lohse to regress more, although over half of a season that did not occur.

@ogc

I agree that pitchers like Matt Cain, Greg Maddux, Barry Zito and others who can consistently outpitch their peripherals, make advanced saber-stats less useful for those certain pitchers.

Thanks a bunch for the compliments, both of you

Mister Met said...

I love sabermetrics, and this article is great. I’m really glad you tackled this question, Glenn, and the results confirmed my suspicion, as something I’ve been yelling about for a while. These are good estimators, but let’s remember what FIP stands for: Fielding Independent Pitching! It shouldn’t come as a surprise that it (and other estimators) perform poorly within a season; most of the time, the pitcher is pitching in front of the same defense. Obviously there are other factors at play as well, but too often, people have been using these tools to suggest future results without strong premises for doing so.

studes said...

“Maybe too often those in the sabermetric community simply assume that pitchers will regress toward their peripherals as the season goes on. But most of the time that regression doesn’t have time to occur in just half of a season.”

Hey Glenn, great article. But I don’t understand this comment. You’ve just shown that pitchers do regress toward the mean of their peripherals in the second half of the season. Perhaps they don’t regress as far as you’d like, but the regression occurs and it’s real, isn’t it?

David P Stokes said...

Almost all stats are designed to tell us what happened, not what will happen. A stat will be useful as a projection of what will happen largely to the extent that it measures skill rather than performance, and even then only if the skill level remains constant. That’s a big reason why the simple metric of strikeouts and walks performed so well in this study: K’s and BB’s are more directly tied to a pitcher’s actual skills than almost anything else. It’s also why those 2 stats are so prominently used in advanced pitching metrics.

Glenn DuPaul said...

@studes

The point you make is something I was wrestling with after finding the results.

My first thought is that this is just looking at 100 pitchers from 2012, and that I have no idea if the results would be duplicated if we looked at 2011, ‘10 or any other year.

My second thought is that you’re right. Pitchers do regress towards their peripherals in the second half more than they, I guess, continue to perform at the same level in terms of ERA.

My third thought is that if I showed both that they do regress and that the regression is real, maybe I shouldn’t have said that regression doesn’t have time to occur in just half of a season.

This third thought for me is the key. Maybe the numbers didn’t regress as much as I would’ve liked (or expected) them to, but at the same time, simply subtracting walks from strikeouts did significantly better.

And that fact makes me nervous about simply assuming that starters will regress towards their peripherals (or advanced statistics) and makes me question whether or not we should just use strikeouts and walks or not even attempt to predict 2nd half ERA with just one statistic.

I think you make a great point and I hope I answered it clearly with my current opinion on the subject, which is still very much up in the air.

Mike said...

I think an interesting question would be how much do pitchers regress towards either their career peripherals/ERA estimators or towards forecasting systems like Zips, etc. Presumably, those would have much higher relationships with the rest of season performance. What I got out of your piece is that even with faster stabilizing peripherals we can still fall victim to small sample size. But if career numbers/projections have similarly low r-squared values, that would imply we know a lot less about pitcher performance than we think.

studes said...

Actually, that was kind of a confusing answer. I think the simpler version is that, within split half-seasons, the evidence is that pitchers will regress toward their peripherals—particularly strikeouts and walks. However, ERA/RA is still largely a random thing in the short run, and the regression isn’t particularly strong.

Glenn DuPaul said...

You summed up my point in a much clearer way than I did. And I think you’re probably right, that I should’ve stayed away from saying the “regression doesn’t have time to occur within the season”

philosofool said...

“That idea is fairly commonly accepted. If a pitcher’s xFIP, FIP and SIERA are significantly above or below his current ERA, then the assumption for most is that his ERA will move back either up or down toward those numbers.”

While I think most of us accept this, it doesn’t capture the underlying logic well. What I believe about a pitcher’s ERA in the second half is that it will be very close to his estimators *in that half.* Therefore, to the degree that I think his current peripherals project his future peripherals, I will accept his current estimators as good future estimators. Since I accept that first half peripherals are okay projections of his second half peripherals, as a first approximation, I believe what you say, that “If a pitcher’s xFIP, FIP and SIERA are significantly above or below his current ERA, then the assumption for most is that his ERA will move back either up or down toward those numbers.”

Todd said...

I see your Kyle Lohse and raise you Max Scherzer. Sometimes it does work exactly the way simple sabermetrics says it should =)