Should we ever use a complex ERA estimator?
by Glenn DuPaul, October 24, 2012
In the past few weeks I've written thousands of words about estimating runs allowed for pitchers. I've probably written more than was necessary on the subject, and it really has been a more taxing experience than I originally expected. After last week's piece on relievers, I had just about run out of ideas for testing future runs with the approach that I was using. Instead of letting go of the subject for some period of time, I reached out for suggestions as to what else I could test in the future.
Two of the most sophisticated minds in sabermetrics, Colin Wyers and Brian Cartwright, both gave me good ideas.
Colin suggested that instead of trying to re-weight fielding independent pitching (FIP), which has been my primary concern for a few weeks, I should try to estimate future strikeouts, walks and home runs (the FIP components) rather than runs.
Brian suggested that instead of using walks and strikeouts to re-weight FIP, I should try to use contact and control to create an estimator.
Both were great suggestions that I plan on getting to in the coming weeks; however, their suggestions brought about one final (hopefully) idea for testing my statistic, predictive FIP (pFIP).
The True Talent Idea
Expected Fielding Independent Pitching (xFIP) is a version of FIP that does not use the number of home runs that a pitcher actually gave up, but instead uses the number of home runs that he was expected to give up (or should have). The idea of xFIP led me to the idea of an expected predictive FIP, or x pFIP.
My idea for a regressed version of pFIP went beyond home runs, as I wanted to regress walks and strikeouts, as well.
The majority of the methodology behind this regressed statistic comes from combining the idea of xFIP with the ideas championed by Russell Carleton, Derek Carty and Harry Pavlidis. Their work dealt with the idea of "stabilization" and true talent level.
Their work, in simple terms, showed that after a certain number of plate appearances a statistic reaches a given correlation coefficient (r). For example, at an r of 0.50, we can treat half (50 percent) of a pitcher's observed rate as his skill level and regress the other half back to the league average for that statistic.
Basically, I'm trying to apply regression to the mean to each statistic (K, BB, HR) in the appropriate amount, to get a better prediction model.
The concept resulted in the use of expected strikeouts, expected walks and expected home runs as the three components of an "x pFIP".
The Study
Predictive FIP is a metric created to predict runs allowed by starting pitchers; thus, for this test, I only looked at starting pitchers.
I took a sample of starters from 2004-2012 (n=731) who had at least 100 innings pitched in Year X and at least 100 innings in Year X+1. I then regressed their strikeout percentage (K/PA), walk percentage (BB/PA) and home run percentage (HR/PA) in Year X against those percentages in Year X+1, to find r's for each statistic.
I used HR/PA instead of home run per fly ball, because I tend to stay away from using batted ball data due to the biases in that information.
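As a rough illustration of the year-to-year correlation step, here is a minimal Python sketch. The `pearson_r` helper and the five sample rates are hypothetical stand-ins made up for this example, not the FanGraphs data used in the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical K/PA for five starters in Year X and Year X+1 (illustrative only)
k_rate_year_x = [0.15, 0.18, 0.21, 0.24, 0.27]
k_rate_year_x1 = [0.16, 0.17, 0.22, 0.22, 0.26]

r_k = pearson_r(k_rate_year_x, k_rate_year_x1)
print(f"year-to-year r for K/PA: {r_k:.3f}")
```

The same function would be run once per component (K/PA, BB/PA, HR/PA) over every pitcher who met the innings minimum in both seasons.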
Here are the correlation coefficients I found:
| Statistic | r |
|---|---|
| K/PA | 0.77 |
| BB/PA | 0.68 |
| HR/PA | 0.366 |
I used these numbers to create an xK%, xBB% and xHR% to use for pFIP. For example, when finding xK% I took strikeouts' r (.77) and multiplied that number by the starter's strikeout percentage, and added that number to the league average K% for starters in that year multiplied by .23.
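The weighting described above can be sketched in a few lines of Python. The r values are the ones reported for the 2004-2012 sample; the function name and the pitcher and league rates are hypothetical, chosen only to show the arithmetic.

```python
def regress_to_mean(observed_rate, league_rate, r):
    """Weight the observed rate by r and the league average by (1 - r)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be between 0 and 1")
    return r * observed_rate + (1.0 - r) * league_rate


# Correlations reported for the 2004-2012 sample
R_K, R_BB, R_HR = 0.77, 0.68, 0.366

# Hypothetical starter: 25% strikeout rate in a league where starters average 18%
x_k_rate = regress_to_mean(observed_rate=0.25, league_rate=0.18, r=R_K)
print(f"xK% = {x_k_rate:.4f}")  # 0.77 * 0.25 + 0.23 * 0.18
```

Because HR/PA has the lowest r, it gets pulled toward the league average the most, which is the same intuition behind xFIP.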
I ran a multiple regression with these numbers against the starter's next-season runs allowed (RA9), and found this equation for x pFIP:
x pFIP = ((50*xHR) + (10*x(BB-IBB+HBP)) - (11*xK))/PA + Constant
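For readers who want to plug numbers in, here is a small Python sketch of that equation. It assumes the whole weighted sum is divided by PA, matching the layout of the second-sample equation later in the article. The constant is not given in the article, so it is left as a parameter, and the value used below is purely a placeholder.

```python
def x_pfip(x_hr, x_bb, x_k, pa, constant):
    """x pFIP with the first-sample weights: (50*xHR + 10*xBB - 11*xK) / PA + constant.

    x_bb stands for expected (BB - IBB + HBP); all x_* inputs are expected counts.
    """
    if pa <= 0:
        raise ValueError("pa must be positive")
    return (50 * x_hr + 10 * x_bb - 11 * x_k) / pa + constant


# Placeholder only: the article does not state the constant.
PLACEHOLDER_CONSTANT = 5.0

# Hypothetical starter: 800 PA, 20 expected HR, 60 expected BB-IBB+HBP, 160 expected K
value = x_pfip(x_hr=20, x_bb=60, x_k=160, pa=800, constant=PLACEHOLDER_CONSTANT)
print(f"x pFIP = {value:.2f}")
```

Since each expected count is just an expected rate times PA, the same result comes from applying the weights directly to xHR%, xBB% and xK%.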
I tested this number against my original pFIP equation, which used the raw numbers, and used r-squared (r^2) as my measure of choice. These r-squareds tell us the percentage of variation in runs allowed in Year X+1 explained by the estimator in Year X:
| Estimator | r^2 |
|---|---|
| x pFIP | 20.12% |
| pFIP | 19.74% |
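To make the comparison concrete, the sketch below shows one way an r-squared like these could be computed, assuming a simple one-variable linear fit of Year X+1 RA9 on the Year X estimator (in which case r-squared is the square of the Pearson correlation). The function name and the four data points are hypothetical and do not come from the study.

```python
def r_squared(estimates, outcomes):
    """Share of variance in outcomes explained by a one-variable linear fit on estimates."""
    n = len(estimates)
    if n != len(outcomes) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(estimates) / n
    mean_y = sum(outcomes) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(estimates, outcomes))
    sxx = sum((x - mean_x) ** 2 for x in estimates)
    syy = sum((y - mean_y) ** 2 for y in outcomes)
    # For simple linear regression, R^2 equals the squared correlation
    return (sxy * sxy) / (sxx * syy)


# Hypothetical Year X estimator values and Year X+1 RA9 (illustrative only)
estimator_year_x = [3.5, 4.0, 4.5, 5.0]
ra9_year_x1 = [3.8, 3.9, 4.9, 4.6]

r2 = r_squared(estimator_year_x, ra9_year_x1)
print(f"r^2 = {r2:.2%}")
```

Running the same function on each estimator against the same Year X+1 RA9 values gives directly comparable numbers.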
The regressed version of pFIP was more predictive than my original raw statistic; however, the difference in r-squared is less than 0.4 percentage points, which is within the margin of error.
From this sample, there was not conclusive evidence to back the assumption that the more complex, regressed, "true talent" predictor would be a better predictor of future runs than the original raw statistic.
So, I tested the x pFIP equation that I found on a sample of starters (same 100 IP minimum) from 1996-2004 (n=705) to see if x pFIP would continue to be more predictive (if only slightly) than the unregressed pFIP.
Here are the results:
| Estimator | r^2 |
|---|---|
| pFIP | 23.04% |
| x pFIP | 22.32% |
The two r-squareds were very close, but the raw pFIP estimator came out ahead of the regressed version. This, again, was not conclusive evidence for the regressed estimator being the more predictive stat, and, in fact, was evidence against it.
Before completely throwing out the idea that using regressed versions of strikeouts, walks and home runs would improve the statistic, I found the correlations between Year X and Year X+1 K, BB and HR rates for the 1996-2004 sample to see if using those numbers would improve x pFIP:
| Statistic | r |
|---|---|
| K/PA | 0.787 |
| BB/PA | 0.731 |
| HR/PA | 0.401 |
I used these r's to find new xKs, xBBs and xHRs and combined those in a multiple regression to find a new x pFIP equation based on this sample:
x pFIP = ((33*xHR) + (10*xBB) - (12*xK))/PA + Constant
This x pFIP equation resulted in an r-squared of 22.63 percent, which is only a marginal improvement over the 22.32 percent from the first-sample equation, and still falls short of the raw pFIP's 23.04 percent.
Conclusion
My attempt to regress the components of predictive FIP toward the mean, or true talent level, for the starters in these samples either offered only a negligible improvement over the original statistic or actually hurt the predictive ability of the stat.
I'll be the first to admit that this finding probably sounds irrelevant. But I think it does a great job of reflecting a larger over-arching theme.
I led off this article by saying that I've written a ton over the past month about estimating runs allowed. At the start, I had no plans of writing more than one or two articles on the subject, yet the total is now up to six. I never imagined that I would come up with my own idea for a statistic, but now I've written four articles about pFIP.
The one recurring theme that I have found to be true, through all of the different tests I've run on the different samples I've gathered, is that of simplicity.
I've yet to find a single piece of evidence to back the assumption that a complex estimator was better for predicting runs.
My first test looked at how ERA estimators performed within season for starters, and strikeouts minus walks was the best predictor. I then tested starters on a season-to-season basis, and K-BB was the best again, with FIP a close second. When I tested two seasons of work to predict the next season, basic FIP was the best predictor.
These findings led to my development of predictive FIP (pFIP), which is as simple as FIP, just with different weights.
Predictive FIP beat all of the other estimators, simple and complex, in predicting future runs for starters, over various large samples.
Finally, I tested to see how well pFIP worked for relievers. pFIP was created for starters, but it worked really well for relievers, too. However, it only brought a marginal improvement over the extremely simple statistic of strikeout percentage.
All of my tests reiterated this point about simplicity.
There's a chance that the more complex estimators work better for special circumstances, like pitchers who change teams, throw a ton of groundballs, or are remarkably adept at keeping the ball in the yard. Also, they could add descriptive value over simply using the combination of FIP and batting average on balls in play, but I have yet to hear a really great argument making that case.
What I can tell you, though, is that when a predictor is made more complex, it must add more predictive value, by a significant amount. If it does not, then my only response can be just two words:
Occam's Razor.
References and Resources
All statistics come courtesy of FanGraphs
Glenn is an Economics major at Lehigh University. He works as a Research & Development intern for Baseball Info Solutions. He also writes about sabermetrics for Beyond the Box Score. You can follow him on Twitter @Glenn_DuPaul.