Occam’s Razor and pitching statistics

What is Occam’s Razor?

According to the Occam’s Razor Wikipedia page:

It is a principle urging one to select from among competing hypotheses that which makes the fewest assumptions.

The razor asserts that one should proceed to simpler theories until simplicity can be traded for greater explanatory power.

Occam’s Razor is an influential principle that can be applied to a vast array of complex topics, such as religion, outer space, science, economics and much more. Yet the phrase “Occam’s Razor” was also the conclusion of Tom Tango’s response to my last THT article.

Why would Tango bring up Occam’s Razor in a discussion of baseball?

Well, for those who missed it, that article was mostly an analysis of baseball statistics, a subject to which Occam’s Razor is quite applicable.

In that article, I looked into how well ERA (earned run average) estimators were performing this season. I gathered a sample of 100 starting pitchers who threw at least 50 innings before July 1 and at least 45 innings after that date. I then ran linear regressions to test which ERA estimator had done the best job of predicting second-half ERAs this season.

Here’s a brief table of the results of a simple linear regression of the predictors against runs against per nine innings (RA9):

Predictor    R-Squared   RMSE
(K-BB)/IP    9.14%       1.207
SIERA        6.19%       1.246
xFIP         4.65%       1.267
FIP          2.92%       1.290
ERA          1.86%       1.304
tERA         0.43%       1.343

The results of that article brought about two surprising conclusions:

(1) I expected the advanced ERA estimators, xFIP, SIERA, tERA and FIP, to have a much higher correlation to second half RA9s than they actually did.

(2) I did not expect the simple baseline of strikeouts minus walks over innings pitched to be the most predictive of how many runs a pitcher would give up.

FIP (or fielding independent pitching) was the main reason I originally became interested in sabermetrics. For that reason (among others), I had always been a big advocate of ERA estimators. Thus, the second conclusion was rather confusing for me.

But the more I thought about it, the more I realized that maybe I shouldn’t have been confused at all.

FIP is essentially the same thing as strikeouts minus walks over innings pitched; it just also happens to include home runs. The other estimators also are made up, in large part, by strikeouts and walks.

In that study, the advanced ERA estimators fell short of strikeouts and walks, despite adding in other components that are supposed to add “greater explanatory value.” And suddenly, Occam’s Razor seems extremely relevant.

Is a simpler estimator actually better than any of the more complicated metrics?

I don’t think the original article came close to answering that question. It was a really interesting result and the results did back the “simpler” argument, but at the same time the sample was very small.

I looked at only 100 starting pitchers, and both the predictors and the outcomes came from half-seasons of 2012 data. Half of a season of baseball is a small sample size.

So, for this article, I decided to see if simply subtracting walks from strikeouts would be the most predictive again if I expanded the sample and tweaked the test slightly.

The study

For this test, I looked at how well ERA estimators would do in predicting RA9 (runs against per nine innings) for a subsequent season. I used data dating back to 2008 and looked at starting pitchers who had at least 125 innings pitched in Year X, and at least 100 innings pitched in Year X+1.

For example, I ran a linear regression to compare Zack Greinke’s FIP in 2008 to his RA9 in 2009.
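The pairing step described above can be sketched as follows. This is a minimal illustration, not the code used for the study: the `PitcherSeason` record, its field names and the sample rows are all hypothetical, and only the 125- and 100-inning thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PitcherSeason:
    # Hypothetical record layout; the field names are assumptions for illustration.
    name: str
    year: int
    innings: float
    fip: float
    ra9: float


def year_pairs(seasons, min_ip_x=125.0, min_ip_next=100.0):
    """Pair each qualifying Year X season with the same pitcher's Year X+1 season."""
    by_key = {(s.name, s.year): s for s in seasons}
    pairs = []
    for s in seasons:
        nxt = by_key.get((s.name, s.year + 1))
        if nxt is None:
            continue  # no following season on record
        if s.innings >= min_ip_x and nxt.innings >= min_ip_next:
            pairs.append((s, nxt))
    return pairs


# Made-up rows, not real statistics.
seasons = [
    PitcherSeason("Pitcher A", 2008, 200.0, 3.50, 4.00),
    PitcherSeason("Pitcher A", 2009, 210.0, 2.90, 3.10),
    PitcherSeason("Pitcher B", 2008, 130.0, 4.20, 4.60),
    PitcherSeason("Pitcher B", 2009, 80.0, 4.40, 5.00),   # too few innings in Year X+1
    PitcherSeason("Pitcher C", 2009, 150.0, 3.90, 4.10),  # no 2010 season on record
]

pairs = year_pairs(seasons)
for x, nxt in pairs:
    # Predictor comes from Year X, outcome from Year X+1.
    print(x.name, x.year, "FIP", x.fip, "->", nxt.year, "RA9", nxt.ra9)
```

Note that under this setup a pitcher can appear more than once, since each pair of consecutive qualifying seasons is a separate row.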

The predictors include:

FIP
xFIP
SIERA
(K – (BB + HBP – IBB)) / (PA – IBB), a slightly better modification of the (K-BB)/IP measure that was used in the first article
RA9
K% (strikeouts divided by batters faced)

There were 344 starting pitchers who qualified for this sample, and a simple least-squares linear regression was run for each predictor.
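As a small worked example, the two simplest predictors can be computed like this. The function names and the season line are made up for illustration, and the grouping of the formula (unintentional walks and hit batters subtracted from strikeouts, then divided by plate appearances less intentional walks) is my reading of the definition above.

```python
def k_minus_bb_rate(k, bb, hbp, ibb, pa):
    """(K - (BB + HBP - IBB)) / (PA - IBB).

    Intentional walks are removed from both the walk total and the
    plate-appearance total; hit batters are counted like walks.
    """
    return (k - (bb + hbp - ibb)) / (pa - ibb)


def k_rate(k, pa):
    """K%: strikeouts divided by batters faced."""
    return k / pa


# Hypothetical season line, not a real pitcher.
k, bb, hbp, ibb, pa = 200, 50, 5, 3, 800

print("K-BB rate:", round(k_minus_bb_rate(k, bb, hbp, ibb, pa), 4))
print("K%:", round(k_rate(k, pa), 4))
```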

The results

The three measures listed in the table below are the correlation coefficient (r), the r-squared and the root-mean-square error (RMSE) of the estimate.

The correlation coefficient and the r-squared work hand-in-hand. The correlation coefficient tells us about the strength of the linear relationship between the predictor and the outcome, while the r-squared tells us the percentage of the variation in RA9 in Year X+1 that is explained by variation in the predictor in Year X.

The RMSE of the estimate also tells us about the strength of the predictor. It works sort of like a standard deviation of the prediction errors, measured in runs per nine innings; thus, the lower the RMSE, the better the model.
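For readers who want to see how the three measures relate, here is a minimal sketch of a simple least-squares regression in plain Python. The data are invented, and the RMSE here divides the squared errors by n - 2 (the usual "standard error of the estimate"); some software divides by n instead, and I am not claiming either matches the exact figure reported in the tables.

```python
import math


def simple_linear_regression(x, y):
    """Fit y = intercept + slope * x by least squares and summarize the fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)

    # Sum of squared prediction errors around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))  # assumption: n - 2 denominator

    return {"slope": slope, "intercept": intercept,
            "r": r, "r_squared": r * r, "rmse": rmse}


# Invented Year X strikeout-minus-walk rates and Year X+1 RA9 values.
k_bb_year_x = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18]
ra9_year_x1 = [5.2, 4.9, 4.1, 4.6, 3.8, 3.5]

fit = simple_linear_regression(k_bb_year_x, ra9_year_x1)
for key, value in fit.items():
    print(key, round(value, 3))
```

In this made-up data r comes out negative, because a better strikeout-minus-walk rate goes with fewer runs allowed; the r-squared is the same regardless of the sign.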

Here are the results of the six simple linear regressions:

Predictor    r       r^2     RMSE
K-BB         .370    .137    0.862
FIP          .361    .130    0.865
K%           .351    .123    0.868
SIERA        .335    .112    0.874
RA9          .317    .100    0.880
xFIP         .312    .097    0.881

The results were pretty much in line with what I expected. All of the r-squareds and RMSEs improved from the results of the original article, because using a full season of data to predict another full season of data is much more effective than using half of a season to predict another half.

All six of the predictors were statistically significant predictors of RA9.

The Occam’s Razor principle seems to be relevant again with these results. Simply subtracting walks from strikeouts and dividing that result by the number of batters faced was the most predictive of the six predictors tested.

We could have expected this, since the last study and other studies have shown strikeouts and walks to be very predictive, but it still feels like a slightly shocking result.
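One common way to check significance in a simple regression is the t-statistic for a correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below applies it to the r values from the table with n = 344. This is an illustration of the arithmetic rather than a claim about the exact test that was run, and the function name is my own.

```python
import math


def t_statistic(r, n):
    """t-statistic for testing whether a correlation r from n pairs differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


n = 344  # sample size reported in the article
table_r = [
    ("K-BB", 0.370),
    ("FIP", 0.361),
    ("K%", 0.351),
    ("SIERA", 0.335),
    ("RA9", 0.317),
    ("xFIP", 0.312),
]

t_values = {name: t_statistic(r, n) for name, r in table_r}
for name, t in t_values.items():
    print(name, round(t, 2))
```

With 342 degrees of freedom the two-sided 5 percent cutoff is roughly 1.97, and every value here lands far above it, which is consistent with all six predictors being significant.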

It might still be hard for some to wrap their heads around the idea that strikeouts and walks were better at predicting future RA9 than any of the other advanced metrics, especially when we consider that this is in no way a perfect sample.

I’ll be the first one to admit that this sample is slightly flawed. The vast majority of the players in this sample did not change teams between the predictor and outcome seasons. The results for pitchers who throw in front of the same defense or in the same home park will be biased in favor of those metrics that are affected by defense (RA9) or home park (FIP). Despite this bias, strikeouts and walks were still more predictive than RA9, FIP or the others.

About a month ago, I wrote an article entitled “An argument for FanGraphs’ pitching WAR.” That article looked at the predictive ability of both FanGraphs’ WAR and Baseball-Reference’s WAR, but more importantly, my sample consisted only of starters who changed teams between seasons, from 2002-2011.

Interestingly, fWAR (which is FIP-based with adjustments) was more predictive than both rWAR (RA9-based with adjustments) and the metric that was most predictive in this sample (K-BB/PA).

The sample for this article stretched from 2008-2012, which is one season past the sample for the fWAR article, chronologically. Thus, using the data from the fWAR article, I could only take the sample of starters who changed teams from ’08-’11 to compare to this test. That sample includes just 49 pitchers, which is small, but the r-squareds for those pitchers are fairly interesting:

Predictor    r^2
K-BB/PA      0.175
rWAR/PA      0.081
fWAR/PA      0.116

The simple strikeouts-minus-walks predictor seems to be much more predictive than the FIP- or RA9-based advanced metrics during these seasons for starters who changed teams.

I stretched the sample back to 2005, to include 114 starters, and got this result:

Predictor    r^2
K-BB/PA      0.175
rWAR/PA      0.107
fWAR/PA      0.132
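To give a feel for how much these small samples can wobble, here is a rough sketch using the Fisher z-transformation to put an approximate 95 percent interval around a correlation. It takes the K-BB/PA r-squared of 0.175 from both tables, treats r as a positive magnitude, and assumes the usual large-sample approximation holds; none of this was part of the original analysis.

```python
import math


def fisher_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's z-transformation."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


r = math.sqrt(0.175)  # magnitude of r implied by the reported r-squared

intervals = {}
for n in (49, 114):  # the two changed-teams sample sizes from the text
    intervals[n] = fisher_interval(r, n)
    low, high = intervals[n]
    print(f"n={n}: r={r:.3f}, approx 95% interval ({low:.3f}, {high:.3f})")
```

The interval for 49 pitchers is noticeably wider than the one for 114, which is one way to see why the smaller comparison should be read with caution.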

The predictive value of strikeouts and walks fell almost completely apart for the three earliest years (2002-2004) that I tested. This caused fWAR, or more simply FIP, to end up looking like the more predictive statistic. I really have no idea why this happened, but I plan on investigating the results for my article next week.

Conclusion

The goal of this piece was to investigate whether simply taking strikeouts and subtracting walks would be the most predictive with a larger sample. Occam’s Razor seems to be applicable again, as strikeouts and walks were the most predictive.

But what if we took Occam’s Razor one step further and asked whether we should even include walks as a component in our predictor?

It feels extremely foolish to question whether walks matter. Theoretically, a pitcher who walks more batters will end up giving up more runs. It seems hard for anyone who is in tune with the game of baseball to think that issuing a lot of free passes is a good thing. But the results don’t exactly back that conclusion.

Until this point, I’ve failed to mention that strikeout percentage was third-best at predicting RA9. This metric was even simpler than (K-(BB+HBP-IBB))/(PA-IBB), as it considered only strikeouts per plate appearance.

Looking at just one metric, strikeouts, and dividing that number by how many batters the pitcher faced outperformed xFIP, SIERA and RA9. Adding in walks (or in the case of FIP, home runs and walks) only improved the prediction model slightly.

So, I tested the relationship between walks (BB / PA) and RA9. We’d expect a positive relationship between walk rates and RA9, because we assume (in almost all cases) that a pitcher who walks more batters in Year X should have a higher RA9 in Year X+1.

The relationship was positive, but extremely weak, with an r-squared lower than one percent (0.0592).

While the combination of walks and strikeouts was the most predictive, almost all of that predictive value seems to be coming from the strikeout rate.

In this test, I stripped a bunch of established ERA estimators down to the one single metric at their core, and they lost little to no accuracy.

What do we do with that conclusion?

Well, the first obvious reaction is the one that I keep coming back to: Occam’s Razor. That is to say that no matter how far we’ve come in the world of researching baseball statistics, we should always select “the hypothesis which makes the fewest assumptions.” Maybe when predicting future RA9s, looking at just strikeouts, or the combination of strikeouts and walks, would be more beneficial than any of the fancier metrics.

The second (also quite obvious) reaction is that my sample size is still extremely small. I looked only at data for starting pitchers over the course of just four seasons. I also looked at projecting only year-to-year runs allowed, which are subject to extreme amounts of random variation. Strikeouts minus walks was the most predictive, but it was in no way the “oracle of RA9 prediction,” if you will.

K-BB’s r-squared of 13.7 percent leaves a lot to be desired. And although one of the simplest metrics was the most predictive, these results reiterate the point that it’s still very difficult to predict year-to-year RA9. Also, to take the devil’s advocate case one step further, strikeouts minus walks didn’t beat the other predictors by a whole lot, and there’s a chance that a different sample, under another set of assumptions, would’ve led to different results.

Despite those words of caution, I think as of right now when it comes to pitching metrics, simpler is better.

References & Resources
Data for this article come courtesy of FanGraphs.com and Baseball-Reference.com.


Paul G.
11 years ago

It seems it would make sense that the K% would be the most important as it is a proxy for hits surrendered.  Pitchers do give up more hits than walks, usually.  By extension K% is a pretty good proxy for baserunners surrendered so walks are indirectly included to a certain extent.

rempart
11 years ago

I have gotten similar results when doing roughly the same thing for my Fantasy points league. I would also add, I have found that just K% works better than K-W for relief pitchers in the following year. As to the 13.7%, RA9 the following year would be better correlated if some stats from the previous year were considered that are not used by the run predictors. For example take LOB%, my research has shown a reasonable correlation between a rising/falling RA in one season with the next, and an inverse LOB% the following season. A starter has a 4.50 ERA one year and a strand rate of 65%. The following year this rises to 74%, and his RA drops off. Of course there is a lot of luck, but it helps explain some of what you are stating in the article. There are of course other luck elements involved like BABIP. Interesting work!