Reinforcing the power of predictive FIP

In October, I introduced a metric called Predictive FIP (or pFIP for short). This metric is a slightly modified version of Tom Tango’s commonly used fielding independent pitching (FIP) statistic.

Tango’s version of FIP is meant to describe a pitcher’s performance in terms of the three true outcomes (walks, strikeouts and home runs). The FIP equation weights each of those three outcomes in a descriptive manner:

FIP = (13*HR + 3*BB – 2*K)/IP + Constant (typically ~3.20)

FIP works fairly well as a predictor of future ERA or runs allowed (RA9); thus, many use the statistic to predict, despite the fact that it is not meant to do so. A good way to think about FIP is what a pitcher’s ERA should have been, or better yet, what his ERA would be based solely on Ks, BBs and HRs. FIP is not meant to tell us what a pitcher’s ERA is going to be in the future.
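As a quick illustration, the FIP equation above can be written as a small function. This is only a sketch: the function name `fip` and the sample stat line are made up for the example, and the default constant is the typical ~3.20 quoted above rather than any particular season's value.

```python
def fip(hr, bb, k, ip, constant=3.20):
    """Descriptive FIP as written above: (13*HR + 3*BB - 2*K)/IP + constant."""
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * bb - 2 * k) / ip + constant


# Hypothetical season line (not a real pitcher): 20 HR, 50 BB, 180 K in 200 IP.
# (13*20 + 3*50 - 2*180) / 200 = 50 / 200 = 0.25, plus the 3.20 constant.
print(round(fip(hr=20, bb=50, k=180, ip=200.0), 2))  # 3.45
```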

I set out to convert FIP from its descriptive form into a predictive metric.

After a few tests and some advice, I changed some of the methodology behind FIP. First, the FIP weights and constant are meant to describe ERA; I decided to make pFIP a predictor of runs allowed per nine innings rather than ERA. Second, I made plate appearances (or batters faced) the denominator of the statistic rather than innings pitched.

The result was this equation:

pFIP = (17.5*HR + 7*BB – 9*K)/PA + Constant (typically ~5.18)

The major differences between FIP and pFIP come in the weighting of strikeouts and home runs. Strikeouts become more important when predicting future runs, while home runs become less important.
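A minimal sketch of the pFIP equation, along with the ratio of the home run weight to the strikeout weight in each formula, may make the re-weighting easier to see. The function name `pfip` and the sample stat line are hypothetical; the weights and the ~5.18 constant are the ones given above.

```python
def pfip(hr, bb, k, pa, constant=5.18):
    """Predictive FIP as written above: (17.5*HR + 7*BB - 9*K)/PA + constant."""
    if pa <= 0:
        raise ValueError("pa must be positive")
    return (17.5 * hr + 7 * bb - 9 * k) / pa + constant


# Hypothetical season line (not a real pitcher): 20 HR, 50 BB, 180 K in 820 PA.
# (17.5*20 + 7*50 - 9*180) / 820 = -920 / 820, plus the 5.18 constant.
print(round(pfip(hr=20, bb=50, k=180, pa=820), 2))  # 4.06

# How many strikeouts one home run "costs" under each set of weights.
fip_hr_per_k = 13 / 2      # 6.5 in descriptive FIP
pfip_hr_per_k = 17.5 / 9   # roughly 1.94 in pFIP
print(fip_hr_per_k, round(pfip_hr_per_k, 2))
```

Under the descriptive weights, one home run offsets 6.5 strikeouts; under the predictive weights, it offsets fewer than two.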

pFIP held up very well against other more commonly accepted “ERA estimators” (including descriptive FIP). That being said, just because something works fairly well does not mean one should not at least attempt to improve it.

A while back, I attempted to reform pFIP by regressing each of its components (Ks, BBs, HRs) to the mean. Strikeouts and walks are less volatile over one- to two-year samples; thus, their regression was not nearly as significant as the regression for home runs. Interestingly, regressing the components to the mean did not improve the metric.

My next idea to improve pFIP was to focus only on the home run component of the statistic.

Dave Studeman, the leader of the Hardball Times, converted Tango’s FIP into a version known as expected fielding independent pitching (xFIP).

According to the THT Glossary, xFIP is:

An experimental stat that adjusts FIP and “normalizes” the home run component. Research has shown that home runs allowed are pretty much a function of fly balls allowed and home park, so xFIP is based on the average number of home runs allowed per outfield fly. Theoretically, this should be a better predictor of a pitcher’s future ERA.

The FanGraphs Sabermetrics Library explains how xFIP is calculated:

(xFIP) is calculated in the same way as FIP, except it replaces a pitcher’s home run total with an estimate of how many home runs he should have allowed. This estimate is calculated by taking the league-average home run to fly ball rate (~9-10 percent depending on the year) and multiplying it by a pitcher’s fly ball rate.

Over most small-to-medium samples, xFIP is a better predictor of future ERA than FIP; thus, I decided to apply this concept to pFIP.

xFIP simply inserts the expected number of home runs directly into the FIP equation:

xFIP = ((13*(FB% * League-average HR/FB rate))+(3*(BB+HBP))-(2*K))/IP + constant
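The substitution can be sketched in code. One assumption to flag: here the "FB%" term is read as the pitcher's count of fly balls allowed, so that multiplying by the league-average HR/FB rate yields an expected home run count. The function name `xfip`, the default 10 percent rate and the sample line are illustrative only.

```python
def xfip(fly_balls, bb, hbp, k, ip, league_hr_fb=0.10, constant=3.20):
    """xFIP sketch: FIP with actual HR replaced by expected HR.

    Assumption: expected HR = fly balls allowed * league-average HR/FB rate.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    expected_hr = fly_balls * league_hr_fb
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season line: 200 fly balls, 45 BB, 5 HBP, 180 K in 200 IP.
# Expected HR = 200 * 0.10 = 20, regardless of how many actually left the park.
print(round(xfip(fly_balls=200, bb=45, hbp=5, k=180, ip=200.0), 2))
```

Two pitchers with the same fly ball, walk and strikeout totals get the same xFIP, no matter how many home runs each actually allowed.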

I decided against inserting the expected number of home runs into the pFIP equation with its current weights.

An attempt to contrive an xpFIP

I took a sample of starting pitchers who had at least 100 innings in Year X and at least 100 innings in Year X+1 for the years 2007-12 (n = 479).

Then, I ran a multiple regression with strikeouts, walks and flyball percentage times the league average HR/FB in Year X against RA9 for each starter in Year X+1. This regression resulted in this regressed or xpFIP equation:

xpFIP = ((5*FB%*League-average HR/FB rate) + (9*BB) – (9*SO))/PA + constant**

**In this case, the constant was 5.23.
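For readers curious about the mechanics, here is a rough sketch of a multiple regression of this kind: an ordinary least squares fit with an intercept, solved through the normal equations. Everything in it is illustrative. The helper names `solve` and `ols` are made up, the data are synthetic, and the "true" coefficients are arbitrary values chosen so the fit can be checked. They are not the article's data or results.

```python
import random


def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ols(rows, y):
    """Least squares fit with intercept; returns [intercept, b1, b2, ...]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic Year X inputs per plate appearance: expected HR/PA, BB/PA, K/PA.
rng = random.Random(0)
rows = [(rng.uniform(0.015, 0.035), rng.uniform(0.04, 0.12), rng.uniform(0.12, 0.30))
        for _ in range(50)]

# Arbitrary made-up coefficients, used only to generate a noiseless "Year X+1 RA9".
TRUE = [5.0, 6.0, 8.0, -10.0]
y = [TRUE[0] + sum(c * v for c, v in zip(TRUE[1:], r)) for r in rows]

coef = ols(rows, y)
print([round(c, 4) for c in coef])
```

With noiseless synthetic data the fit simply recovers the coefficients that generated it. With real pitcher seasons the fitted weights and intercept would play the role of the equation's coefficients and constant.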

Once the home run total is estimated rather than observed, its coefficient drops to only about half the weight of Ks and BBs, as opposed to roughly twice their weight in the original pFIP equation.

Then, this xpFIP equation was tested against these other ERA estimators:

pFIP
FIP
xFIP
kwERA
SIERA

I ran a linear regression, on the same sample, between each starter’s ERA estimator in Year X and his RA9 in Year X+1.

I used r-squared as the measure of the predictive value of each estimator, and found these results:

Predictor r^2
pFIP 18.50%
xpFIP 17.78%
kwERA 17.73%
SIERA 15.63%
FIP 15.33%
xFIP 14.82%

This new xpFIP equation did fairly well, beating almost all of the other estimators tested. However, regressing the home run component hurt the predictive ability of the original pFIP, which was the strongest predictor.
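The r-squared figures in a test like this come from a simple one-variable regression of Year X+1 RA9 on the Year X estimator, which is equivalent to squaring the correlation between the two. A minimal sketch, with a hypothetical function name and toy numbers that are not drawn from the actual sample:

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return (sxy * sxy) / (sxx * syy)


# Toy numbers for illustration only: an estimator in Year X and RA9 in Year X+1.
estimator_year_x = [3.4, 3.9, 4.1, 4.6, 5.0]
ra9_year_x1 = [3.9, 3.6, 4.8, 4.3, 5.2]
print(round(r_squared(estimator_year_x, ra9_year_x1), 3))
```

Running the same function once per estimator, against the same Year X+1 RA9 values, would produce a ranking like the table above.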

Before scrapping the idea of regressed home runs in pFIP completely, I tested the equation on a different sample. I used the same minimum requirements (100 IP) and the same estimators and ran the same linear regression for the years 2002-07 and found these results:

Predictor r^2
pFIP 19.19%
SIERA 16.56%
FIP 16.33%
xpFIP 16.03%
kwERA 15.79%
xFIP 15.29%

The xpFIP equation did not predict future RA9 nearly as well for this sample. My original pFIP equation did significantly better than the other ERA estimators at predicting future RA9.

Why does the pFIP with a regressed home run component do worse than the non-regressed pFIP?

It’s interesting that the statistic that uses actual home runs is more predictive than the regressed version, despite the random variation that affects home run numbers.

My best guess for the reason behind this finding has to do with survivor bias. It has been shown that some pitchers have the ability to suppress home runs and consistently have lower than average HR/FB rates. I think it is entirely possible that a fair number of pitchers who are allowed to throw 200+ innings over the course of two seasons have some ability to control their home run rates.

Also, there is the issue of park factors. The majority of these players did not change teams during the span of two seasons. It makes intuitive sense that a pitcher who made half of his starts in a park that suppressed home runs would have a lower than average home run rate over those two seasons, and vice versa for a pitcher in a home run-friendly park.

I think it’s well within the realm of possibility that regressing the home run component of pFIP would benefit the statistic when looking at pitchers who change teams between Year X and Year X+1.

pFIP vs. ZIPS

At this point, I’m pretty confident in the strength of pFIP as a predictor.

However, I had always simply assumed that projection systems were more useful, as they consider many more factors than just the three true outcomes when attempting to project future runs for pitchers. (Although, this Matt Swartz article caused me to be a little uncertain about that opinion.)

So, mainly for fun, I compared pFIP’s RA9 projections for last year (2012) to the RA9 projections of the popular ZIPS projection system.

First, I looked at a sample of every pitcher who threw at least 100 innings in 2011 and at least one inning in 2012 (n=137) and compared how well each system (or metric) did at projecting future RA9:

Predictor r^2
pFIP 17.72%
ZIPS 14.65%

Much to my surprise, pFIP explained over three percentage points more of the variation in RA9 than ZIPS. However, my minimum inning threshold for 2012 (one!!) was admittedly silly.

Thus, to eliminate some outliers and converted relievers, I set the minimum threshold in 2012 to be at least five games started in the season (n=118). I found these results:

Predictor r^2
pFIP 19.84%
ZIPS 17.20%

This change improved the predictive ability of both systems, and closed the gap slightly between pFIP and ZIPS. Interestingly though, pFIP still came out ahead of the much more sophisticated system.

This is very obviously a small sample. I looked at starting pitchers in only one season; thus, it could have been pure luck that pFIP was a better predictor of future runs than ZIPS. Also (and more importantly) ZIPS and other projection systems are built to predict many more factors (IP, GS, Ks, BBs, etc.) than just runs.

At the same time, I think these two short studies (regressing home runs and comparing to ZIPS) do a fair job of reinforcing the strength of this simple predictive re-weighting of the FIP equation.

References & Resources
All data comes courtesy of FanGraphs


14 Comments
Barry
11 years ago

Glenn,

One question on the ZIPS comparison. Were any 2012 data used in determining the coefficients in the pFIP model used in the comparison?

Glenn DuPaul
11 years ago

When I determined the coefficients I used 1996-2012, so short answer is yes. 

That’s part of why I admitted the comparison was more fun than anything else.  At the same time, over those years tested the coefficients were fairly stable, so if 2012 wasn’t included the pFIP model would be almost exactly (if not exactly) the same. 

It will be interesting to see how pFIP holds up against other projection models for the 2013 season.

dcs
11 years ago

Looking at the pFIP formula, the HR weight is about twice that of BB and K. Why not simplify the formula and just use (K-2HR-BB)/PA ? There’s no big reason it has to be on the RA scale.

Glenn DuPaul
11 years ago

@dcs

The constant is what really puts pFIP on the RA9 scale.  You can use the same weights and regress to ERA and you’ll see similar results.

As for simplifying the weights, using ((2*HR + BB) – K)/PA will give you almost identical results:

07-12: The r-squared goes from 18.50 to 18.67 (the simple model improves it)
02-07: r^2 moves from 19.19 to 18.75. 

The difference is essentially negligible. The more complicated weights are meant to improve the model slightly, as home runs aren’t exactly twice as important as Ks and Ks are slightly more important than walks. 

However, I think your suggestion is a good thing to think about.  When projecting future ERA (or RA9) you can simply consider home runs/PA to be twice as important as K% and BB%. 

FIP can be considered in the same way.  A simple FIP formula would be ((7*HR + 1.5*BB) – K)/IP.  The exact weights only slightly improve the descriptive qualities of the model.

obsessivegiantscompulsive
11 years ago

The more I see these articles defending FIP and related matters, the more I find your research interesting.  Nice job!

My stats is very rusty, so maybe you can explain if I understand things correctly.  Firstly, I understand that by your methodology, your pFIP is better than most others, some by a little, but a lot better than FIP or xFIP.  The very interesting point there is that xFIP is less descriptive, suggesting that HR/FB is not the standard that was thought.

Second, no matter which system is used, predictive ability is less than 20% (this is where my stats is rusty) predictive of the variance found.  So while pFIP might be the best of the group, as a whole, it really only explains 20% of the future ERA.  In other words, there is 80% out there somewhere that explains future ERA.

This brings up two huge fantasy baseball questions for me.

First off, why are people so vehement arguing for one pitcher vs. another when FIP and its improved variants explain less than 20%?  FIP, if I understand this low number, can indicate how good the pitcher is, but is not the final word, since it explains less than 20%, again, it seems to me with my rusty stats. 

Second off, what explains the other 80%+?  I don’t really recall anything other than fielding being responsible for that.  Assuming that is true, shouldn’t more of the fantasy discussion on all the various saber sites regarding pitchers start with the defense behind them?  Has there been any study to look into, say, a team’s FIP vs. some defensive metric, to find out what the opposite is? 

Going beyond that, there was a great study on BBTN long ago, doing a regression to find the run value of OBP and SLG by lineup positions, so I was thinking it would be interesting to see something similar to that for pitchers.  Maybe it could simply first be a function of the pitcher’s FIP and his team’s DER.  Or team’s UZR.  Or team’s DRS.  Or maybe look at all three, just to see what looks to be the best. 

I found that lineup regression to be very useful, that still seems to hold even in the reduced run environment we are now in, on a team basis, as I guess a base-out state is a base-out state, no matter how frequent or infrequent they are.  It would be great to see if such a regression holds up in reality or not for pitching.

rubesandabes
11 years ago

“The constant is what really puts pFIP on the RA9 scale.”

Yes, it is a phony number, added to another MUCH SMALLER number to somehow make the sum look more baseballish, whatever that means.

Glenn DuPaul
11 years ago

@rubes not sure what you’re trying to say? FIP is a very small number that doesn’t look baseballish until you add a constant.
That small number can tell you a lot, even before the constant is added; it can also tell you a lot more than a pitcher’s RA9 in the previous season.

rubesandbabes
11 years ago

Hi Glenn,

Consider my two posts a friendly flame on behalf of the Miguel Cabrera for MVP crowd, all the right and just Bissingerians of the baseball universe.

I am making the complete criticism of the FIP stat and your work with FIP over the fact that all results by and large closely resemble the constant, and that there really isn’t a reason for the constant. Effectively, you are comparing Zips results with the numbers 3 and 5.

Likewise plugging in FIP to calculate pitchers WAR is pretty much starting out with a fake data set. The constant IS the FIP, especially for the best guys.

I do appreciate you worked it out to multiplying the “home run component” stat by 5 instead of 13 – yes, I read the article.

I didn’t get all that 80% stuff, but you did demonstrate pretty well that the complicated predictor stats are sorta reliable work most of the time – nice read, thanks.

Just out here fighting for the right to ignore any baseball discussions talking about R squared and home run components, since the elderly couple sitting next to me, and the family sitting in front of us certainly are doing so.

Glenn, thanks again for your reply.

Glenn DuPaul
11 years ago

@ogc

Yes, no matter what statistic you use, even pFIP, over 80 percent of the variation in future RA9 is left unexplained. 

I’ve found that closing that 80 percent gap is nearly impossible.  Park factors, defense and strength of opponents could go some of the way in explaining that unexplained portion, but I don’t know how far they would go.

The problem is that explaining one season of RA9 is very difficult: because the sample is so small, random variation plays a large part.

There’s also the issue of BABIP: in a 150+ IP sample, 75 percent of the number is random. Defense only affects 13 percent of BABIP, on average.

So while defense and park could explain some, RA9 is so random that in a one season sample, even the best predictive model is going to leave a lot to be desired.

Glenn DuPaul
11 years ago

@rubes

I honestly don’t know what to say.  You can run the comparison for pFIP with ZIPS or with the other ERA estimators without the constant and simply use the coefficients and you’ll get the exact same result.  The constant is just a scaling point, that is an attempt to make the information look like ERA or RA9. 

FIP is the same statistic whether or not you add 3.2 to it.  It is in no way a fake data set. Tango’s FIP takes what actually happened (HRs, BBs, Ks) and weights those outcomes by their run values to get a very accurate assessment of a pitcher’s performance.

The constant is not the data, it is just an anchoring point. Run all of the data without the constant and you’ll see that it is irrelevant.
 
The most significant / impactful part of this exercise is the fact that the home run coefficient gets regressed and strikeouts become weighted more as they are more predictive.  Adding two to the constant does not matter.
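The point about the constant can be checked directly: adding the same number to every pitcher's score shifts the scale but leaves the r-squared against next season's RA9 unchanged. The sketch below uses a hypothetical helper and made-up numbers, not the actual pFIP sample.

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Made-up values: the weighted part of the formula with no constant added,
# i.e. (17.5*HR + 7*BB - 9*K)/PA, and the following season's RA9.
raw_scores = [-1.10, -0.60, -0.95, -0.20, -0.75, -0.40]
next_ra9 = [3.6, 4.9, 4.4, 5.1, 3.9, 4.7]

# The same scores after adding the 5.18 constant.
with_constant = [s + 5.18 for s in raw_scores]

without = r_squared(raw_scores, next_ra9)
shifted = r_squared(with_constant, next_ra9)
print(abs(without - shifted) < 1e-9)  # True: the constant does not move r-squared
```

Only the weights change how pitchers are ordered and spaced relative to one another; the constant just moves everyone by the same amount.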

rubesandbabes
11 years ago

Glenn,

Sorry, no – thanks again for your reply, in any case.

I fully appreciate you have re-jiggered the calculation to be basis Plate Appearances Against vs. the other denominator which is IP (How many outs achieved).

Yes, one doesn’t even have to consider whether Glenn has done awesome work or not with the changing around of the component multipliers,…

Because the simple fact is the results have gone from 3something to 5something, mainly reflecting the change in this size of the constant more than anything else.

Changing the size of the constant is much more significant than changing 3Walks over IP vs. 7Walks over PA.

You added two – fine for a scale, but you are trying to make an arbitrary constant look equal to Runs Allowed, and then comparing it around. An arbitrary constant might look nice, but it is not runs allowed, nor anything like it.

The Bissingerians rest their case!

rubesandbabes
11 years ago

Yes, well Bissingerian law technically prevents me from asking, but if it could be explained how Glenn arrived at the constant 5.18 in one instance, vs. 5.23 in another? A point in the right direction would do, rather than a technical explanation an average fan can’t understand.

I haven’t run the numbers on say Matt Cain’s 2012 through these various calculations, but I don’t have to. He goes from 3something to 5something.

And much like an old-time scout discounting a RHP due to lack of height, these models don’t really do any more than FIP to clarify the future for the Kyle Lohses and Trevor Cahills…Strikeout a lot of guys, don’t walk the leadoff man, don’t give up too many bombs, predictive stats will love you (of course).

Tom
11 years ago

Yeah, you might want to actually run those numbers.  Hint: even though pFIP is on the RA9 scale, which is ~1.08x higher than the FIP/ERA scale, his pFIP from 2012 stats still starts with a 3.

rubesandbabes
11 years ago

Yes, I guess a simpler way to say it, is that it is fine to scale the various results of whatever FIP calculation using a constant, but when the results are applied/compared elsewhere, the constant, which is a lot larger number than the calculation result, becomes the data.

Glenn’s adding 2 to the constant is by far the most significant / impactful part of this exercise.