Our friend Brian Joura of RotoGraphs posted an article today citing my own article about the problems with FIP from earlier in the year. My assertion from then, which I still stand by completely:

While the original, underlying premise for FIP is sound, and while it’s absolutely better to use than simple ERA, and while there are certainly uses for FIP in some circumstances, for 99 percent of fantasy purposes, I ignore FIP completely and absolutely.

I noticed a few comments to Brian’s article that didn’t seem to completely buy my explanation, so I thought I’d run some quick numbers to help provide further evidence that a stat like LIPS or xFIP is better than FIP.

### HR/FB instability

By definition, the only substantial difference between FIP and xFIP is that xFIP adjusts each stat line to assume a league average HR/FB, so this crude study will focus entirely on HR/FB.
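As a rough sketch of that difference, here is the commonly published form of FIP next to an xFIP that swaps actual home runs for fly balls times a league-average HR/FB. The `constant` default and the sample stat line are placeholders I am assuming for illustration, and the exact formulas used by THT or other sites may differ in details such as HBP.

```python
def fip(hr, bb, k, ip, constant=3.20):
    """Commonly published FIP form; `constant` is an assumed placeholder."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant


def xfip(fb, bb, k, ip, lg_hr_per_fb, constant=3.20):
    """Same formula, but with home runs replaced by FB * league HR/FB."""
    expected_hr = fb * lg_hr_per_fb
    return fip(expected_hr, bb, k, ip, constant)


# Hypothetical stat line: 200 IP, 60 BB, 180 K, 30 HR allowed on 200 fly balls.
# The 11% league HR/FB is a placeholder, not a measured league figure.
print(round(fip(hr=30, bb=60, k=180, ip=200), 2))
print(round(xfip(fb=200, bb=60, k=180, ip=200, lg_hr_per_fb=0.11), 2))
```

The two numbers differ only because this made-up pitcher allowed more home runs than his fly ball total would predict at a league-average rate.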

I looked at all pitchers with at least 12 games started in adjacent seasons from 2004 to 2008. Over this period, we find 63 pitcher seasons where a pitcher’s HR/FB strays at least four percent from league average* in Year 1. In Year 2, just 5 of those 63 pitchers (7.9%) failed to regress in the direction of league average. That’s a very small number, especially when you consider that Chien-Ming Wang (who may be one of the rare exceptions I mentioned) and Brett Myers (who almost certainly is one of those rare exceptions) accounted for 2 of those 5 seasons. Exclude them, and the percentage becomes 4.8%.

This is a very crude study, but hopefully it reestablishes my point. HR/FB is unstable, and because FIP makes no adjustment for it, FIP will be misleading and less accurate than other indicators. David Gassko did some much more thorough work on HR/FB in the THT Annual 2007 (which can be read for free here), but the short version is that for pitchers with 350+ TBF, the previous season’s HR/FB explains just 3% of the variance of the following season’s HR/FB.

*I used a rough estimation of league average, using the aggregate league average for all five years. This is the lazy way to do it but won’t change my point.
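For anyone who wants to repeat this kind of count, a minimal sketch follows. The function name, the sample pairs, and the 11.0 league average are all made up for illustration, and "failed to regress" is assumed to mean that Year 2 did not move from Year 1 in the direction of league average.

```python
def count_failed_regressions(pairs, league_avg, threshold=4.0):
    """Count Year 1 HR/FB outliers and how many failed to regress.

    pairs: list of (year1, year2) HR/FB values, in percent.
    An outlier is at least `threshold` points from league average in Year 1.
    A failure is an outlier whose Year 2 value did not move toward league
    average relative to Year 1 (an assumed definition).
    """
    outliers = 0
    failed = 0
    for year1, year2 in pairs:
        gap = year1 - league_avg
        if abs(gap) < threshold:
            continue
        outliers += 1
        moved_toward = year2 < year1 if gap > 0 else year2 > year1
        if not moved_toward:
            failed += 1
    return outliers, failed


# Made-up HR/FB pairs and a placeholder league average, for illustration only.
sample = [(16.0, 12.0), (5.0, 9.0), (17.0, 18.0), (11.5, 14.0)]
print(count_failed_regressions(sample, league_avg=11.0))

# The counts quoted in the article: 5 of 63, then 3 of 63 without Wang and Myers.
print(round(100 * 5 / 63, 1), round(100 * 3 / 63, 1))
```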

### Anecdotal evidence and precision

One comment from Brian’s article that I thought would be useful to answer for everyone:

“Well…FIP definitely helped predict Ricky Nolasco’s turnaround. Not sure what his xFIP was….”

We must remember that FIP is not so utterly useless that it will be incorrect in every scenario. In scenarios where the pitcher has a lucky or unlucky BABIP or LOB% (Nolasco’s BABIP was over .400 at one point), FIP will be able to predict the general direction the pitcher’s ERA should move as long as the HR/FB isn’t too far away from league average.

While we’ll know that Nolasco isn’t a 6.00 ERA pitcher, it is important to make a distinction over whether his ERA should be 4.50 or 4.00 or 3.50. Even the difference between a 4.25 and 4.00 ERA is the difference between ‘solid starter’ and ‘waiver wire material’ in many leagues. FIP is ill-equipped to make this distinction.

We can’t allow anecdotal evidence to rule our decision making. While FIP may have worked in Nolasco’s case given a very rough objective, the numbers tell us that a stat like xFIP or LIPS will be more accurate, for more pitchers.

Brandon Heikoop said...

So it’s not that FIP is “bad” for fantasy analysis, it is that looking at FIP without context is bad fantasy analysis. One could have recognized this under any circumstance as looking at any one statistic without context is bad analysis in any spectrum, baseball notwithstanding.

Hockey fans, who aren’t as stat savvy as baseball fans, know that you cannot simply look at a goalie’s GAA and deem him #1; there are other circumstances that must be evaluated (SV%, SOG, SHO, PPGA, SHGA, etc.).

Basketball fans, who are becoming more stat savvy, know that the shooter with the highest FG% is not the best “shooter” in the league.

So if I were writing this article, my assertion would be that looking at anything without context, or in a bubble, is invalid analysis.

Derek Carty said...

Brandon,

I suppose that’s one way to look at it. But when we have readily available stats like xFIP, I don’t think it makes much sense to look at FIP and try to make subjective estimates about context. All the necessary adjustments have already been made for us. My beef is mostly that analysts continue to use FIP without looking at context and treating it as the exact number that a pitcher’s ERA should be, even when it’s completely off because of his HR/FB. If a pitcher’s HR/FB isn’t exactly league average, I just don’t think looking at FIP is the smartest move when a clear upgrade is there for the taking.

Josh,

Stats like FIP, xFIP, LIPS, etc implicitly assume a league average BABIP and LOB%. By excluding hits and weighting things like K, BB, etc, this accounts for stuff like BABIP.

K76154,

Yeah, sorry, LIPS isn’t available yet. I’d recommend using xFIP.

TheKid said...

This is a great back and forth discussion you have begun with Mr. Joura. I love both sites and read both regularly.

Josh said...

What if we normalized for BABIP and Strand Rate as well? Would that be more useful? Is it hard?

K76154 said...

I want to use LIPS ERA instead, but the problem is that the major websites do not provide LIPS ERA data, and it’s hard to calculate by myself.

Will Larson said...

Derek and others: The stats that really matter are the contact stats (hr/fb, gb%, ld%, iffb%) and the non-contact stats (k/9, bb/9). From these, you can predict how a pitcher will do much better than with FIP or any of the others. BABIP, LOB%, and ERA are all heavily dependent on the type of contact a pitcher induces. Thus, a pitcher with a very high fb% will have a permanently low BABIP but a higher hr/9 than the league average. Check out http://www.williamlarson.com/?p=95 or http://www.williamlarson.com/baseball_document.pdf for more info.

Derek Carty said...

William,

You’re absolutely right that these are stats that we should be looking at for pitchers. If you go through the THT Fantasy archives, you’ll see countless examples where we use them.

However, I believe you might be missing the point of stats like FIP, LIPS, etc. What you cited are component skills. What ERA Estimators look to do are combine certain component skills in the proper proportions to estimate what a pitcher’s “luck neutral” ERA would be.

Sure, we can look at K/9, BB/9, GB% etc, but we don’t know the exact contribution each has to ERA in our heads, and we certainly can’t run those calculations in our heads each time we see a shift in one of these numbers. That’s why we have ERA estimators. They do the work for us. We can look at the component skills to see which is moving and determine whether the LIPS ERA shift is more likely to be permanent or things of that nature.

You said that “from these, you can predict how a pitcher will do much better than with FIP or any of the others”. Well, not necessarily. Not just by looking at them anyway. You’d need to weight each one properly and combine them and then you could, but then we’d just be deriving another ERA Estimator like FIP, LIPS, etc.

My beef this whole time with FIP has been because FIP chooses the wrong component skills for its formula and people continue to use it as if it’s gospel. HR/FB is heavily luck dependent, so it shouldn’t be treated as a full-blown skill, which FIP does. LIPS, on the other hand, uses K, BB, six batted ball types, numerous outcomes for each batted ball type, park factors, LD% normalization, and a number of other things to determine its version of expected ERA. xFIP at least considers K, BB, and GB% (inherently), the big ones we need to worry about.

Also, while you’re right about extreme FB pitchers having lower BABIPs, the difference isn’t enormous, and because there are few of these kinds of pitchers we can treat them just a little bit differently. And even if we don’t treat them differently at all, our results would still be more accurate than to use actual BABIP.

As far as LOB% goes, I don’t believe that the types of batted balls a pitcher induces have much effect on it unless the pitcher is an extreme, extreme FB pitcher. In 2008, league average was 71.4%. For pitchers with GB% >= 50%, it was 71.4%. For pitchers with GB% <= 35%, it was 71.9%. If we drop GB% below 30%, we get 70.1% for 2004-2008. If we drop it below 25%, then we see some shift to 65.4%, but that’s in a somewhat small sample size.

Anyway, I hope that covers everything. If you disagree, feel free to let me know and I’ll clarify. I’ll also be sure to take a read through your site.

Will Larson said...

Derek, thank you for your thoughtful response. I think this is a great conversation and one that people should be having more often.

I agree with most of what you’re saying. However, I think you underestimate the effect batted ball stats have on ERA.

First, batted ball stats affect BABIP. Regression analysis shows that a 10 percentage point shift from LD% to GB% reduces your expected BABIP by 50 points (from .300 to .250 for example). A 10 percentage point shift from FB% to GB% increases your expected BABIP by 10 points (from .300 to .310 for example).

Where batted ball stats really kick in are in LOB%. If you have a higher expected BABIP, you have a lower expected LOB%. This effect multiplies the impact of batted ball stats on ERA because small changes in batted ball stats affect BABIP which then affects the percent of runners left on base. I’m not sure if you’ve considered this multiplier effect.

Finally batted ball stats affect the HR/9 stat. This is pretty obvious to most, but it shouldn’t be overlooked.

All of these multiplier effects to batted ball stats end up meaning that batted ball (and k/9 and bb/9) can explain a LOT of variation in ERA from pitcher to pitcher.

As anyone knows, one lucky bloop single can drive in a couple of runs and keep an inning going for more damage. This luck adjustment is critically important to forecasting what a pitcher’s ERA will be moving forward.

In fact, using only LOB%, k/9, bb/9, babip, and hr/9, we can predict a pitcher’s ERA with 96% fit! That means if you can predict LOB% and BABIP with any skill, which can be done with batted ball stats, we can come a long way to predicting ERA.

Let’s keep this conversation going. Let’s work together to come up with a good way to predict ERA. I don’t think we’re all the way there, but we’re getting closer.

Brandon Heikoop said...

Derek,

I agree. But again, in a bubble, no statistic is truly accurate. We see all the time that authors will call for a player to be a sleeper based solely on his BABIP compared to the league average, which can be useful, but in the same sense, is not 100% accurate.

For me personally, I will often use FIP contextually because I am lazy and typically utilize Fangraphs as my statistical database. If someone challenged me on this I certainly would not disagree and say, “Noooo, FIP is correct” in the same sense that you wouldn’t with xFIP, LIPS, or DIPS.

Derek Carty said...

Brandon,

Absolutely right. No stat is 100% accurate. If it’s a matter of being lazy, that’s perfectly fine for fantasy players. But for analysts to be lazy is by no means fine (and I’m pretty sure they’re not lazy, just misinformed or unaware or something like that). While we will never be 100% accurate, a stat like LIPS or xFIP will be *more* accurate than FIP, and that’s the best we can do.

Derek Carty said...

William,

I don’t think I’m underestimating anything. While you’re absolutely right that batted ball shifts can have big effects, the biggest ones are more a matter of luck evening out than anything else.

We must remember that pitchers do not have much control over LDs, and that when we evaluate batted ball stats we should always normalize the LD% first. If you dig through the archives, you’ll see I always refer to xGB% as opposed to actual GB%. xGB% normalizes the LD%.

I don’t think it’s a sound approach to say, “Player X has a 10% LD% and .300 BABIP. If he gains 10% on his LD%, his BABIP will rise to .350.” More goes into BABIP than just batted ball stats, and if we are going to use them to predict BABIP (certainly a fine pursuit), we shouldn’t be basing them on shifts and then applying that to the player’s current BABIP.

What I mean is that it’s much sounder to say “LDs become hits 72% of the time, GBs 24% of the time, FBs 14% of the time” and then apply those rates to the pitcher’s expected number of each batted ball type and derive an expected BABIP that way.

It’s not unexpected that a 10% shift from LD% to GB% will result in a .050 BABIP shift on average (I actually get a nearly identical result with the method listed above) because LDs fall in for hits at such a high rate. We can’t just tack 0.050 onto a pitcher’s BABIP, though, and call it a day. It’s theoretically possible for a pitcher to have a lucky LD% and unlucky BABIP, and simply adding 0.050 would just make things worse in these cases. And since LD% doesn’t perfectly mirror BABIP, there will be some degree of problems in almost all cases. It’s much sounder to look at the expected batted ball profile and weight each type from there.
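A minimal sketch of that weighting, using the three hit rates quoted above. The batted ball shares are hypothetical, and the sketch assumes every batted ball counts as a ball in play (it ignores home runs and pop-ups), so treat it as a rough illustration rather than a finished xBABIP.

```python
# Hit rates per batted ball type, as quoted in this comment.
HIT_RATES = {"LD": 0.72, "GB": 0.24, "FB": 0.14}


def expected_babip(shares):
    """Weight each batted ball type's hit rate by its share of batted balls."""
    total = sum(shares.values())
    return sum(HIT_RATES[kind] * share for kind, share in shares.items()) / total


# Hypothetical profiles: the second moves 10 points from LD% to GB%.
base = {"LD": 0.20, "GB": 0.43, "FB": 0.37}
shifted = {"LD": 0.10, "GB": 0.53, "FB": 0.37}

print(round(expected_babip(base), 3))
print(round(expected_babip(shifted), 3))
print(round(expected_babip(base) - expected_babip(shifted), 3))
```

The gap between the two profiles is 0.10 * (0.72 - 0.24) = 0.048, which is where the roughly 50-point figure comes from.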

As far as predicting BABIP and LOB%, we can certainly do it, but because there is so much variation in them, our accuracy won’t be terribly good. Even an extreme ground ball pitcher won’t be expected to post much higher than a .310 BABIP while it’s commonplace for pitchers to finish a season above .330 (due to luck, of course). And there is a multiplier effect on LOB% that should be considered, but it’s not enormous.

As to the 96% fit, this is perfectly logical because we’d essentially be using everything that can happen to a pitcher: hits, walks, strikeouts, home runs, ability with runners on base. Add in HBP and types of hits and we’d be all the way there (or very, very close, ignoring little things like controlling the running game, etc).

Finally, I think we’re very much on the same page but just not expressing it in the same way. K, BB, and batted ball stats are certainly where we’re at right now in predicting ERA, and we’re both saying it. I’m not sure if you’re familiar with LIPS ERA, but it’s arguably the most advanced ERA estimator right now and accounts for all of these things… and more. I’d highly recommend checking these two articles out on it:

http://www.hardballtimes.com/main/article/dips-lips-and-hips/

http://www.hardballtimes.com/main/fantasy/article/explaining-lips/

Will Larson said...

Derek,

I think you’re right.

We are talking about basically the same thing but from different directions. I’m saying, “We observe a BABIP, LOB%, and ERA. Now let’s try to explain it using X, Y, and Z.” You’re saying, “Usually, X, Y, and Z result in so many hits and HRs, so this is what the BABIP and ERA should be.” I think we arrive at basically the same conclusions either way.

Please do think about LOB% though. There is a substantial multiplier effect of an inflated BABIP on ERA via LOB%.

When this season is over, I’m going to do some forecast error analysis using LIPS, FIP, xFIP, and my Luck-Adjusted ERA for 2008 along with CHONE, Bill James, and Marcel forecasts for 2009 to see how well they predict 2009 stats.

If you or anyone else has any other forecasts you’d like me to put into this analysis please let me know (it’s easy to evaluate additional forecasts once I get everything up and running).

I think we’d all be interested in seeing which forecasts are the best. Then we can start moving towards improving our forecasting methods.

NadavT said...

I’m curious how you’re defining “more accurate” when you’re comparing FIP and xFIP. Are you only interested in predicting a pitcher’s performance in the following season, or are you looking at within-season accuracy as well? Looking at THT’s pitcher stats, I took a quick look comparing FIP and xFIP to ERA for the top 50 pitchers in both 2007 and 2008 (ranked by xFIP), and FIP was closer to ERA more than 60% of the time. 100 pitcher-seasons might not be enough of a sample for the HR/FB regression to truly even out, but that’s sort of the point: for the sample sizes that fantasy players care about, can you really be confident that HR/FB rates will regress towards their expected values?

Derek Carty said...

We can absolutely be confident that HR/FB rates will regress towards their expected values, Nadav. As I noted in this article, of the 63 players with lucky/unlucky HR/FBs in Year 1, just 7.9% of them *did not* regress in the direction of league average in Year 2 (and there may even be some permanent outliers included in that percentage).

As to FIP and ERA matching up better, this is actually the result we should expect. xFIP is a luck neutral stat, while ERA and FIP both incorporate some elements of luck. ERA is very luck dependent, while FIP is somewhat luck dependent. Both include actual HR totals, while xFIP ignores them completely, so logically ERA and FIP will be closer because the components they’re using are more similar (not a good thing).

If we were to create a flow chart to illustrate this, it’d look something like this:

ERA --> ERC --> FIP --> xFIP --> LIPS

Doesn’t mean FIP predicts ERA better, just that it produces results that are more similar.

When we run tests to see how accurate a predictor is, we almost always use a Year 1 and Year 2 because it allows us to have larger sample data. But the results can be generalized to in-season performance. If the results show that HR/FB is unstable, it’s going to be unstable whether we’re looking at first half/second half, year one/year two, or whatever. The larger the sample the more stable it becomes, but just because there are more fluctuations within subsets of a single year doesn’t mean the regression doesn’t occur or shouldn’t be expected to occur.

Let me know if you have any more questions!

Derek Carty said...

Will,

No argument that BABIP impacts LOB%. It certainly does. But I don’t think that the controllable aspects of BABIP will result in huge LOB% differences.

If we run a quick regression, we get a regression equation that looks something like this:

xLOB% = 100.69 - 95.99 * BABIP

which derives a table like this:

| BABIP | xLOB% |
|-------|-------|
| 0.280 | 73.82 |
| 0.285 | 73.34 |
| 0.290 | 72.86 |
| 0.295 | 72.38 |
| 0.300 | 71.90 |
| 0.305 | 71.42 |
| 0.310 | 70.94 |
| 0.315 | 70.46 |
| 0.320 | 69.98 |

If we were to predict a pitcher to have a BABIP of .310 or even .315 as a result of GB%, he’d still only lose about 1% off of his LOB% as a result. K rate would impact it as well (fewer Ks = higher BABIP and LOB%), and this is very rough stuff, but I think it shows that the effects aren’t enormous.
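For anyone who wants to extend the table, here is a small sketch of the same equation. The function name is mine. Because the two coefficients quoted above are rounded, the output can differ from the table by about 0.01.

```python
def x_lob_pct(babip, intercept=100.69, slope=-95.99):
    """Expected LOB% from BABIP, using the rounded coefficients quoted above."""
    return intercept + slope * babip


# Rebuild the table in 5-point BABIP steps from .280 to .320.
for i in range(9):
    babip = 0.280 + 0.005 * i
    print(f"{babip:.3f}  {x_lob_pct(babip):.2f}")

# Expected LOB% lost when BABIP moves from .300 to .310: roughly one point.
print(round(x_lob_pct(0.300) - x_lob_pct(0.310), 2))
```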

That being said, I have been meaning for quite some time to put together xBABIP and xLOB% metrics because this kind of stuff is certainly there but just haven’t gotten around to it.

Will Larson said...

I’m glad you use regression analysis. Makes this much easier to talk about.

If you’re interested in xLOB metrics, just regress LOB% on K/9 and BABIP. Keep in mind that xLOB% is a function of xBABIP though (you have to adjust your BABIP first, then use this as an input to compute xLOB%). That explains most of what you can. xLOB% and xBABIP can be found for 2008 at http://www.williamlarson.com/baseball_spreadsheet.xls

Will Larson said...

I can write an article introducing xBABIP and xLOB% if you’d like, as well as numbers for this year. Please email me if you’re interested.

NadavT said...

Thanks for the response, Derek. I realized the point about FIP matching up better with ERA by definition (because it takes HRs as a given, rather than being influenced by luck) after I posted the question. Nevertheless, I still think it’s important to distinguish between regression that’s expected to occur over an entire season and what might be expected to occur over a month or two. I know that people have looked at the different sample sizes necessary to have confidence in a trend for each stat, so I don’t know if HR/FB is the kind of stat that regresses fairly rapidly or if it takes half a season or longer to regress to its expected level.

Also, because xFIP doesn’t account for park effects, it’s possible that it can be a biased predictor for a pitcher’s performance over a string of starts in ballparks that are on either end of the HR-allowing spectrum. I recognize that more advanced stats account for park factors, but they’re not as easily available as xFIP.

Clint said...

One issue I haven’t seen mentioned in these discussions is that ANY fielding-independent metric needs to be put into context when used for fantasy – you don’t necessarily WANT to remove fielding from the analysis, as it influences 3 of the 4 standard fantasy SP stats (although it’s not a great idea to chase Wins). Ignoring the inherent flaws of FIP, this is the #1 reason in my mind to avoid it in fantasy discussions (without proper context at the very least).

Derek Carty said...

Will,

I’ll check out what you’ve done. I’ve actually got my own versions of xBABIP and xLOB% in my personal stat database, so I’ll be interested to see how they match up.

Derek Carty said...

NadavT,

I don’t think I agree with distinguishing “between regression that’s expected to occur over an entire season and what might be expected to occur over a month or two.” There is actually no difference. If regression is to occur, we must expect it to occur immediately. Absolutely must. Let’s say we’re at the All-Star Break and we expect the regression to be complete by the end of the season. How can a player’s numbers regress in that time, though, if we are constantly expecting it to *not* regress in the upcoming small sample of games?

If we can say that HR/FB is luck influenced and will eventually regress (which we can), it is incorrect to assume that it *won’t* regress in the pitcher’s next game simply because it’s a smaller sample. Sure, there is lots of room for random variation, but there is absolutely no way to predict it, so all we can do is assume neutral luck. Trying to do anything more is pure folly. If Randy Johnson has a 25% HR/FB in May, it doesn’t matter one ounce whether his first June game is a small sample or not. His 25% HR/FB is not reflective of his true talent level, so we can’t expect him to continue posting it just because the sample size we’re looking at is small and prone to fluctuations.

This is actually a critical point in fantasy analysis, and it’s an important one to understand.

While we can run tests and figure out how long it will take for a stat to stabilize, all this tells us is how to best estimate the player’s true talent level. If a stat takes a long time to stabilize, we might say that we should estimate using 30% player/70% league average after 200 TBF or something like that. If it takes a short time to stabilize, we might use 70% player/30% league average.

This actually implies the exact opposite of the impression you seem to be under. The more unstable a stat is and the smaller the sample, the *more* we should expect the pitcher to perform at a league average level.
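To make the weighting concrete, here is a minimal sketch of that kind of blend. The 30% weight is the illustrative figure from above rather than a measured value, and the 11% league HR/FB is a placeholder I am assuming for the example.

```python
def estimate_true_talent(player_rate, league_rate, player_weight):
    """Blend a player's observed rate with league average.

    player_weight is the share of the estimate taken from the player's own
    numbers; the remainder comes from league average.
    """
    if not 0.0 <= player_weight <= 1.0:
        raise ValueError("player_weight must be between 0 and 1")
    return player_weight * player_rate + (1.0 - player_weight) * league_rate


# A pitcher with a 25% HR/FB so far, against a placeholder 11% league average.
print(round(estimate_true_talent(25.0, 11.0, player_weight=0.30), 1))

# A smaller sample or a less stable stat means less weight on the player,
# which pulls the estimate closer to league average.
print(round(estimate_true_talent(25.0, 11.0, player_weight=0.10), 1))
```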

As far as park effects go, it’s absolutely fine to make some adjustments if the pitcher will be playing in a non-neutral park. A stat like xFIP is supposed to isolate the player’s ability. Context must be considered separately.

Derek Carty said...

You’re right Clint, that context is important. Unfortunately, fielding isn’t easily added to stats like these, and simply using ERA because fielding is included is certainly not the answer. Still, that doesn’t mean ERA estimators should be avoided altogether. It’s very important to be able to say something about the ability of the pitcher himself. We just need to remember to apply the context afterwards. I’m not sure if you’re a long-time reader, but if not, I’d recommend checking out my CAPS stat that accounts for a lot of different contextual things (and more additions are planned):

http://www.hardballtimes.com/main/fantasy/article/introducing-quality-of-opponent-adjustments-and-caps-for-pitchers/

http://www.hardballtimes.com/main/fantasy/article/introducing-caps-road-park-factors/

Will Larson said...

Derek, can you email/post your xBABIP/xLOB% for 2008? I’d like to compare with a full season of data if possible.