More recently, baseball researchers have focused on so-called defense-independent pitching statistics (DIPS) to better isolate the factors that a pitcher can help control. Voros McCracken is credited with starting the movement, but Tom Tango is responsible for the most widely used DIPS-type equation: Fielding Independent Pitching (FIP). FIP, in turn, has spawned a legion of other estimators that try to improve upon its simple formula, often with different ends in mind. Well-known derivatives include xFIP and SIERA; other recent efforts include TIPS and BERA.
However, none of these metrics is able to consider the context of each underlying event. They don’t account for each batter the pitcher faced, the number of times the pitcher faced that batter over a season, the catcher to whom the pitcher threw, or the umpire behind the plate. They also don’t consider how each event was affected by the stadium in which it occurred, the handedness of the pitcher and the batter, or the effect of home-team advantage. Nor do they account for a pitcher throwing in a loaded division, as opposed to a pitcher running up his stats against lesser competition. This limits both their overall effectiveness and, in particular, their usefulness with smaller sample sizes.
To help address these issues, this article introduces Contextual FIP, or cFIP. (I recognize that “cFIP” is a commonly used shorthand for the FIP constant that puts FIP on an ERA scale. Unfortunately, I haven’t thought of a better name, so cFIP it is. Sorry.) Building on the mixed-model approach we developed at Baseball Prospectus for Called Strikes above Average (CSAA), cFIP seeks to provide this missing context. Each underlying event in the FIP equation — be it a home run, strikeout, walk, or hit by pitch — is modeled to adjust for, as appropriate, the effect of the individual batter, catcher and umpire; the stadium; home-field advantage; umpire bias; and the handedness relationship between pitcher and batter present during each individual plate appearance.
cFIP has multiple advantages: (1) it is more predictive than other pitcher estimators, especially in smaller samples; (2) it is calculated on a batter-faced basis, rather than innings pitched; (3) it is park-, league-, and opposition-adjusted; and (4) in a particularly important development, cFIP is equally accurate as a descriptive and predictive statistic.
The last characteristic makes cFIP something we have not seen before: a true pitcher quality estimator that actually approximates the pitcher’s current ability. I recommend both its use and its further refinement.
FIP, and Its Variants
The generally-accepted equation for FIP, published at FanGraphs, is as follows:
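As commonly stated, FIP = (13 x HR + 3 x (BB + HBP) - 2 x K) / IP + C, where C is a season-specific constant that puts the result on an ERA scale. Here is a minimal R sketch of that arithmetic. The function name, the sample season line, and the default constant of 3.10 are my own illustrative assumptions, not published values.

```r
# FIP as commonly stated: (13*HR + 3*(BB + HBP) - 2*K) / IP + constant.
# The default constant is a placeholder assumption; the real one varies by season.
fip <- function(hr, bb, hbp, k, ip, fip_constant = 3.10) {
  (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant
}

# Hypothetical season line: 20 HR, 50 BB, 5 HBP, 200 K in 200 IP
example_fip <- fip(hr = 20, bb = 50, hbp = 5, k = 200, ip = 200)
print(example_fip)
```

Note that the denominator is innings pitched, a point that matters later when we switch to a batters-faced basis.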
FIP is quite useful in its current form. Among starters who qualified for the ERA title last season, on a scale of 0.0 (none) to 1.0 (perfect), the weighted correlation of their FIP to their ERA was 0.71. Among pitchers who threw 40 innings or more, the correlation was 0.70. When trying to decide if a pitcher’s terrific stretch is “for real” or a likely fluke, FIP is one of the first places knowledgeable baseball fans look.
FIP has also been extended in various ways. xFIP, conceived by Dave Studeman, operates on the premise that while a pitcher’s flyball rate is a skill, the number of fly balls that leave the stadium often is not. So, xFIP replaces a pitcher’s home run rate with his flyball rate times the league-average home run-per-fly ball rate. xFIP has less descriptive value but more predictive value than FIP. Other researchers have developed estimators that apply different weights to strikeouts, reward sequencing, and add additional components, such as batted-ball data. SIERA, developed by Matt Swartz, is the most widely followed of these and functions reasonably well in predicting future runs allowed.
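The xFIP substitution described above can be sketched the same way: swap actual home runs for fly balls times a league-average home-run-per-fly-ball rate, and leave the rest of the FIP arithmetic alone. This is only an illustration under assumptions; the function name, the 10 percent league rate, and the 3.10 constant are placeholders I chose for the example.

```r
# xFIP sketch: expected home runs replace actual home runs in the FIP formula.
# lg_hr_per_fb and fip_constant are assumed inputs, not published figures.
xfip <- function(fb, bb, hbp, k, ip, lg_hr_per_fb, fip_constant = 3.10) {
  expected_hr <- fb * lg_hr_per_fb   # fly balls times league HR/FB rate
  (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant
}

# Hypothetical pitcher: 200 fly balls, league HR/FB assumed to be 10 percent
example_xfip <- xfip(fb = 200, bb = 50, hbp = 5, k = 200, ip = 200,
                     lg_hr_per_fb = 0.10)
print(example_xfip)
```

A pitcher who allowed far more or far fewer home runs than his fly balls would suggest gets pulled toward the league rate, which is exactly why xFIP gives up some descriptive value.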
But these metrics also share important limitations. First, as currently designed, their formulas target a “deserved” ERA or some similar measure of runs allowed. This is a traditional, but increasingly questionable, goal. Earned run average, of course, charges a pitcher only for runs that were “earned” in the opinion of the official scorer. Earned or not, the runs-allowed system charges a pitcher with the full weight of any runner he put on base, even when a subsequent pitcher is the one who allowed that runner to score; the latter pitcher is charged nothing.
This distinction made sense decades ago when we couldn’t allocate the likelihood of run-scoring between two pitchers. That is no longer true. Moreover, runs-allowed uses runs per nine innings as its denominator rather than total batters faced. Although this distinction does not always make a difference, a pitcher who consistently allows four or five runners over the course of three outs is simply not as good, and will not be as successful in the long term, as a pitcher who usually retires all three batters he faces. In summary, we should not be calibrating our run estimation metrics to ERA if we can avoid it.
Instead, we should be using RE24, a sabermetric improvement that unfortunately sounds like a pharmaceutical. Rather than whack one pitcher or the other with the entire consequence of a handed-off runner, RE24 debits a departing pitcher solely for the run expectancy of the situation left behind, and similarly debits a reliever only for the runs scored in light of that pre-existing expectancy. The reliever who gets out of an inherited jam will be credited accordingly. And in RE24, runs are runs, regardless of whether they are “earned” or not.
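To make the hand-off accounting concrete, here is a small sketch of how a run-expectancy split might work for a starter who leaves two runners on base. The run-expectancy values and the function name are hypothetical, and I use a "runs charged" sign convention (positive is bad for the pitcher), which may be the opposite of how a published pitcher RE24 is signed.

```r
# Runs charged to a pitcher for a stretch of play:
# runs that scored plus the change in run expectancy while he was pitching.
re24_charge <- function(re_start, re_end, runs_scored) {
  runs_scored + re_end - re_start
}

re_inning_start <- 0.50   # bases empty, no outs (hypothetical value)
re_at_handoff   <- 0.90   # two on, one out (hypothetical value)

# Starter is charged only for the jam he created, not for what happens next.
starter_charge <- re24_charge(re_inning_start, re_at_handoff, runs_scored = 0)

# Reliever who strands both runners: inning ends, expectancy drops to zero.
reliever_escapes <- re24_charge(re_at_handoff, 0, runs_scored = 0)

# Reliever who lets both runners score before ending the inning.
reliever_burned <- re24_charge(re_at_handoff, 0, runs_scored = 2)

print(c(starter = starter_charge,
        escapes = reliever_escapes,
        burned  = reliever_burned))
```

In the escape scenario the two charges sum to the inning total: no runs scored against a starting expectancy of half a run.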
RE24 is not perfect, either. It does not consider defense and holds a pitcher fully responsible for everything that happens on the field on a play. But these shortcomings are equally true of ERA, and are no reason to avoid RE24. In this article, we will use RE24 per plate appearance (RE24/PA), not ERA, to compare the abilities of these metrics to one another. RE24 is published at FanGraphs.
The second and larger problem is the focus of this article: that, as discussed above, the underlying statistics used by FIP and its brethren lack context. Home runs are the same in Petco Park as they are at Coors Field. A walk to a .250 wOBA hitter is the same as a walk to a .400 wOBA hitter. Striking out Matt Carpenter is as impressive as striking out Javier Baez. And brushing a batter who crowds the plate is the same as drilling an ordinary hitter in the back.
The inability to consider each event individually also limits the effectiveness of existing attempts to make park and league adjustments. Variants like “FIP-,” “xFIP-,” and “ERA+/-” do try to account for park and league. But because they can’t consider each plate appearance individually, they are forced to make broad assumptions across a pitcher’s seasonal statistics. As far as I can tell, these statistics assume that half of each pitcher’s innings were pitched at home (not necessarily true), and that the remainder of his games were played at stadiums whose run-scoring environments cancel each other out (also not necessarily true).
These metrics, therefore, apply a park scoring factor to half of a pitcher’s innings commensurate to his home stadium and assume the remaining events all occurred in a league-average scoring environment. These assumptions may be close enough for missile work, and we’ll see below that they do produce a slight improvement in the results as compared to their original metrics. But we can now do better, and we should.
Recently, Baseball Prospectus published an article I wrote with Harry Pavlidis and Dan Brooks that introduced Called Strikes above Average (CSAA) to help measure catcher framing. The article endorsed the use of mixed models to account for the context of each plate appearance involved in a player’s season. Mixed models allowed us to dramatically improve our understanding of catcher framing, and mixed models can open new doors in other areas of baseball research as well.
With a mixed model, we can introduce context to FIP while retaining much of its simplicity. A mixed model can simultaneously weight every plate appearance in a season to determine whether a pitcher is truly home-run prone or primarily a victim of stadium and schedule. We find out whether a pitcher is actually a strikeout master or just carving up shark bait. We find out these things in far fewer plate appearances than other metrics require. And what we will find, in the end, is that cFIP harmonizes descriptive and predictive DIPS, allowing us to estimate the pitcher’s true pitching talent during a particular season.
For an explanation of how mixed models work, please review our CSAA article at Baseball Prospectus. The models I created here are similar to the CSAA models, and the specifications for each of them are in the Appendix for those who wish to try out the code for themselves.
As with CSAA, I selected a generalized linear mixed model. I used the free R computing environment (3.1.2) and the freely-available lme4 package. I then downloaded, from Retrosheet, every plate appearance from the 2011 through 2014 seasons and excluded those that did not include a terminal batter event. I specified models for each current component of FIP — home runs, walks, strikeouts and hit by pitch — and applied a mixed model to readjust those components for each pitcher in light of the adjusted circumstances of each plate appearance.
The revised numbers for each pitcher, as compared to a league-average (“null”) probability, were then multiplied by the standard FIP coefficients and summed. The resulting number is converted to a “minus-style” metric on a scale of 100 with a standard deviation of 15. For the 2014 season, all 183,929 plate appearances were modeled, and the context-adjusted FIP (cFIP) of each pitcher was collected. The model takes about 15 minutes to run on a season’s worth of data.
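The weighting and rescaling step can be sketched roughly as follows. Every number here is invented, and the sketch assumes a plain unweighted z-score across pitchers; whether the actual calculation weights pitchers by batters faced when standardizing is not something this example should be read as settling.

```r
# Hypothetical net probabilities per plate appearance
# (pitcher minus league-average "null"); all values are made up.
net <- data.frame(
  pitcher = c("A", "B", "C", "D"),
  hr  = c(-0.004,  0.000,  0.002,  0.005),
  bb  = c(-0.010,  0.002,  0.000,  0.015),
  hbp = c( 0.000, -0.001,  0.001,  0.002),
  k   = c( 0.060,  0.010, -0.020, -0.040)
)

# Apply the standard FIP coefficients and sum the components.
raw <- 13 * net$hr + 3 * (net$bb + net$hbp) - 2 * net$k

# Rescale to a "minus-style" metric: mean 100, standard deviation 15.
# Lower is better, because lower raw values mean fewer HR/BB/HBP and more K.
net$cfip <- 100 + 15 * (raw - mean(raw)) / sd(raw)

print(net[, c("pitcher", "cfip")])
```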
cFIP in Action
Because cFIP is on a 100 “minus” scale, 100 is perfectly average, scores below 100 are better, and scores above 100 are worse. Because cFIP has a forced standard deviation of 15, we can divide the pitchers into general and consistent categories of quality. Here is how that divides up for the 2014 season, with some representative examples:
|Representative Examples, 2014 Season|
| cFIP Range | Z Score | Pitcher Quality | Examples |
| <70 | <-2 | Superb | Aroldis Chapman (36/best), Sean Doolittle (49), Clayton Kershaw (57), Chris Sale (63) |
| 70–85 | <-1 | Great | Zach Duke (72), Jon Lester (75), Mark Melancon (75), Zack Greinke (82) |
| 85–95 | <-.33 | Above Avg. | Hyun-jin Ryu (87), Francisco Rodriguez (88), Johnny Cueto (89), Joba Chamberlain (90) |
| 95–105 | -.33 to +.33 | Average | Tyson Ross (95), Sonny Gray (96), Matt Barnes (99), Brad Ziegler (104) |
| 105–115 | >.33 | Below Avg. | Brian Wilson (106), Tanner Roark (107), Nick Greenwood (111), Ubaldo Jimenez (112) |
| 115–130 | >1 | Bad | Edwin Jackson (116), Jim Johnson (120), Kyle Kendrick (124), Aaron Crow (125) |
| 130+ | >2 | Awful | Brad Penny (130), Paul Maholm (131), Mike Pelfrey (132/worst), Anthony Ranaudo (132/worst) |
I’ve provided a mix of starters and relievers for each approximate category. Obviously, it is more impressive for a starter to achieve each category than a reliever, because a starter pitches so many more innings. Any cFIPs under 70 are, for starters, basically your Cy Young candidates, provided they pitch enough innings: Kershaw, Sale, Corey Kluber, Yu Darvish, Jose Fernandez, and Max Scherzer. We might as well include Phil Hughes and David Price, who checked in at exactly 70.
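For readers who want to bucket scores themselves, the ranges in the table can be expressed with R's `cut()`. The function name is mine, and the boundary handling is an assumption inferred from the examples: a score sitting exactly on a boundary goes into the worse category, so 70 is "Great" rather than "Superb" and 130 is "Awful".

```r
# Map cFIP scores onto the quality categories from the table.
# right = FALSE makes each interval closed on the left: [70, 85), [85, 95), ...
cfip_category <- function(cfip) {
  cut(cfip,
      breaks = c(-Inf, 70, 85, 95, 105, 115, 130, Inf),
      labels = c("Superb", "Great", "Above Avg.", "Average",
                 "Below Avg.", "Bad", "Awful"),
      right = FALSE)
}

# A few scores taken from the table's examples
scores <- c(Chapman = 36, Lester = 75, Ross = 95, Jimenez = 112, Penny = 130)
categories <- as.character(cfip_category(scores))
print(data.frame(score = scores, category = categories))
```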
But, go ahead and check out the cFIP scores for yourself. I’ve posted the results for every pitcher in baseball for 2011, 2012, 2013, and 2014. Review, compare, and discuss to your heart’s content.
Where does cFIP disagree with other DIPS metrics the most? Those comparisons are easiest to make with FIP- and xFIP-, as they are not only also on a “minus” 100 scale, but also represent two of the most popular metrics that try to account for park and league. Compared to FIP-, here are a few significant disagreements among pitchers with 170-plus batters faced. Remember, lower numbers are better, and higher numbers are worse.
|Significant Differences Between FIP- and cFIP-|
By far the biggest gulf belongs to Frieri, whom cFIP identifies as incredibly unlucky last year, with his struggles better explained by the quality of his opposition and ballparks. The team that saw through this and signed him as a bounce-back candidate was — surprise! — the Rays. At $800,000 plus incentives, the Rays seem poised to capitalize on yet another inefficiency.
Sabathia is a similarly interesting candidate. Although his performance last year was ugly, at a FIP- of 123, cFIP sees him, even blind to his injury issues, as a still-above-average pitcher who ran into a buzz saw of circumstances.
On the other end of the spectrum, there is Anderson, signed with some fanfare by the Dodgers this offseason. Anderson had a sparkling 2.99 FIP in limited action, which looks impressive at first glance when you consider it was achieved with the Rockies. However, cFIP is not buying it, viewing him as a purely average (99) pitcher considering his opponents and even his ballparks. cFIP thus sees Mr. Friedman as having probably overpaid for Mr. Anderson, but the Dodgers, for whatever reason, seem to be trusting his FIP to combine with some actual good luck on injuries.
The disagreements between cFIP and xFIP- are less extreme, which is unsurprising given that xFIP operates on a tighter distribution than FIP. Nonetheless, there are still some notable disagreements, particularly with starters:
|Significant Differences Between xFIP- and cFIP-|
xFIP is a funny thing: while it often grants appropriate compassion to victims of bad flyball luck, it also refuses (by design) to credit pitchers who excel at minimizing flyball damage. That certainly seems to describe Scherzer and Hughes, both of whom were punished by xFIP’s typical regression toward league average on fly balls. cFIP does not see it that way, and in fact finds their 2014 performances to be downright exceptional in light of the competition and ballparks they faced.
The same cannot be said for Iwakuma, about whom cFIP is more skeptical. Granted, a cFIP of 88 is nothing to sneeze at; Iwakuma is still an above-average pitcher. But cFIP seems to feel Iwakuma should have performed much better given his pitcher-friendly home ballpark (Safeco Field) and some of the terrible in-division teams (Rangers and Astros) he got to face last year.
cFIP certainly talks a good game. It is context-adjusted, sits on a 100 scale, and uses batters faced as its denominator; these all sound promising. But what is the best way to compare its effectiveness to current DIPS statistics? The answer to that question is, for me, the most fascinating part of this study.
Let’s start with the basics. As most of you know, statistics are commonly divided into two general categories: descriptive and inferential. Descriptive statistics describe what has happened in the past, whereas inferential (a.k.a. “predictive”) statistics are focused on drawing inferences about the future from the limited information we have now. Until now, pitcher metrics have forced baseball researchers to choose between those two characteristics.
That is about to change. To understand why, we need to review the descriptive and predictive abilities of all these metrics.
To compare descriptive power, let’s look at the average in-season performance of the various estimators, correlating to FanGraphs’ RE24/PA. At the suggestion of Tom Tango, I’ll also include kwERA (formula: (K - (BB - IBB + HBP)) / PA). We’ll average a four-year sample for each:
|Average In-Season Estimator Performance, 2011-2014|
This chart considers all pitchers with 170-plus batters faced, which is approximately equivalent to at least 40 innings pitched. The pitching metrics all have an inverse correlation with FanGraphs’ RE24, so remember that -1.0 is the best possible score, showing a perfect negative correlation, and 0.0 is still the worst (meaning no correlation).
Not surprisingly, RA9 (raw runs allowed per nine innings) ties for the most accurate with ERA-, ERA’s park- and league-adjusted cousin; the two share the tightest correlation to RE24/PA. Because runs-allowed metrics simply tell you how many runs crossed the plate, this is to be expected. After ERA, which does only slightly worse, we have kwERA at -.84, with FIP and FIP- checking in at -.64 and -.65, respectively. cFIP registers an average of -.63, whereas SIERA and the xFIPs bring up the rear by a small amount.
If we do a weighted correlation of all pitchers to RE24, including those who faced as few as one batter, here are the results:
|Average In-Season Estimator Performance Correlated to RE24, 2011-2014|
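A weighted correlation of this kind can be sketched in a few lines of base R. The function below is a generic weighted Pearson correlation, not necessarily the exact routine used for the table; the choice of batters faced as the weight and all of the sample values are assumptions for illustration.

```r
# Weighted Pearson correlation: each observation counts in proportion to w.
weighted_cor <- function(x, y, w) {
  w  <- w / sum(w)                    # normalize weights to sum to 1
  mx <- sum(w * x)                    # weighted means
  my <- sum(w * y)
  cov_xy <- sum(w * (x - mx) * (y - my))
  cov_xy / sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))
}

# Hypothetical pitchers: an estimator, RE24/PA, and batters faced (all made up)
estimator <- c(82, 95, 101, 118, 140)
re24_pa   <- c(0.021, 0.006, -0.002, -0.015, 0.030)
tbf       <- c(850, 600, 420, 300, 4)   # the last pitcher barely pitched

# Weighting by batters faced keeps the four-batter outlier from dominating.
print(c(unweighted = cor(estimator, re24_pa),
        weighted   = weighted_cor(estimator, re24_pa, tbf)))
```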
The pecking order is very similar to before, except cFIP and kwERA now move to the bottom in performance. Is this concerning? Actually, it’s not, for reasons that we’ll see in a moment.
So, let’s talk about the other side of the coin: the ability to draw inferences about the future. For most people, this is where the rubber hits the road. There is something to this sentiment. We already know how many runs came across the plate. What we usually want to know is if there is any reason the pitcher’s results should change. And so, much of the discussion of DIPS metrics tends to revolve around each estimator’s ability to predict runs allowed by a pitcher in future seasons.
We’ll use the so-called “Year+1” test for these metrics: how well they do in predicting RE24 per plate appearance (RE24/PA) in the pitcher’s next season. I took the harmonic mean of the total batters faced from consecutive seasons for each pitcher. Again, we are looking for a range of -1.0 (best) down to 0.0 (worst). Here are the results, first for pitchers with 170-plus batters faced, then with all pitchers in a given season:
|Predicting RE24/PA, 2011-2014, Min. 170 TBF|
|Predicting RE24/PA, 2011-2014, All Pitchers|
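The Year+1 setup described above can be sketched as follows: pair each pitcher's metric in one season with his RE24/PA in the next, and weight each pair by the harmonic mean of his batters faced in the two seasons. The data and column names are hypothetical, and using `cov.wt()` from base R for the weighted correlation is my choice for the example.

```r
# Harmonic mean of two positive numbers; it leans toward the smaller one,
# so a pitcher needs a real sample in BOTH seasons to carry much weight.
harmonic_mean <- function(a, b) 2 * a * b / (a + b)

# Hypothetical pitcher-seasons; every value below is made up.
year1 <- data.frame(pitcher = c("A", "B", "C", "D"),
                    metric  = c(80, 95, 104, 120),
                    tbf     = c(800, 250, 600, 100))
year2 <- data.frame(pitcher = c("A", "B", "C", "E"),
                    re24_pa = c(0.020, 0.004, -0.003, 0.010),
                    tbf     = c(700, 300, 650, 90))

# Keep only pitchers who appear in both seasons.
pairs <- merge(year1, year2, by = "pitcher", suffixes = c(".y1", ".y2"))
pairs$w <- harmonic_mean(pairs$tbf.y1, pairs$tbf.y2)

# Weighted correlation between this year's metric and next year's RE24/PA.
year_plus_1 <- cov.wt(pairs[, c("metric", "re24_pa")],
                      wt = pairs$w, cor = TRUE)$cor[1, 2]
print(year_plus_1)
```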
cFIP is the winner here, which is what we would expect from a metric that benefits from considering the context of each plate appearance. So, going forward, cFIP appears to be a better choice than any other metric for predicting future runs allowed.
At the same time, none of these performances is something to write home about. A -0.40 isn’t bad, necessarily, but it may not be that much better than simply picking random numbers or just projecting everyone to be average.
Why does every metric do so poorly in predicting next-season RE24/PA? Much of it probably is the time-honored concept of regression to the mean: those who went up will generally come down, and vice versa. Predicting exactly which players will obey this rule and how severely they will obey it is quite difficult.
I would like to propose a different approach for predicting future performance. I believe the ideal goal of a pitcher estimator should be to estimate the inherent quality of the pitcher, not to specifically estimate the pitcher’s future runs allowed. Future results, after all, are a combination of pitcher quality + circumstances. (And random variation, but that is always present.)
I would argue the best way to account for the latter is through projection systems, not pitching estimators. Projection systems like PECOTA, Steamer and ZIPS are designed to take into account circumstances like changes in team, stadiums, injury history, and the like. They can also explicitly incorporate a general regression factor as part of these complicated adjustments.
Thus, if we are looking for an accurate estimator of pitcher ability, what we should be considering is not how the estimator predicts future run expectancy, but how the estimator correlates with itself in consecutive seasons. After all, we already know from our descriptive analysis how well these metrics correlate in-season to run expectancy; what we really want to know in projecting future results is whether the metric is accurately assessing the same qualities in the pitcher. We find that by testing the metric against itself out of sample. The hypotenuse of that analytic triangle — the translation of the pitcher metric to future run expectancy — then can be implemented in an actual projection system.
To do this, I compared the season-to-season correlation of the various run estimators for all pitchers with at least 170 batters faced. These figures will be positive, because we are correlating the metric to itself. So, the best score is 1; the worst is 0. Here is what I found:
|Season-to-Season Correlation, Run Estimators, 2011-2014|
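Testing a metric against itself out of sample amounts to lining up each pitcher's score in one season with his score in the next and correlating the two columns. Here is one way that pairing could be sketched; the data are invented, the function name is mine, and the use of a plain unweighted correlation is an assumption.

```r
# Hypothetical pitcher-seasons in long format; all values are made up.
seasons <- data.frame(
  pitcher = c("A", "A", "B", "B", "C", "C", "D", "D"),
  season  = c(2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014),
  metric  = c(70, 75, 95, 100, 110, 104, 60, 125),
  tbf     = c(800, 760, 400, 450, 650, 610, 120, 300)
)

# Correlate a metric with itself in consecutive seasons.
self_correlation <- function(df, min_tbf = 170) {
  df  <- df[df$tbf >= min_tbf, ]               # sample-size cutoff in both years
  nxt <- transform(df, season = season - 1)    # relabel year N+1 rows as year N
  both <- merge(df, nxt, by = c("pitcher", "season"),
                suffixes = c(".now", ".next"))
  cor(both$metric.now, both$metric.next)
}

print(self_correlation(seasons))
```

Pitcher D drops out here because his 2013 sample falls below the 170-batter cutoff.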
cFIP is the clear winner, followed by kwERA, SIERA, the xFIPs, the FIPs, and last of all by RA9 and the ERAs, the latter of which, as most of you already knew, have very little value in predicting future run expectancy.
cFIP’s victory in predicting future performance is an impressive feat. Remember that SIERA and xFIP are both designed to excel at prediction, because they disregard the most volatile factor of all (past home runs) and replace it with a regression component. cFIP, however, describes only the actual events that happened, home runs included, and still beats both SIERA and xFIP. kwERA is also quite impressive.
But that’s not all. Predictability is usually most important in small sample sizes. ‘X’ pitcher has had two good months. Will he probably continue to pitch that well? This is where cFIP really shines. Including pitchers with as few as one batter faced, look at the year-to-year correlation of cFIP to itself, as compared to other metrics:
|Season-to-Season Correlation, cFIP, 2011-2014|
cFIP crushes the other metrics. Far better than any other estimator, cFIP predicts how capable a pitcher will be in the near future as compared to his current performance. Because of its contextual adjustments, cFIP retains the vast majority of its strength even when low-sample pitchers are included. The predictive value of cFIP is clear.
The Pitcher’s True Talent
Having explored descriptive and predictive tendencies, it’s time to move on to the next step. Before I go further, it’s important to note that I don’t think the author of any current estimator, even xFIP or SIERA, would claim that the metric estimates a pitcher’s true talent. Rather, these metrics focus on better describing what caused a pitcher’s runs allowed, or on predicting how his current results will regress in the future. cFIP, however, allows us to be bolder: it permits us to estimate the pitcher’s true talent in the components we are measuring.
When is a pitcher quality estimator actually isolating true talent? My answer is this: when there is a substantial similarity between the estimator’s descriptive and predictive power. If an estimator is truly isolating a pitcher’s talent, there should not be much difference between the two. If an estimator is doing well in one aspect and poorly in another, then it is not estimating a pitcher’s true ability: rather, it is over-fitting past results to better explain what happened (primarily descriptive) or under-fitting past results to minimize future error (primarily predictive).
There is nothing wrong with choosing statistics that skew one way or the other on the descriptive-predictive spectrum, particularly when the author is transparent about which way the statistic swings. But a statistic that is notably skewed one way or the other is not accurately evaluating pitchers’ actual ability.
The degree of similarity between a metric’s descriptive and predictive power reduces to simply taking the mathematical difference between the two. Note that I am defining “predictive” through my preferred “estimator predicting itself” method. I’ll use absolute values to keep things simple, and please remember that a lower differential is better. Here is how our estimators stack up with pitchers who have faced 170-plus batters:
|Descriptive vs. Predictive Power, Min. 170 TBF|
And here is the same comparison for all pitchers, regardless of sample size:
|Descriptive vs. Predictive Power, All Pitchers|
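The differential itself is simple arithmetic: take the absolute values of the descriptive and predictive correlations and subtract. A tiny sketch, using invented correlations rather than the figures from the tables and a function name of my own:

```r
# Gap between descriptive power (in-season correlation to RE24/PA, negative
# by construction) and predictive power (the metric's correlation with itself
# next season, positive by construction). Smaller gaps are better.
consistency_gap <- function(descriptive, predictive) {
  abs(abs(descriptive) - abs(predictive))
}

# Hypothetical estimators; these correlations are made up for illustration.
gaps <- c(
  skews_descriptive = consistency_gap(descriptive = -0.90, predictive = 0.30),
  balanced          = consistency_gap(descriptive = -0.62, predictive = 0.60)
)
print(gaps)
```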
cFIP is the winner and is by far the most consistent in its descriptive and predictive assessments of pitcher performance. In other words, at all times, cFIP does by far the best job at assessing a pitcher’s true underlying ability (within the components it considers). The other statistics consistently overfit past performance relative to each player’s true talent when evaluating in-season performance. Again, there is nothing wrong with that: they are trying to explain what happened and, to varying degrees, doing a good job of it. But what they are not doing is consistently revealing the true talent of the pitcher on the mound, particularly in small samples.
Although cFIP is an exciting development, I consider it to be the beginning, not the end, of our efforts to bring better context to baseball statistics. If CSAA brought mixed models out into the open, then cFIP demonstrates we have many other potential applications for them. In that regard, there is no reason why xFIP, SIERA, and other promising efforts like TIPS and BERA cannot themselves be reworked within a mixed model framework. When so reinforced, they may very well surpass cFIP. This is especially true of kwERA, particularly if researchers are comfortable classifying its ability to project true talent as arising solely from its strong predictive ability. I hope other baseball researchers make these efforts, and to help them do this, I have provided the model specifications for the underlying cFIP components in the Appendix.
I look forward to a robust discussion of what cFIP means and how it can make baseball analysis better. For the time being, cFIP gives us a glimpse of the world into which we are headed.
References & Resources
- Special thanks to Tom Tango, Harry Pavlidis and Dan Turkenkopf, all of whose suggestions made this a much better paper. Any remaining errors are solely mine.
- Bates D, Maechler M, Bolker B and Walker S (2014). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-7.
- R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- The indispensable Retrosheet.
Appendix: The Models
For home runs, I used the following equation:
library(lme4)  # provides glmer(); load once before fitting any of the models below
HR.2014.glmer <- glmer(HR ~ stands*throws + stadium + (1|batter) + (1|pitcher), data=HR.2014, family=binomial(link='probit'), nAGQ=0)
This is a generalized linear mixed model. The output is whether the plate appearance ended up in a home run (1,0). The fixed-effect variables that were considered included stands*throws (an interaction between the batter’s side of the box and the pitcher’s handedness) and stadium (the park in which the home run took place). The random effects are the batter and pitcher involved.
The mixed model computed a conditional mode for each pitcher, in each season, as to whether they made a home run more or less likely than average, and to what extent. As with CSAA and the other models in this paper, the probability of a home run for each pitcher was subtracted from the null probability of a home run under average circumstances, to isolate the net home run probability contributed by each pitcher above or below average for the season. The same process was used for all the other components evaluated.
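As a rough illustration of that last step, here is how a conditional mode on the probit scale might be turned into a net probability. With a fitted lme4 model, the inputs would come from `fixef()` and `ranef()`; in this sketch they are hypothetical numbers so the example runs on its own. Treating the model intercept as "average circumstances" and reporting pitcher minus null (so a positive value means more likely than average) are both simplifying assumptions of mine.

```r
# Convert a pitcher's conditional mode (probit scale) into a net probability
# relative to the league-average "null" probability.
net_probability <- function(intercept, pitcher_effect) {
  pnorm(intercept + pitcher_effect) - pnorm(intercept)
}

# Hypothetical inputs: a null home run probability of 2.5 percent and three
# made-up pitcher effects standing in for ranef(model)$pitcher.
null_intercept  <- qnorm(0.025)
pitcher_effects <- c(stingy = -0.10, average = 0, homer_prone = 0.10)

net_hr <- net_probability(null_intercept, pitcher_effects)
print(round(net_hr, 4))
```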
For (unintentional) walks, I used the following model:
BB.2014.glmer <- glmer(BB ~ bat_home + stands*throws + stadium + (1|batter) + (1|pitcher) + (1|catcher) + (1|umpire), data=BB.2014, family=binomial(link='probit'), nAGQ=0)
For hit-by-pitch events, I used this model:
HBP.2014.glmer <- glmer(HBP ~ bat_home + stands*throws + stadium + (1|batter) + (1|pitcher) + (1|catcher), data=HBP.2014, family=binomial(link='probit'), nAGQ=0)
Finally, for strikeouts, this model was used:
K.2014.glmer <- glmer(K ~ bat_home + stands*throws + stadium + (1|batter) + (1|pitcher) + (1|catcher), data=K.2014, family=binomial(link='probit'), nAGQ=0)