FIP, Game Score, and evaluating starting pitching

If you read my first piece at THT a couple weeks ago, you are aware of my recent interest in questions as they pertain to baseball, and my not-so-recent interest in questions as they pertain to life, philosophy, and all that jazz. I’m a strong believer that in order to discover new things, and improve upon the things that we have already discovered, we need to know how to ask the right questions.

Last time I looked at what sorts of questions the most common offensive statistics answer, or are at least trying to answer. Naturally, my first thought for this piece was to turn to pitching statistics. But instead of looking at three or more stats like last time, I’m going to be more specific and expand on a couple of stats that I find most interesting.

Which stats? Well, though there is much to be said about saves, I find the way in which we measure starting pitcher performance much more fascinating, maybe just because it’s so much more important. Traditionally, pitcher wins and ERA have been the primary measures of a starting pitcher’s performance, as well as of the quality of an individual start. Of course, we all know that ERA and especially wins aren’t the best indicators of pitcher quality, particularly at an individual game level.

But just because they aren’t perfect doesn’t mean they aren’t useful. Why are they useful? Because they give us interesting questions to answer, of course!

Pitcher Wins

On the surface: “In how many games did the starting pitcher’s team win the game, given that he pitched five innings, left with a lead, and that lead held for the remainder of the game?”

Digging deeper: “In how many wins was the pitcher the main contributor (of all pitchers) to said win?”

The fundamental question: “How many games did the pitcher help his team win?”


ERA

On the surface: “At what rate did the pitcher prevent runs from scoring, not counting runs resulting from plays deemed to be errors by the official scorer?”

Digging deeper: “At what rate did the pitcher, of his own accord, prevent runs from scoring?”

The fundamental question: “How well did the pitcher pitch?”

In the end, wins and ERA are getting at very similar questions. The former wants to measure pitcher performance as a counting statistic—that is, how many wins can we credit to the pitcher—while the latter wants to measure pitching performance as a rate statistic—that is, simply, how good is the pitcher?

Here’s the thing: for a starting pitcher, there is less use than one might think in separating the counting statistic and the rate statistic. After all, though we divide a starter’s performance into a number of units, such as batters faced and innings pitched, in the end the most important unit for a starting pitcher is games started.

Before you object, I don’t mean that starts should be the denominator when we’re measuring strikeout rate or walk rate or what have you. I mean that games started are the only units that are, for the most part, independent of each other. In other words, each batter faced is not independent—what happens in one plate appearance has an effect on the next plate appearance, whether it be a change in approach, pitching out of the stretch, or fatigue for the pitcher. The same goes for innings pitched; each inning is not independent of the previous one.

On the other hand, each start is basically a new event for the pitcher. Sure, there might be some fatigue or changed approach carried over, but for the most part, the end of one start marks the end of that pitching performance. This is what pitcher wins gets right—in theory, it counts the starts that helped the team win and doesn’t count the starts that didn’t do this.

Getting back to rate stats vs. counting stats, the idea that games started is the most important unit for a starting pitcher means that as far as measuring pitching performance goes, we want to base our judgments on what a pitcher does per game, not per inning or per batter faced. There are certainly times (in fact, many or most times) in which those are better units of measurement, but when talking about the quality of starting pitching performance, we fundamentally care about what the pitcher does per game.

I’ll admit, this train of thought was prompted not by the above questions, but by a quick look at the FanGraphs leaderboards for starting pitchers after the first couple days of the season. At first, the list isn’t all that surprising: Yu Darvish, Jeff Samardzija, and Clayton Kershaw, each of whom pitched gems, lead the pack in fWAR.

However, as you move down the list, there are a few questionable entries. One pair stood out in particular:

Matt Harrison: 5.2 IP, 6 R, 5 ER, 9 K, 3 BB, 0 HR, 0.2 WAR
Stephen Strasburg: 7 IP, 0 R, 0 ER, 3 K, 0 BB, 0 HR, 0.2 WAR

At least in my view, Stephen Strasburg pretty clearly pitched a better game than Matt Harrison. He pitched over an inning more than Harrison while giving up six fewer runs (five fewer earned) and allowing six fewer baserunners. So why is Harrison (barely) ahead in WAR? Well, as most of you likely know, FanGraphs uses FIP to calculate WAR, only counting strikeouts, walks, and home runs.
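
To make that concrete, here is a rough sketch of the standard FIP formula applied to the two lines above. A few things here are assumptions rather than anything from the leaderboard: the constant of 3.10 is only a typical value (the real one changes every season), hit batsmen are taken to be zero because the lines above don’t list them, and the helper names are made up for illustration. FanGraphs’ WAR also involves further adjustments that this doesn’t attempt to model.

```python
# Assumed, typical FIP constant; the real value is recomputed each season.
FIP_CONSTANT = 3.10


def innings_to_float(ip_notation: str) -> float:
    """Convert box-score innings ('5.2' means 5 and 2/3) to a real number."""
    whole, _, thirds = ip_notation.partition(".")
    return int(whole) + (int(thirds) / 3 if thirds else 0.0)


def fip(hr: int, bb: int, k: int, ip: float, hbp: int = 0,
        constant: float = FIP_CONSTANT) -> float:
    """Standard FIP: only home runs, walks, hit batsmen and strikeouts count."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Stat lines from the article; HBP is not listed there, so it is assumed to be 0.
harrison = fip(hr=0, bb=3, k=9, ip=innings_to_float("5.2"))
strasburg = fip(hr=0, bb=0, k=3, ip=innings_to_float("7"))

print(f"Harrison FIP:  {harrison:.2f}")
print(f"Strasburg FIP: {strasburg:.2f}")
```

Under those assumptions Harrison’s FIP comes out lower than Strasburg’s, because nine strikeouts in 5.2 innings outweigh three walks, and the six runs never enter the formula at all.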

Now I’m not here to bash FIP—I think it’s a great metric and over a season it’s a great stat to use. However, it seems to miss the mark on an individual game level. Not only do we, as fans, care about non-HR hits and runs, but it seems ridiculous to discount them when evaluating a single pitching performance. If a pitcher gives up a bunch of line drive doubles and six runs, yet gives up no home runs and gets eight strikeouts and no walks, do we really want to say that he pitched a good game? I don’t think so.

Here’s the problem: if FIP doesn’t work on an individual game level, then it can’t really work on a larger scale. The reason it looks like it works is that those weird games where FIP overvalues or undervalues a performance tend to balance out in the end. For the most part, the flaws that we see on an individual basis don’t appear when we aggregate many games together.

What do we conclude from this? Well, if we assume that each game started is basically independent of the previous one, all we need to do is figure out a way to evaluate each start on an individual basis, and then we can simply add up the scores for each start to determine the pitcher’s value.

That’s where Game Score comes in. A metric developed by Bill James, Game Score does exactly what I just described: evaluates each start individually in order to rank its quality. While the original Game Score formula may be a bit imperfect, the idea is sound. If we can figure out a good way to value each start individually, we can use that to value a season’s worth of pitching.
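
For reference, here is a minimal sketch of the original Bill James formula, plus the “add up every start” idea from above. The two stat lines are hypothetical (the hit totals in particular are invented for illustration), and the function and field names are my own, not part of any official implementation.

```python
def game_score(outs: int, hits: int, earned_runs: int, unearned_runs: int,
               walks: int, strikeouts: int) -> int:
    """Original Bill James Game Score for a single start."""
    innings_completed = outs // 3
    score = 50                                  # every start begins at 50
    score += outs                               # +1 per out recorded
    score += 2 * max(0, innings_completed - 4)  # +2 per inning completed after the 4th
    score += strikeouts                         # +1 per strikeout
    score -= 2 * hits                           # -2 per hit allowed
    score -= 4 * earned_runs                    # -4 per earned run
    score -= 2 * unearned_runs                  # -2 per unearned run
    score -= walks                              # -1 per walk
    return score


def season_total(starts: list[dict]) -> int:
    """Treat each start as independent and simply sum the scores."""
    return sum(game_score(**start) for start in starts)


# Hypothetical starts, not real box scores.
starts = [
    # 7 shutout innings, few strikeouts
    dict(outs=21, hits=3, earned_runs=0, unearned_runs=0, walks=0, strikeouts=3),
    # 5.2 innings, lots of strikeouts, six runs allowed
    dict(outs=17, hits=8, earned_runs=5, unearned_runs=1, walks=3, strikeouts=9),
]

for start in starts:
    print(game_score(**start))   # 74, then 37
print(season_total(starts))      # 111
```

Unlike FIP, this puts the seven shutout innings well ahead of the high-strikeout start, since hits and runs are charged directly. A raw sum like this is only a starting point; it would still need the park, league, and wins scaling before it could stand in for a value metric.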

How can we improve Game Score? Well, that’s a great question, and one that I don’t know how to answer right now. First of all, I’d encourage you to check out a post by Tom Tango at FanGraphs a few years ago, in which he introduced some possible versions of Game Score (and inspired great discussion in the comments). Tom also reintroduced the post yesterday on his site (which I was unaware of until halfway through this article), so you should check that out and provide some input.

I can’t think of a good reason why we can’t use Game Score to evaluate a starting pitcher’s season-wide value or contribution. There may need to be some extra steps to account for park and league and scale the number to wins, but as a concept, this could be an interesting alternative to our current evaluation tools.



  1. Matt Hunter said...

    That sounds fascinating, Jon – can’t wait to see what you find. But yeah, that would be an interesting way to evaluate starts. Seems pretty tough given the sheer number of variables that go into what makes a pitch effective.

  2. db said...

    fWAR is stupid because it looks at components.  bWAR is better because it looks at results.  Nobody calculates offensive value based on batted ball profiles.  No reason to do so for pitching.

  3. Todd Boss said...

    So, FIP is still “okay” to use on a more macrolevel because, quoting from above, “those weird games where FIP overvalues or undervalues a performance tend to balance out in the end.”

    Well, can’t I make the exact same argument about the much-maligned Pitcher Win statistic?  I can get a Win whether I pitch fantastically or awfully, and over time those games where I go 8 shutout innings and get a ND and those games where I give up 6 in 5 and get a win b/c my team has bashed their way to victory should “balance out in the end.”


  4. Carl said...


    Remember Tim McCarver’s line – “It’s not what you throw. It’s how you got there”.

    An 85 hanging curve looks awesome after 10 straight 90+ fastballs.  An 85 hanging curveball after an 85 good curveball is a homer.

  5. Carl said...

    Very open to comment/criticism:

    What about taking a study (pivot table type) of the percentage of starting pitchers who pitched the # of innings per game with the number of runs given up, and the percentage of times the teams won that game? 

    The resulting percentage would give each pitcher the same percentage of a win.  For example, if 92.9% of pitchers who pitched 8 shutout innings won, 93.1% who pitched 8.1 innings won, and 99.9% who pitched 9 innings won, then one would credit Kershaw w a .99 win.  Based on the same chart, starters who pitched 9 innings but gave up 8 runs yet won 12-8 would get less of a “win”.  Starters who pitched < 5 innings would get 0 win credit.  Then, at the end of the year, the total “percentage wins” divided by starts would give the pitcher’s “winning percentage”.

    Advantages: Two pitchers who win 1-0 and 9-0 both get the same winning percentage.  Pitchers who pitch more innings would get a higher winning percentage.  The pivot-table could be adjusted over time to evaluate different starters.  Pitchers on bad teams would get credit for wins even if they lost 1-0.  Pitchers w good records only because they get lots of run support would get fewer wins/lower winning percentage than other pitchers.  W/o looking at K’s, the above example from the article would be avoided.

    Please comment (and even follow-up w the pivot table and sample wins/winning percentages) as you see fit.

  6. Matt Hunter said...

    @Todd: You can make that argument, yes. The problem is that it takes much much longer for pitcher wins to balance out than FIP. For most pitchers, 200 innings of FIP will do a pretty good job at representing their performance, but 200 innings of wins can be vastly inaccurate. Not to mention the quality of the offense won’t balance out in one year.

  7. Jon Roegele said...

    @Carl – Yes, I’m sure there is some effect of pitch sequencing. The question is, how large is this effect? As an example, I looked at pitches that cause pop outs at one point, and found that the cause appeared to be much more due to the characteristics of the actual pitch that caused the pop out than the difference in location/velocity/movement from the previous pitch.

    I don’t doubt that pitch sequencing plays a part, but I’m curious if I build this up one piece at a time how much/little each additional piece will play a part in describing pitching performance.

  8. Jon Roegele said...

    These are interesting questions and ideas, Matt.

    I’m in the process of working on developing a metric for evaluating pitchers based purely upon the quality of their pitches, which will be determined solely by characteristics of pitches like pitch type, location, velocity and movement. It essentially uses league-average wOBA against for each pitcher-batter handedness combination, pitch type, strike zone sub-zone, velocity quartile, etc. as its guide for pitch quality. I’ve realized to describe actual performance better, I also have to adjust for park effects, and maybe league differences as well.

    Anyway, if this works to a point, one of the things I liked about it is that it could certainly be used at a game level to produce a “pitch quality” game score. It could tell you how well a pitcher was locating pitches at which velocities, etc. without regard for the actual outcome of the plate appearances. It would be another way to evaluate pitching that is certainly different from any I’ve seen, in that it just looks at pitch characteristics to make the evaluation and ignores all outcomes. It treats every pitch independently.

    This would never be as descriptive or accurate as those suggested in the Tango article without looking at K, BB, R, etc. I just am fascinated by how much I might be able to describe using this method over both a long period and just one game.

    I guess I’ll see how descriptive it will end up being after I’ve added in all of the pieces that I can.

  9. NatsLady said...

    Very troubled by FIP, especially since the Nats have a particular directive to the pitching staff (including Strasburg) NOT to go for strikeouts.  Stras was very efficient (7IP 80 pitches), so efficient that outsiders were complaining that he should have been left in longer. 

    I don’t want to bow down to the pitch count god, but if coaches are saying, “You have a good D, use it” then FIP alone is missing something, because pitchers’ goal is NOT strikeouts, pitchers’ goal is to be ABLE to strike batters out when they are in a jam.

  10. Tangotiger said...

    Natslady: there’s nothing to be troubled by in FIP.  It simply tells you who has the best combination of K, BB, and HR. If Stras ends up with 3K per start with 0 BB and 0 HR every 7 IP start, he’ll lead the league in FIP.

  11. Cliff Blau said...

    fWAR isn’t stupid for looking at components. The problem with just using FIP is that it leaves out certain aspects of the pitcher’s job, such as his ability to hold runners on and induce doubleplays.
