Are Foolish Consistencies the Hobgoblins of Starting Pitchers?

by Sal Baxamusa
November 6, 2006

How often do you hear an announcer or analyst say, “Johnny Flamethrower has great stuff, he just needs to be more consistent”? Consistency always seems to be the bugaboo that keeps Johnny from becoming an elite pitcher. Too often, however, “consistency” is just a code word for “better.” When a pitcher throws a two-hit shutout five days after giving up six runs in five innings, asking him to be more consistent is asking for two 7 IP/3 R games. So do we want the pitcher to be more consistent, or do we want the pitcher to consistently throw like he did during his two-hit shutout? Of course it’s the latter. We don’t want him to be more consistent; we want him to be consistently better.

It may seem semantic, but it’s an important distinction. Are pitchers whose performances don’t vary much from start to start more valuable than pitchers who distribute their performance over a wider range? David Gassko looked at this last April and used a toy problem to show that the inconsistent pitcher is actually slightly more valuable than the consistent pitcher. It’s a fascinating question because performances in baseball are not perfectly continuous: each game is a discrete event with a binary outcome. I am not sure if baseball players can control the distribution of their performances, so the predictive value of breaking down performance game by game is unknown. But for judgments about value (the kind of arguments for which the internet was invented) it certainly is interesting.

With that in mind, let’s look at the National League Cy Young candidates. Three pitchers have pretty good cases: Brandon Webb, Roy Oswalt, and last year’s winner Chris Carpenter.

Pitcher		IP	RA	FIP
Webb		235.0	3.49	3.20
Oswalt		220.2	3.10	3.32
Carpenter	221.3	3.29	3.47

At first blush, all three candidates look similar. Webb pitched more innings but allowed the most runs; Oswalt tossed the fewest innings but had the best RA of the three; Carpenter was pretty much in between as far as allowing runs but received a good bit of defensive support and didn’t match Webb’s innings total. Without thinking too much about it, I’d think (in a fair world without the BBWAA) Oswalt was the front runner based on run prevention, with Webb next based on innings totals, and Carpenter trailing both.

Can the distributions of their performances tell us anything? Let’s set some ground rules first: any innings not pitched by one of the three candidates are left to a bullpen that allows the 2006 NL average of 4.88 runs per nine innings. The winning percentage for that game is then determined by taking the Pythagorean winning percentage based on the runs allowed in that game and the NL average runs per nine innings. For example, Carpenter pitched six innings and allowed four runs in his start against the Brewers on August 4, so the expected number of runs allowed in that game is 4 + 4.88/9*(9-6) = 5.63. Call this number “Expected Runs/Game,” or ERG.
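The ERG arithmetic above can be sketched in a few lines of Python. This is only an illustration of the stated ground rules; the function name `expected_runs_per_game` and the constant `NL_RA9` are made up for this example, and innings are assumed to be true fractions (6 1/3 innings is `6 + 1/3`, not the box-score shorthand 6.1).

```python
# Minimal sketch of the ERG ground rule described in the article.
# Names are hypothetical; only the 4.88 figure comes from the text.

NL_RA9 = 4.88  # 2006 NL average runs allowed per nine innings (from the article)


def expected_runs_per_game(runs_allowed, innings_pitched, league_ra9=NL_RA9):
    """Starter's runs plus league-average bullpen runs over the remaining innings."""
    # Assumption: a start longer than nine innings leaves nothing for the bullpen.
    remaining_innings = max(9.0 - innings_pitched, 0.0)
    return runs_allowed + league_ra9 / 9.0 * remaining_innings


# Carpenter vs. the Brewers, August 4: six innings, four runs.
print(round(expected_runs_per_game(4, 6), 2))  # 5.63
```

A complete-game shutout comes out to an ERG of zero, since no innings are handed to the bullpen.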

By using the ERG and the average of 4.88 runs per nine innings, we can compute the Pythagorean winning percentage for this game as .429. By summing this number over each game started, we get an idea of the number of games the team could expect to win with this pitcher on the mound. Call this “Expected Team Wins,” or ETW. (I initially thought that a better way to do it is to use winning percentage by runs scored, but I had difficulty reconciling this with the fact I would have to mix a non-discrete measurement [ERG] with a discrete one [winning percentage as a function of runs scored]. I ultimately chose the Pythagorean method for simplicity’s sake.)
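For readers who want to reproduce the .429 figure, here is a rough Python sketch of the per-game winning percentage and the ETW sum. The article does not state which Pythagorean exponent was used; an exponent of 2 is assumed here because it reproduces the .429 quoted for the August 4 start. All function names and the two-start list are hypothetical, for illustration only.

```python
# Sketch of per-game Pythagorean winning percentage and Expected Team Wins.
# Assumption: classic exponent of 2 (not stated in the article).

NL_RA9 = 4.88  # 2006 NL average runs per nine innings (from the article)


def erg(runs_allowed, innings_pitched, league_ra9=NL_RA9):
    """Expected Runs/Game: starter's runs plus an average bullpen for the rest."""
    return runs_allowed + league_ra9 / 9.0 * max(9.0 - innings_pitched, 0.0)


def pythag_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean winning percentage for one game."""
    rs = runs_scored ** exponent
    return rs / (rs + runs_allowed ** exponent)


def expected_team_wins(starts, league_ra9=NL_RA9):
    """Sum the per-game winning percentage over (runs, innings) pairs."""
    return sum(pythag_win_pct(league_ra9, erg(r, ip, league_ra9)) for r, ip in starts)


# The August 4 start: four runs in six innings.
print(round(pythag_win_pct(NL_RA9, erg(4, 6)), 3))  # 0.429

# Hypothetical two-start "season": that start plus a complete-game shutout.
print(round(expected_team_wins([(4, 6), (0, 9)]), 3))  # 1.429
```

The offense is held at the league average in every game, so differences in ETW come entirely from how the pitcher's runs allowed are distributed across starts.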

This is obviously not a perfect system, since bullpen usage is different based on game situation and road losses do not require a full nine innings pitched. There are also no park or defense adjustments. Still, as a first pass approximation, this is good enough. The following plots are histograms showing the distribution of performance for these pitchers.

Carpenter’s performance had the widest distribution of the three, Oswalt the narrowest. In particular, Carpenter had a large number of games in which he gave up zero, one, two or three runs. He also had a number of stinkers in the seven-to-nine range. Oswalt was the closest to being consistent, never tossing a shutout, frequently giving up four or five runs, and rarely being touched up for a number of runs without throwing many innings. Webb is somewhere in between the two. Also on the plots are the average ERG, standard deviation of ERG, and the ETW for each pitcher. (Chris Carpenter’s ETW is prorated to 33 starts from 32 starts for pedagogical purposes.)

Ranked by ETW, Carpenter comes out ahead, then Webb, then Oswalt, exactly the opposite of my initial impression based on gross averages. One important thing to notice is that less than one win separates Carpenter from Oswalt, so we’re not talking about back-breaking differences here. Still, looking at performance distribution has turned my first guess on its head.

Another thing to notice (surprise!) is that ETW tracks with standard deviation of ERG. Indeed, it appears that Oswalt’s consistency worked against him. The somewhat bimodal distributions for Webb and Carpenter worked in their favor, especially the shutouts. Allowing zero runs is always a sure win, and allowing three runs was a win 69% of the time last year. But allowing four or five runs per game, the most frequent ERG for Roy Oswalt, was a win only 53% or 43% of the time, respectively. Better to skew your performances to the extremes as Carpenter did than toward the middle as Oswalt did. If the job of the pitcher is to give his team the best chance to win his starts, then Chris Carpenter deserves serious consideration for being a more effective starter last year than Roy Oswalt, despite the .19 difference in RA.
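The same effect shows up in a toy comparison built from the introduction's example: two 7 IP/3 R starts versus a shutout followed by six runs in five innings. Both pairs add up to 14 innings and six runs. The sketch below is an illustration, not the calculation behind the plots; it assumes a Pythagorean exponent of 2 and a complete-game shutout, and every name in it is hypothetical.

```python
# Toy comparison: same innings and runs, different distributions.
# Assumptions: Pythagorean exponent of 2; the shutout is a full nine innings.
from statistics import pstdev

NL_RA9 = 4.88  # 2006 NL average runs per nine innings (from the article)


def erg(runs_allowed, innings_pitched):
    """Starter's runs plus a league-average bullpen for the remaining innings."""
    return runs_allowed + NL_RA9 / 9.0 * max(9.0 - innings_pitched, 0.0)


def win_pct(expected_runs, exponent=2.0):
    """Pythagorean winning percentage against a league-average offense."""
    rs = NL_RA9 ** exponent
    return rs / (rs + expected_runs ** exponent)


def summarize(starts):
    """Return (standard deviation of ERG, Expected Team Wins) for (runs, IP) pairs."""
    ergs = [erg(r, ip) for r, ip in starts]
    return pstdev(ergs), sum(win_pct(e) for e in ergs)


consistent = [(3, 7), (3, 7)]    # two 7 IP/3 R games
inconsistent = [(0, 9), (6, 5)]  # shutout, then six runs in five innings

for label, starts in (("consistent", consistent), ("inconsistent", inconsistent)):
    sd, etw = summarize(starts)
    print(f"{label:13s} sd(ERG) = {sd:.2f}  ETW = {etw:.2f}")
```

Under these assumptions the consistent pair is worth about 1.18 expected wins and the inconsistent pair about 1.26, so the wider distribution comes out ahead even though the totals are identical.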

To be fair, I prorated Chris Carpenter’s ETW to 33 starts for the purposes of comparison. In reality, his 32 starts give him 21.26 ETW, less than one-tenth of a win above Oswalt. But the point of this exercise is not to choose the NL Cy Young award winner. Rather, it’s evidence that the manner in which players distribute their performance can affect their value to the team. I think that further investigation will show that this is almost certainly true. Whether or not it is predictive is a more interesting (and more complex) question. Can players, and therefore teams, actually control the distribution of their performance?
