Measuring the umpire’s effect on the game

by Nick Steiner
August 17, 2009

We all know umpires make mistakes, especially when calling balls and strikes. While some people will argue that those mistakes are part of the game, there are few who are able to make a convincing argument to support that. On the other hand, the blogosphere is littered with arguments against umpires and for a computerized zone.

I’m not here to take sides (I’m actually firmly in the camp for human umpires). Instead, I wanted to take a look at how much of an impact umpires actually have. First, I wanted to see how accurate umpires are. To do that, I went into my Pitch f/x data and grabbed all pitches that were called strikes and all pitches that were called balls. Then, I mapped them out onto an approximation of a major league strike zone.

I used the average bottom and top of the hitter zones provided by Gameday, 1.6 and 3.4 feet above the ground respectively, as the vertical ends of the strike zone. Then I used the official major league horizontal zone (17 inches) and added two inches of leeway to each side. I also normalized the vertical position of each pitch to batter height.

Here is what I got:

Remember that this is from the catcher’s point of view.

As you can see, there is significant overlap. While umpires are pretty good at judging the high end of the strike zone, they are absolutely dreadful at judging the bottom and the sides, especially the third base side. Overall, 9.1% of pitches that were called balls were inside of the strike zone, and 21.7% of pitches that were called strikes were outside of the strike zone. I find that second figure outstanding, especially given that I am already giving the umps a pretty lenient strike zone.
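For readers who want to reproduce those two percentages, here is a minimal sketch of the zone test and the miss rates. It assumes `px` is the horizontal distance from the center of the plate in feet and `pz` is the height-normalized vertical position in feet; the `call` key and the sample pitches are made up for illustration and are not the actual query behind the numbers above.

```python
# Zone limits in feet, taken from the description above: 1.6 to 3.4
# vertically, and a 17-inch plate plus 2 inches of leeway on each side.
ZONE_BOTTOM = 1.6
ZONE_TOP = 3.4
HALF_WIDTH = (17.0 / 2 + 2.0) / 12.0  # 10.5 inches = 0.875 feet


def in_zone(px, pz):
    """True if a pitch at (px, pz), in feet, is inside the approximate zone."""
    return abs(px) <= HALF_WIDTH and ZONE_BOTTOM <= pz <= ZONE_TOP


def miss_rates(pitches):
    """Return (share of called balls inside the zone,
    share of called strikes outside the zone)."""
    balls = [p for p in pitches if p["call"] == "ball"]
    strikes = [p for p in pitches if p["call"] == "strike"]
    balls_in = sum(in_zone(p["px"], p["pz"]) for p in balls)
    strikes_out = sum(not in_zone(p["px"], p["pz"]) for p in strikes)
    return (
        balls_in / len(balls) if balls else 0.0,
        strikes_out / len(strikes) if strikes else 0.0,
    )


# Made-up sample pitches (hypothetical data, not from Pitch f/x).
sample = [
    {"px": 0.0, "pz": 2.5, "call": "strike"},   # strike, inside the zone
    {"px": 1.2, "pz": 2.5, "call": "strike"},   # strike, off the plate
    {"px": 0.3, "pz": 2.0, "call": "ball"},     # ball, inside the zone
    {"px": -1.5, "pz": 1.0, "call": "ball"},    # ball, well outside
]
print(miss_rates(sample))  # (0.5, 0.5)
```

Changing the three constants at the top is all it takes to try a different zone definition.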

If I change the perimeters of the zone to 2 feet both ways, then those percentages become 16.5% and 11.6% respectively. So, assuming that the umpires have no bias towards hitters or pitchers, the “real” zone is likely somewhere in between that.

John Walsh already did some great work a couple of years ago on figuring out the “real” strike zone, and I may try to update that later having the benefit of more accurate Pitch f/x data. However, for now, I wanted to take a look at this from another angle.

Those percentages I quoted above are huge numbers. Any way you swing it, it appears that the umpires are only about 85% accurate, at least this year. That leaves a lot of room for random variation among players. How much? Well, let’s find out.

I queried all pitchers who have thrown at least 500 pitches this year, 329 in total, and sorted each pitcher by the number of pitches called strikes that were outside of the strike zone minus the number of pitches called balls that were inside of the strike zone. Then I divided by total pitches to turn it into a rate stat, multiplied that by 100 pitches (roughly one game), and named the result “Gift Rate”. Here are the results shown graphically:

In case it isn’t clear, the x axis is all pitchers who’ve thrown at least 500 pitches this year.

You can read that as the number of “gifts” minus the number of “squeezes” each pitcher receives per game. You can see that despite the old adage, it does not all even out. Some of that may be due to measurement error, as I don’t profess my strike zone to be very thorough and there still may be problems with the Pitch f/x data (namely park effects), and there may be some sampling error as well; however, it’s clear that umpires affect some pitchers more than others.
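The Gift Rate arithmetic can be sketched in a few lines. The function names, the tuple layout and the pitcher counts below are hypothetical; only the formula (gifts minus squeezes, divided by total pitches, times 100) and the 500-pitch cutoff come from the description above.

```python
def gift_rate(gifts, squeezes, total_pitches):
    """Net gifts (strikes called outside the zone minus balls called
    inside the zone) per 100 pitches."""
    if total_pitches <= 0:
        raise ValueError("total_pitches must be positive")
    return 100.0 * (gifts - squeezes) / total_pitches


def gift_rates(pitchers, min_pitches=500):
    """Map each qualifying pitcher to a Gift Rate.

    `pitchers` maps a name to a (gifts, squeezes, total_pitches) tuple.
    Pitchers below `min_pitches` are dropped.
    """
    return {
        name: gift_rate(g, s, n)
        for name, (g, s, n) in pitchers.items()
        if n >= min_pitches
    }


# Made-up counts, for illustration only.
counts = {
    "Pitcher A": (70, 20, 1000),  # helped: +5.0 per 100 pitches
    "Pitcher B": (10, 40, 1000),  # hurt: -3.0 per 100 pitches
    "Pitcher C": (5, 1, 200),     # under 500 pitches, excluded
}
print(gift_rates(counts))  # {'Pitcher A': 5.0, 'Pitcher B': -3.0}
```

A positive value means the pitcher is getting more gifts than squeezes, and a negative value means the reverse.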

The standard deviation of Gift Rate among pitchers this year is about 1.6, which means that roughly 68% of pitchers will have up to a 1.6% difference in their strike rate based on umpires alone. That may not sound like a lot, but consider that, based on this year’s data alone, there is an R^2 of about .62 on strike% vs. BB/9. The average difference in walk rate among guys with a 1.6% difference in their strike% is about .4 walks per nine innings, which is pretty significant.

Going by the FIP formula, if you added or subtracted .4 walks per 9 for a league average pitcher, their FIP would move by about .20 points. Obviously this doesn’t consider how strike% affects K rate, among other factors. So in order to get more actionable numbers, a more rigorous study needs to be done. However, it serves as a reasonable illustration of the impact that umpires can have.

Now, here are the pitchers who have been getting the biggest help this year:

1) Derek Lowe: 5.6
2) David Weathers: 5.5
3) Javier Vazquez: 5.1
4) Mariano Rivera: 5.0
5) Livan Hernandez: 4.9

And here are the guys who have been hurt the most:

1) Dontrelle Willis: -3.6
2) Brandon League: -3.4
3) Charlie Morton: -3.2
4) Ryan Rowland-Smith: -3.1
5) Dana Eveland: -3.0

It’s hard to see any sort of bias in those lists. Among the leaders, you have two of the best pitchers in baseball (Mariano and Vazquez) and two of the worst (Hernandez and Weathers). The trailers are filled with guys with abysmal control, like Willis, and guys with good control, like League. For those who want it, here is the complete list (pitchers are labeled by their Elias ID and my SQL is acting wonky right now, so you’ll have to do some translating).

The next step, along with creating a more accurate strike zone, is finding how much of an impact those missed calls have. We all know that certain missed calls are more significant than others; however, as I showed earlier, the potential impact of a lost or gained strike may be pretty significant in itself.

We use FIP, tRA and other such metrics to eliminate defense and other kinds of luck from pitcher ability. However, it’s possible that umpires themselves may have as big an effect, if not a bigger one.

25 Comments


Detroit Michael

14 years ago

Is there any trend from year-to-year for the same guys to be helped or hurt by their calls? For example, maybe Mariano Rivera has sharp enough control that he can consistently stretch his personal strike zone a bit.

Nick Steiner

14 years ago

Good question Michael

Last year, Mariano had a 5.00 mark! And Lowe had a 5.6 mark! When I have time later today, I’ll run a correlation of all pitchers from 08-09, and try to see how much of a “skill” it may be.

Matt Mitchell

14 years ago

I can make sense of why umpires would miss more on the low pitches. What doesn’t make sense to me are the inside pitches to right-handed batters, since any umpire clinic I’ve been to has taught me to set up on the INSIDE part of the plate. I would guess that even if the data is split by RHB/LHB, there’s still a really high number of pitches being called strikes that are too far inside according to Gameday. Could this be umpire hubris, miscalibrated cameras, or something I’m not even thinking of? I’m sure many will lean towards the first option, but I’m not sold on that.

Dave

14 years ago

The problem with this data is the strike zone. You simply cannot use an approx SZ or an average SZ, it just doesn’t work. SZ sizes can change dramatically from batter to batter. To get an accurate feel of an ump’s impact on a game based on his strike/ball calls you have to really break it down batter-per-batter.

Nick Steiner

14 years ago

Dave – I adjusted for individual batter height by normalizing the pz values of each pitch to the league average strike zone. MLB fortunately gives us an approximation of each top and bottom zone of each hitter.

So, for example, if MLB gives me a bottom zone of 2.0 and a top zone of 3.8, and the pitch thrown was at a vertical position of 2.5, to adjust that to the league average zone of 1.6 and 3.4, I would use this formula…

(2.5 + (((3.4 – 3.8) + (1.6 – 2.0))/2))

… so my adjusted vertical position would be 2.1. Unfortunately, there isn’t much way to tell the “right” horizontal position, besides splitting it up by handedness.
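As a rough sketch, the adjustment above can be written as a small function. The function and parameter names are my own choices for illustration; the constants and the formula are the ones quoted in the comment.

```python
LEAGUE_BOTTOM = 1.6  # league average bottom of the zone, in feet
LEAGUE_TOP = 3.4     # league average top of the zone, in feet


def normalize_pz(pz, sz_bot, sz_top):
    """Shift a pitch's vertical position from a batter's own zone
    onto the league average zone.

    The shift is the average of the top-to-top and bottom-to-bottom
    offsets, which is the same as the offset between zone midpoints.
    """
    shift = ((LEAGUE_TOP - sz_top) + (LEAGUE_BOTTOM - sz_bot)) / 2
    return pz + shift


# The worked example from the comment: zone of 2.0 to 3.8, pitch at 2.5.
print(round(normalize_pz(2.5, sz_bot=2.0, sz_top=3.8), 2))  # 2.1
```

Note that this only slides the pitch up or down. It does not rescale for batters whose zone is taller or shorter than the league average 1.8 feet.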

J.R.

14 years ago

David Weathers’s ERA+ over the last 10 years:

148, 177, 137, 136, 103, 108, 132, 130, 139, 113

One of the worst pitchers in baseball, indeed.

Nick Steiner

14 years ago

J-R and Luke

Weathers’ FIP this year is just under 6. He’s not that bad, but for the purposes of this study, which is only looking at 2009 data, he fits the description of “one of the worst pitchers in baseball”.

J.R.

14 years ago

His FIP might be just under 6, but his tRA while with Cincy was 4.37. FIP ignores a whole lot of stuff, and given that it’s nowhere near any of his previous years, I’ll go with that as the outlier.

J.R.

14 years ago

His FIP might be just under 6, but his (StatCorner) tRA while with Cincy was 4.37. FIP ignores a whole lot of stuff, and given that it’s nowhere near any of his previous years, I’ll go with that as the outlier. His FIP and FanGraphs tRA are all over the place year-to-year, while his StatCorner tRA is much more consistent, and therefore (I would posit) more indicative of his true talent level.

John Beamer

14 years ago

Given the placement of some of those dots either

a) There is measurement error, or

b) Some umpires should be fired immediately

Good work.

ChuckO

14 years ago

Since you are not measuring how umpires feel about games, that should be “effect”, not “affect”.

Nick Steiner

14 years ago

Crap, fix’d.

Nick Steiner

14 years ago

JR- I just checked Statcorner. He really has done a good job of avoiding Line Drives!

For what it’s worth, FanGraphs has his tRA with the Reds at 5.36 this year, using different batted ball classifications (one of the reasons that tRA remains somewhat unreliable in small sample sizes).

At any rate, he is 40 years old, and ZIPS projects a 4.50 FIP going forward, which is the definition of replacement level for a reliever. He may not be one of the worst pitchers in baseball, but he’s not good either.

Anyways, that line was a complete throwaway line made to compare him to Javier Vazquez and Mariano Rivera. I didn’t know David Weathers had so many fans at THT.

Moe

14 years ago

Nice work.
One comment though: you say they are especially bad on the third base side. It may make sense to distinguish between LHP and RHP and LHB and RHB. In other words, are they always bad on the 3rd base side or are they bad when a pitcher is pitching inside?

Nick Steiner

14 years ago

Moe – you’re almost certainly right, and I’ll definitely look into that. This post was mainly supposed to be a “feel” post, for a subject that I think is being overlooked.

If anyone else has any suggestions/criticisms, that would be great.

Luke

14 years ago

How can you call David Weathers one of the worst pitchers in baseball? His past 3 years have been pretty damn good as I remember…

Siward

14 years ago

Recent correlation studies show tRA and FIP to have approximately the same amount of error in determining pitching performance. Given sample size issues on batted ball data, I’m much more willing to trust FIP in these instances, which has shown David Weathers to be far, far worse.

Additionally, tRA shows Weathers to be essentially a replacement-level pitcher for every year of his career sans 2003. In any case, I don’t think this is a snub of Blyleven-level proportions.

Dan Novick

14 years ago

Due to the batted ball data tRA relies upon, I’m inclined to believe it’s less reliable in a small sample size than FIP is. Strikeouts, walks, and home runs all stabilize much more quickly than LD%, FB%, etc.

Mike Fast

14 years ago

Re John in the first comment, there are a nontrivial number of stringer errors in the data where the pitch type (Ball, Called Strike, In Play, etc.) is not matched with the correct set of PITCHf/x data, particularly if you download the data the night after the game was played.

Some portion of those errors will be corrected by MLBAM in the XML files on the Gameday site at a later time. However, some will remain. For instance, the first two pitches to Troy Tulowitzki here were in fact correctly called balls by the umpire, but the PITCHf/x data assigned to those pitches appears to actually belong to pitches #3 and #4 of that at bat, which do not have any PITCHf/x data listed.
http://gd2.mlb.com/components/game/mlb/year_2009/month_07/day_31/gid_2009_07_31_colmlb_cinmlb_1/inning/inning_9.xml

When you see missing PITCHf/x data for a few pitches as in this example, it’s a clue that the stringer got mixed up.
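That clue can be turned into a rough flagging pass. This is only a heuristic sketch, not a fix: it assumes each at-bat has already been parsed into a dict with an `id` and a list of pitch dicts, and that a pitch with no PITCHf/x record has `px` or `pz` missing or set to `None`. Those structures and names are hypothetical, and requiring "some but not all" pitches to be missing is my own choice.

```python
def has_location(pitch):
    """True if the pitch carries both horizontal and vertical location."""
    return pitch.get("px") is not None and pitch.get("pz") is not None


def suspect_at_bats(at_bats):
    """Return ids of at-bats where some, but not all, pitches lack
    location data, which may indicate mismatched stringer input."""
    flagged = []
    for at_bat in at_bats:
        missing = [not has_location(p) for p in at_bat["pitches"]]
        if any(missing) and not all(missing):
            flagged.append(at_bat["id"])
    return flagged


# Made-up at-bats, for illustration only.
at_bats = [
    {"id": 1, "pitches": [{"px": 0.1, "pz": 2.4}, {"px": -0.5, "pz": 1.9}]},
    {"id": 2, "pitches": [{"px": 0.2, "pz": 2.6}, {"px": None, "pz": None}]},
    {"id": 3, "pitches": [{}, {}]},  # nothing tracked at all
]
print(suspect_at_bats(at_bats))  # [2]
```

Flagged at-bats could then be dropped or reviewed by hand before computing any call accuracy numbers. At-bats with no tracking at all are left unflagged here on the assumption that they point to an outage and not a mix-up.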

J.R.

14 years ago

Good points about the small sample size issue re tRA. However, I was under the impression that tRA* was supposed to regress all the random stuff that pitchers, as a whole, have no control over to the league average. Weathers’s tRA* of 4.27 is actually a shade better than his tRA which suggests that it’s not necessarily fluky. His xFIP is also significantly better than his straight FIP, suggesting the same.

Mike Fast

14 years ago

If you want a contrary example of an umpire just missing a call, look at the 1-0 pitch to Orlando Cabrera in the first inning of the July 30, 2009 game. It’s right down the middle, and the umpire calls it a ball. Oddly, neither Lester nor the catcher object at all. Cabrera offers bunt and then pulls the bat back; perhaps that confused the umpire.

http://gd2.mlb.com/components/game/mlb/year_2009/month_07/day_30/gid_2009_07_30_oakmlb_bosmlb_1/pbp/pitchers/452657.xml

Another example of a bad miss by an umpire is the 1-1 pitch to Hunter Pence in the top of the first inning May 14, 2009. It’s right down the middle but called a ball, but again nobody appears to argue, and again Pence made a check swing, which perhaps confused the ump?

http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_14/gid_2009_05_14_houmlb_colmlb_1/inning/inning_1.xml

Nick Steiner

14 years ago

So Mike, how should I handle that? Is there any way to delete the bad data, or adjust it somehow?

Mike Fast

14 years ago

Nick, I wish I had a good answer for your question. I have been thinking about a good way to weed out the problem data, but don’t have one right now.

Jonathan Hale

14 years ago

I know it’s a bit of a nightmare, but don’t we (did you?) have to make some sort of correction for curves bending ‘around’ the plate? I think I remember a Josh Kalk article on THT saying that could account for up to two inches of seemingly bad calls from umps.

Nick Steiner

14 years ago

That makes sense Jonathan. I didn’t make a correction for curves, I just adjusted for batter height and threw up a rough approximation of a strike zone.

Right now I’m working on gauging the “correct” strike zone. Splitting it up by pitch type and batter hand would certainly help. Then I could make corrections on each pitch from there. And then there is the bad data that Mike mentioned. Ugh… this is gonna take a while.
