Accusing someone of being racist is a bold and emotive claim. But that is exactly what Time does in writing up a study from Daniel Hamermesh, a respected professor at the University of Texas at Austin, who released an academic paper on the subject. To be fair to Daniel, he doesn’t use the word racist anywhere in the study. In economist lingo the term is racial bias: umpires are more generous in their calls to pitchers who are of the same race. Perhaps the difference is that racial bias is a subconscious act, whereas racism is an overt choice, but either way this has the potential to be an inflammatory issue.
Before I go on it is worth acknowledging the slew of blogs and individuals that have contributed to the debate and on which I draw a lot for this column. First, Phil Birnbaum at Sabermetric Research has a couple of extremely lucid posts on the subject. Phil edits SABR’s By the Numbers journal, and his blog is one of the most accessibly written on statistical matters. A lot of the studies I refer to originated from Phil. Also MGL of The Book Blog has tried to replicate the Hamermesh result but through different techniques. During this column I’ll refer to his work too. Also I’ll draw on the expertise of some of the great minds of sabermetrics; people like Guy, Tango and Pizza Cutter have all thrown their tuppence worth into the ring and added significantly to the debate.
We’re done with the bluster; let’s investigate the study in a little more detail.
In summary, Hamermesh et al look at pitch-by-pitch data from ESPN for every MLB game between 2004 and 2006—a total of 2.1 million pitches. For each pitch they recorded the outcome (eg, called strike, swinging strike, hit by pitch, ball in play) as well as other game information such as the pitcher (including his line), umpire crew, team standings, attendance, mascot name etc.—you get the idea. The pitchers and umpires are classified according to race as either White, Black, Asian or Hispanic. A combination of databases and image searching is used for this classification, which seems a reasonable method.
The authors then did a couple of things to try to detect racial bias. First they tabulated the data by umpire and pitcher race to see if any clear differences popped out. Then they ran some complicated regression models to try to tease out hidden relationships, pin down firm results (statistical significance in the trade) and identify the implications for the game (baseball significance).
So, what did the study find? First, let me present the simple data tabulation:
Summary of Umpires’ Calls by Umpire-Pitcher Racial/Ethnic Match (columns give pitcher race)

| Umpire race | Measure | White | Hispanic | Black | Asian | TOTAL |
| --- | --- | --- | --- | --- | --- | --- |
| White | Pitches | 1,388,318 | 445,107 | 47,797 | 56,866 | |
| White | Called strikes (%) | 32.06 | 31.47 | 30.61 | 31.97 | 31.89 |
| Hispanic | Pitches | 45,603 | 13,737 | 1,552 | 1,406 | |
| Hispanic | Called strikes (%) | 31.91 | 31.8 | 30.77 | 30.43 | 31.81 |
| Black | Pitches | 87,170 | 26,054 | 3,377 | 3,179 | |
| Black | Called strikes (%) | 31.93 | 30.87 | 30.76 | 30.19 | 31.62 |
| TOTAL | Called strikes (%) | 32.05 | 31.45 | 30.62 | 31.84 | 31.87 |
The authors use called strikes to detect possible racial bias by the umpire. A called strike is a subjective ruling, so the theory is that if a white ump favors a white pitcher he is more likely to call pitches on the margin of the zone a strike rather than a ball. The table is a bit busy, but we can infer a few things from the data. First, there are no Asian umpires in major league baseball (hence the lack of data). Second, white umpires and white pitchers are in the majority, accounting for roughly 90% and 70% of the pitches in the table respectively, which, as you will see, plays an important role when interpreting some of the authors’ conclusions. Third, whatever the race of the umpire, white pitchers appear to throw more strikes than non-white pitchers do.
That’s an important finding. White hurlers throw more strikes. Look at the data. Black umps call 31.93% strikes when a white pitcher is on the mound compared to 30.76% when the pitcher is black! Accepting the raw percentages, the pitchers who fare worst are Asian pitchers facing black umpires (30.19%), although that particular combo only accounts for about 3,000 of the 2,000,000 pitches recorded; at the other end of the spectrum white umps call strikes for white pitchers 32.06% of the time.
As well as looking at the raw numbers the authors ran a few regressions as is their wont. The benefit of a regression is that you can control for a bunch of different factors. The factors that are controlled are: pitch count, inning, home-field and game score.
Interestingly, when considering the race of pitchers and umpires separately (e.g., white on white, white on black etc.), the authors find no racial bias, which they attribute to sample size issues, particularly among non-white pitchers and umpires. Only when clumping all the data together is any effect observed. The most important finding is that when the pitcher and umpire are the same race a pitch is 0.34% more likely to be called a strike, which the authors claim is equivalent to just less than one pitch per game. In fact, with approximately 70 called pitches a game, 0.34% works out to about 0.24 strikes a game, or one strike every four to five games, but that is by the by. Think about that for a second. Is that evidence of racial bias? Is it significant in a baseball sense? We’ll come back to those questions later.
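The per-game arithmetic is easy to check. The sketch below uses only the two figures quoted above (the 0.34% same-race effect and roughly 70 called pitches a game); the variable names are illustrative and nothing here comes from the authors’ own code.

```python
# Back-of-envelope check using only figures quoted in the text.
same_race_effect = 0.0034      # 0.34% higher chance of a called strike
called_pitches_per_game = 70   # approximate figure from the text

# Expected number of extra called strikes in a single game.
extra_strikes_per_game = same_race_effect * called_pitches_per_game

# How many games it takes, on average, to accumulate one extra called strike.
games_per_extra_strike = 1 / extra_strikes_per_game

print(f"Extra called strikes per game: {extra_strikes_per_game:.2f}")
print(f"Games per extra called strike: {games_per_extra_strike:.1f}")
```

On these inputs the effect is a fraction of a strike per game, well short of the one pitch per game the authors suggest.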
The authors then look at a number of other factors. The first is whether the umpire was in a Questec park. Questec is strike zone recognition software that can grade an ump’s performance by assessing how accurately he called the strike zone. The theory is that in Questec parks we should see any evidence of racial bias disappear because the monitoring system makes umpires more vigilant. Eyeballing some of the data it appears that different race umpires actually call more strikes than same race umpires in Questec parks—in other words over compensating for the race effect!
A regression is used to try to add some statistical validity to this finding. First the authors find the “Questec effect”, that called strikes are less likely in those parks with Questec installed (0.66% per pitch). Then the authors looked at same race umpires and found that they called about 1% fewer strikes in Questec parks, which again points to racial bias in non-Questec parks.
I can sense you’re running out of gas so let me wrap up by sharing a couple of the remaining conclusions. The authors also found a link to attendance—the more watched the game the less the racial bias—and also to terminal pitches (where there are two strikes and/or three balls on the board). Again, in terminal counts there was less racial bias. The inference is that the more scrutinized the situation, be it Questec, the public or the media, the more the umps were likely to adjust their inherent tendencies.
Do umpires show racial bias?
The million dollar question! The Hamermesh study shouts a yes, but it is unclear. In my opinion the benefit of the doubt probably sits with the umpires unless we can categorically prove the contrary. Also the effect, if it exists at all, is so small as to not make any practical difference.
Phil Birnbaum’s analysis is probably the best refutation of the racial bias argument. Consider the called strike line in the table above. This gives us:

| Umpire \ Pitcher | White | Hispanic | Black | Asian | TOTAL |
| --- | --- | --- | --- | --- | --- |
| White | 32.06 | 31.47 | 30.61 | 31.97 | 31.89 |
| Hispanic | 31.91 | 31.8 | 30.77 | 30.43 | 31.81 |
| Black | 31.93 | 30.87 | 30.76 | 30.19 | 31.62 |
| TOTAL | 32.05 | 31.45 | 30.62 | 31.84 | 31.87 |

Phil’s argument, which is correct, is that if racial bias exists then we should be able to see it in this table. He hypothesizes that an unbiased table could have the same called strike percentage for pitchers of a given race, whichever umpire is behind the plate.

| Umpire \ Pitcher | White | Hispanic | Black | Asian | TOTAL |
| --- | --- | --- | --- | --- | --- |
| White | 32.05 | 31.45 | 30.62 | 31.84 | 31.87 |
| Hispanic | 32.05 | 31.45 | 30.62 | 31.84 | 31.87 |
| Black | 32.05 | 31.45 | 30.62 | 31.84 | 31.87 |
In this table white pitchers still hit the strike zone more often than pitchers of a different race do, but every umpire group calls the same strike percentage for a given race of pitcher, irrespective of the umpire’s own ethnicity. Phil calculates the number of pitches that need to be called differently to translate the top table to the bottom. The answer? Precisely 228, that’s how much. That’s out of 2 million actual pitches and 700,000 called pitches. Yes, just 228. It doesn’t sound like a big difference and as one standard deviation in the data is about 500 pitches, it assuredly isn’t. A quick way to verify this result is to run a chi-squared test. A chi-squared test compares an observed distribution to an expected distribution and reports whether they are similar or different. In this instance the test confirms there is no significant difference between the two matrices.
Of course there are other unbiased tables; for example, black umpires could consistently call a small strike zone across all pitchers. Phil shows that these tables aren’t statistically different from the actual data. Although we can’t find any statistically valid result using this approach, the table still hints at some racial bias (look at Asian pitchers vs black umps). It could be that controlling for other factors, as the regression does, allows us to peer behind the superficial numbers and detect racial bias; or it could be that there is something funny going on with the authors’ analysis.
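For readers who want to try the chi-squared comparison themselves, here is a minimal sketch in plain Python. It is not Phil’s calculation: the strike counts are rebuilt from the rounded percentages in the tables above, so the statistic is only approximate, and the choice of 8 degrees of freedom is my own assumption.

```python
# Pitch counts and called-strike percentages copied from the tables above.
# Rows are umpire race; columns are pitcher race (White, Hispanic, Black, Asian).
pitches = {
    "White":    [1_388_318, 445_107, 47_797, 56_866],
    "Hispanic": [45_603, 13_737, 1_552, 1_406],
    "Black":    [87_170, 26_054, 3_377, 3_179],
}
observed_pct = {
    "White":    [32.06, 31.47, 30.61, 31.97],
    "Hispanic": [31.91, 31.80, 30.77, 30.43],
    "Black":    [31.93, 30.87, 30.76, 30.19],
}
# The "unbiased" table: every umpire group calls the same rate for each pitcher race.
unbiased_pct = [32.05, 31.45, 30.62, 31.84]


def chi_squared_statistic():
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    total = 0.0
    for ump, counts in pitches.items():
        for n, obs, exp in zip(counts, observed_pct[ump], unbiased_pct):
            obs_strikes = n * obs / 100
            exp_strikes = n * exp / 100
            # Each cell contributes a called-strike term and an everything-else term.
            total += (obs_strikes - exp_strikes) ** 2 / exp_strikes
            total += ((n - obs_strikes) - (n - exp_strikes)) ** 2 / (n - exp_strikes)
    return total


# Assumption: 8 degrees of freedom, (3 umpire groups - 1) x 4 pitcher groups,
# for which the standard 5% critical value is 15.507.
CRITICAL_5PCT_DF8 = 15.507

statistic = chi_squared_statistic()
print(f"chi-squared = {statistic:.2f}")
print(f"significant at the 5% level: {statistic > CRITICAL_5PCT_DF8}")
```

Under these assumptions the statistic falls below the critical value, which is in line with the conclusion that the two tables are not significantly different.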
So how did the authors manage to show that there was a significant effect? There are some possible clues in the data. If you look closely the only racial bias appears to be among Hispanic and Asian pitchers, particularly when a black ump is behind the plate. This could be skewing the data to show racial bias as a general conclusion when it only really exists in patches, and these pockets of difference are based on relatively small sample sizes. Also there appears to be little difference in called strike distribution among the black and white pitchers, save that white pitchers generally have a higher called strike percentage overall (look at the table above).
Another clue is that when looking at pitcher and umpire race individually no effect can be found; only when the data are combined into same race and different race buckets is the effect revealed. Yes, the aggregation appears to be turning the regression significant. If you calculate a pitch-weighted average, the called strike percentage is 32.05% for same race umpires versus 31.5% for different race umps, a gap of about 0.55 percentage points. We know that white pitchers have more called strikes, so if we look at white hurlers against the rest the called strike percentages are 32.05% and 31.4% respectively, a gap of about 0.65 percentage points. That’s it. Most of the difference is explained by the fact that white pitchers have more called strikes and white umpires take charge behind the plate more often!
You’d expect the regression to capture the result, so why doesn’t it? That is still a little unclear. There are a couple of likely explanations. One, by aggregating the data the sample size becomes large enough for a minuscule effect to be teased out. Or two, the regression is somehow failing to control for race properly.
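The two comparisons can be reproduced directly from the summary table. The sketch below is my own illustration: it weights each cell’s called strike percentage by its pitch count, and the helper names (`weighted_pct`, `cells_where`) are made up for the example.

```python
# Rows are umpire race; columns are pitcher race, in the order given by `races`.
races = ["White", "Hispanic", "Black", "Asian"]
pitches = {
    "White":    [1_388_318, 445_107, 47_797, 56_866],
    "Hispanic": [45_603, 13_737, 1_552, 1_406],
    "Black":    [87_170, 26_054, 3_377, 3_179],
}
strike_pct = {
    "White":    [32.06, 31.47, 30.61, 31.97],
    "Hispanic": [31.91, 31.80, 30.77, 30.43],
    "Black":    [31.93, 30.87, 30.76, 30.19],
}


def cells_where(keep):
    """Yield (pitches, called-strike %) for every cell where keep(umpire, pitcher) is true."""
    for ump in pitches:
        for pitcher, n, pct in zip(races, pitches[ump], strike_pct[ump]):
            if keep(ump, pitcher):
                yield n, pct


def weighted_pct(cells):
    """Pitch-weighted average called-strike percentage over the given cells."""
    cells = list(cells)
    return sum(n * pct for n, pct in cells) / sum(n for n, _ in cells)


# Bucket 1: umpire and pitcher share a race versus everything else.
same_race = weighted_pct(cells_where(lambda ump, pitcher: ump == pitcher))
diff_race = weighted_pct(cells_where(lambda ump, pitcher: ump != pitcher))

# Bucket 2: white pitchers versus all other pitchers, ignoring the umpire.
white_pitchers = weighted_pct(cells_where(lambda ump, pitcher: pitcher == "White"))
other_pitchers = weighted_pct(cells_where(lambda ump, pitcher: pitcher != "White"))

print(f"same race {same_race:.2f}% vs different race {diff_race:.2f}%")
print(f"white pitchers {white_pitchers:.2f}% vs other pitchers {other_pitchers:.2f}%")
```

The point of running both splits side by side is that the gap between white and non-white pitchers is at least as large as the same race gap, which is what you would expect if the "same race" bucket is mostly standing in for white pitchers.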
So, which is it?
My hunch is that while there may be a tiny effect among some umpire groups, the authors’ regression hasn’t properly controlled for the race of the pitcher. Ninety-five percent-plus of the same-race umpire/pitcher pairs are white, so this variable is a proxy for white pitchers rather than same race umpire/pitcher combinations. To test this the authors could run the regression with pitcher = white as a variable and if the above theory is true they should get the same results. I suspect that controlling for pitcher race would make the claimed effects small to negligible. For the main regression result (the 0.34%) there is no individual pitcher or pitcher race control (although, strangely, this is controlled for in later regressions).
Another issue is that there are very few non-white umpires in the bigs. The authors have identified three Hispanic umpires and five black umpires compared to 85 white umpires. It could easily be that one anomaly among the non-white umpires skews the data. Drawing conclusions about racial bias on such a skewed sample is a little dangerous.
What about the other findings, for instance, that playing in Questec parks reduces the likelihood of racial bias among same race umpires? I suspect that the same issues highlighted above are in play. Questec is only in a small number of parks (11 out of 30) and teams and umpires will not be distributed randomly among them (home teams will be over-represented). The fact that we see over-compensation from the graphical data is also strange. Both these facts have to put the Questec conclusions in some doubt.
The attendance finding is also spurious. Why would a better attended game result in a less biased strike zone? If you’re in the crowd behind the plate can you really call the strike zone? No. And every game is televised—there is more scrutiny from that than there is from a few fans behind home plate whose view is obscured by the umpire.
If there is bias is it a big deal?
Not really. Guy, commenting at Baseball Think Factory made the following observation:
For example, a black pitcher gets the call 30.62% of the time, with these racial “disparities”: W Ump: 30.61, H ump: 30.77, B Ump: 30.76. So, if a black pitcher was judged by a same-race (black) ump 91% of the time (as white pitchers now are), he would gain one extra called strike for every 784 called pitches. A starter has about 50 called pitches per start, so a black starter might get 2 more strike calls in a season. Let’s be generous and say that results in 1 fewer walk (or 1 more K) each year (it’s probably less than that), in which case he would give up 1 additional run every 3 years or so.
In essence, even if you buy that an effect exists in a statistical sense, in a baseball sense it is practically irrelevant. If we turn our attention to the regression result, which concluded that a same race umpire called 0.34% more strikes the baseball effect is equally small—about 5 to 6 calls per season by Guy’s math.
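The per-season figure can be sanity-checked with a quick calculation. This is a rough sketch: the 0.34% effect and the 50 called pitches per start come from the text, while the 33 starts per season is my own assumption for a full-time starter.

```python
# Figures quoted in the text.
same_race_effect = 0.0034       # 0.34% more called strikes with a same race umpire
called_pitches_per_start = 50   # from Guy's comment

# Assumption (not from the text): a full-time starter makes about 33 starts a year.
starts_per_season = 33

# Expected extra called strikes over a full season of starts.
extra_calls_per_season = same_race_effect * called_pitches_per_start * starts_per_season

print(f"Extra called strikes per season: {extra_calls_per_season:.1f}")
```

Changing the assumed number of starts moves the answer a little, but it stays at a handful of calls per year.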
Is a number of this magnitude even worth discussion? Can a pitcher really tell that an opposite race umpire is penalizing him with a tighter strike zone? I don’t know about you but in my book one called strike out of 300 is indiscernible. If that one strike is converted into a ball the swing in run expectancy is somewhere around 0.1 runs, which if you add it all up is about a win every 30 years or so.
One of the issues with regression analysis is that it is a bit of a black box. You pop a shed-load of data and control variables in and see what you get out and from that try to infer a conclusion. As a researcher you often end up running a ton of different regressions controlling for this and that before getting the result you want. This has a couple of problems. One, if you are operating at a 5% significance level that means that one in 20 of these studies will produce a false result by chance, which is quite high. And two, as we have seen it can be very difficult to tease apart real effects in the data and to assert baseball significance (rather than statistical significance).
MGL, of The Book Blog, tried to replicate the authors’ study using Retrosheet data (as opposed to ESPN data) but by making a series of adjustments rather than running a regression. I won’t bore you with the technical details but MGL’s conclusions are that racial bias among umpires, if it exists, is small enough to not worry about. If there is any racial bias then it is most likely to be among the Hispanic umpire group. Unfortunately, this group only contains two umpires and erratic performance by just one (race-based or otherwise) can dramatically skew the results. (Note: MGL could not identify the third Hispanic umpire in the Hamermesh et al. study.) This is in line with our analysis above.
This study has provoked a lot of reaction, particularly among the analytical baseball community. To accuse someone of racism, or racial bias, is a grave statement and one must be sure that there is sufficient and rigorous evidence to back up that stance. If there is any reasonable doubt about that claim then the benefit should be with the accused unless proved otherwise. Once such an accusation has been leveled in front of the fans and general public the mud sticks.
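To put a number on the first problem, the sketch below shows how quickly the chance of at least one spurious "significant" result grows as more regressions are run. It assumes the regressions are independent and that there is no real effect in any of them, which is a simplification; the function name is made up for the example.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    comes up significant at level alpha when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests


# Illustrative numbers of regressions a researcher might try.
for num_tests in (1, 5, 10, 20):
    chance = chance_of_false_positive(num_tests)
    print(f"{num_tests:>2} regressions: {chance:.0%} chance of at least one false positive")
```

In practice regressions run on the same data are not independent, so treat this as a rough guide to the direction of the problem rather than a precise figure.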
In this study a few things are clear:
- There is no discernible bias that we can see from white umpires to any race of pitcher. The majority if not all of the difference in call rates can be explained by the fact that white pitchers record more strikes than non-white pitchers. Also the historically contentious white/black interaction shows no evidence of bias.
- There may actually be racial bias but if there is then we can’t detect it in the data. For example, in reality it might be the case that white pitchers are worse than non-white pitchers at hitting the strike zone but that all umpires are biased towards white pitchers. If this is the case this study will not detect it.
- There may be some small bias from Hispanic umpires. However, there are only two to three Hispanic umpires in the bigs so it is impossible to draw broad brush conclusions.
- The Questec data is more or less meaningless because we start to hit small sample size issues, particularly regarding the quality of pitchers in Questec parks when non-white umpires are adjudicating.
- That the race of the batters had no impact on the umpires’ calls is more evidence that bias probably doesn’t exist.
In order to get a firm grip on the issue some new and different studies need to be done. First, this work should be replicated for different years. The authors covered the period 2004-2006 but the Retrosheet data (with good pitch-by-pitch logs) are available for the previous 10 years at least. Second, we can use Gameday data to uncover any real bias in the data.
Gameday can tell us whether a pitch was a strike or a ball and then we can see how the umpire called it—this will allow us to understand the exact bias and the direction of the bias.
As things stand we are in an uncomfortable position. As far as the general public and baseball fans are concerned there is now evidence of some racial bias among umpires. Moreover the authors have exaggerated the impact of that perceived bias on the outcome of the game. That this information has been plastered all over the mass media will popularize that view.
The burden of proof is on the authors to justify these claims and at the moment the weight of evidence suggests that the authors have got it wrong. And that is bad for everyone: the players, umpires and baseball in general.
References & Resources
A big thanks to everyone who has contributed on this topic but in particular: MGL, Tom Tango, Phil Birnbaum, Guy M and Pizza Cutter.