I don’t have too many complaints about baseball’s replay review system. It’s not great that it leads to a lot of standing around, though I find the manager/umpire banter before the video coordinator gives a thumbs up/thumbs down pretty funny, and it involves challenges rather than using strictly booth review, but those have turned out to be small issues relative to the benefit of fixing some important calls.
There is one thing, though, that I find pretty unsatisfying about the replay system. Section III of the instant replay rules says that in the absence of “clear and convincing evidence” the original ruling should stand, and Section II.J.3 spells out the three possible outcomes: a call is confirmed or overturned when such evidence exists, and “stands” when the review is unclear. This is the part of the actual replay review mechanism (as opposed to the system for triggering reviews) that might need an overhaul, and I think that overhaul, handled properly, could provide additional transparency, which baseball could always use a bit more of.
On its face, allowing a call to stand seems pretty unobjectionable. After all, the umpire on the field is an expert at these sorts of things and got to see the play up close, so we should give him the benefit of the doubt. In practice, though, that means you end up with calls like these:
Those aren’t cherry-picked; in fact, they’re three challenges with a result of “call stands” selected at random. If you made me guess, I’d say the calls on the field were wrong, right, and wrong, respectively, though I certainly wouldn’t claim any of those are conclusive. These calls exemplify two problems with the current system. First, having calls that look wrong but stand is a problem. It makes the umpires look bad (unfairly), it makes the broadcasters look bad, it upsets fans and players, and it undermines everyone’s faith in the system a little bit more. (It could also be a source of substantial controversy. Think about what would have happened had the original call on Eric Hosmer’s double play in Game 7 of this year’s World Series been upheld due to lack of evidence.)
The second issue is revealed in the second clip, where, as Vin Scully points out, it’s unlikely that the ump actually saw the foot off the base; put another way, he didn’t have a better angle than the camera did. Given that an umpire’s hidden expertise is at least part of the basis for letting the call stand when replay is inconclusive, we should probably reconsider whether it’s actually the best method in light of results like that. With 361 inconclusive challenges last year, or approximately one every seven games, this is a real area for improvement.
In my eyes, a replay system should have three goals: it should correct as many calls as possible, it should be quick, and it should be transparent, or failing that, simple. (It should also be reasonably consistent, in the sense that two independent reviewers would come to the same conclusion most of the time, but that goes along with correctness and simplicity.) The current system, when facing a borderline call, is reasonably transparent and simple, but it’s not correcting as many calls as it could be. The three calls above might not have been wrong (though I think two of them were), but some of the ones that stand must have been. Whether the current system is fast enough is a matter of opinion, but it’s clear that the borderline reviews are slower than others. Depending on how you choose to measure it (mean or median, with or without controlling for type of call and initiator of challenge), the typical call that stands takes 40 to 50 seconds longer to review than a call for which there is conclusive evidence. It’s thus clear that, at least in theory, there’s some room for improvement.
One way of changing how inconclusive replays are resolved could be called population-based resolution (PBR), which would entail predicting the probability that a given call was correct based on certain characteristics of the population of replay challenges, then drawing a random number to determine the result. For instance, let’s stipulate that there was an 80 percent chance that the call on the field in the second video was correct. Ignoring for a moment how we got that probability, the ump in New York would use a computer to pick a random number such that 80 percent of the time the call stood, and the other 20 percent it was overturned.
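The random draw described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `resolve_inconclusive` and its parameters are hypothetical, and the 80 percent figure is the stipulated example from the text, not an estimate from real data.

```python
import random


def resolve_inconclusive(p_call_correct, rng=random):
    """Resolve an inconclusive review by a weighted coin flip (hypothetical helper).

    p_call_correct is the assumed probability that the call on the field
    was right. The call stands with that probability and is overturned
    otherwise.
    """
    if not 0.0 <= p_call_correct <= 1.0:
        raise ValueError("p_call_correct must be between 0 and 1")
    # rng.random() returns a float in [0.0, 1.0), so the comparison is
    # true with probability p_call_correct.
    return "stands" if rng.random() < p_call_correct else "overturned"


# Simulate many reviews at the stipulated 80 percent probability.
rng = random.Random(2014)  # fixed seed so the run is repeatable
results = [resolve_inconclusive(0.80, rng) for _ in range(10_000)]
share_standing = results.count("stands") / len(results)
print(f"Share of calls that stood: {share_standing:.3f}")
```

Over many reviews the share of calls that stand should settle near 80 percent, while setting the probability to 1.0 makes every call stand.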
PBR has some noticeable shortcomings, both in principle and in practice. There’s the obvious issue of how the probabilities would be determined, but what’s even more important is that while PBR would be overturning the right number of calls, it probably wouldn’t overturn the right calls. The reason I suggest it at all is that this is a more general case of the current method used to adjudicate inconclusive cases.
The current system can be interpreted as follows: “The call was inconclusive, so we are assuming it is like all other calls we didn’t overturn and has a 100 percent chance of being correct.” The proposed population-level inference could be viewed similarly: “The call was inconclusive, so we are assuming it is like all other force plays where the challenge was initiated by the umpire and has a 64 percent chance of being correct on the field.” From the perspective of getting a call correct, the current system is just a very blunt form of PBR, which is why it seems to me like a frustrating abdication of the purpose of replay.
Thankfully, though, there’s what I think is a much superior option, which is to lower the standard of evidence. Instead of requiring that the replay provide “conclusive evidence” that the original call was wrong, just instruct the replay official to pick whichever call looks better from the replay (a “preponderance of the evidence” standard). The reviewer can consider whatever is necessary, so if the field umpire had a good view, that will count, but if the original angle was poor, the call on the field can be disregarded. (In a perfect world, it’d be great to do the review as blindly as possible—no knowledge of the direction of the original call or who made it—but it’s unlikely that all of that can be consistently edited out of video in enough time to keep replays proceeding at an appropriate pace.)
How does this rate on the three criteria I mentioned above? Unfortunately, it’s hard to say right now; fortunately, it wouldn’t be very hard for the commissioner’s office to study. Bring the umps to the league office in New York, and have 10 umpires look at each call that stood for lack of conclusive evidence plus some additional conclusive calls for benchmarking. That might seem like a lot of work, but if you do the math, it works out to a bit less than a business day of reviewing per ump, which is hardly an extreme expense.
These data would tell us (or really, the league) several important things. They would provide estimates of how many calls this would actually affect over the course of a season, how long these reviews would take, and how reliable the umps are when forced to make a call about 50-50 reviews (which they currently aren’t forced to do). These estimates are essential in figuring out how to rejigger the current replay system.
My suspicion, based on little but guesswork, is that the new system would be a few seconds slower on average and a bit less consistent (meaning that different umpires would disagree about calls more frequently), but would probably increase the probability that a review yields the correct result by a substantial amount.
How substantial? We can do a rough calculation using two assumptions: the fraction of inconclusive reviews in which upholding the call was actually correct, and the probability that an umpire will make the correct determination under the new system. For instance, if 60 percent of the inconclusive reviews were of incorrect calls, then tacitly the current system makes the right call only 40 percent of the time. If a replay official will make the call correctly 80 percent of the time, then a non-obvious review is more likely to be correct by a margin of 40 percentage points (80 percent versus 40 percent), which would correspond to roughly 150 extra correct calls last year.
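The arithmetic behind that estimate is easy to rerun with other assumptions. The sketch below is a back-of-the-envelope helper, not a model: the function name `extra_correct_calls` is hypothetical, the 60 percent and 80 percent inputs are the illustrative guesses from the paragraph above, and 361 is the count of inconclusive challenges cited earlier.

```python
def extra_correct_calls(n_inconclusive, frac_call_wrong, reviewer_accuracy):
    """Estimate the gain from forcing a decision on inconclusive reviews.

    n_inconclusive:    number of reviews that ended in "call stands"
    frac_call_wrong:   assumed share of those calls that were actually wrong
    reviewer_accuracy: assumed chance a reviewer picks the right call
    """
    # Letting every call stand is right only when the original call was right.
    current_accuracy = 1.0 - frac_call_wrong
    # Improvement in the share of correct outcomes, as a fraction.
    margin = reviewer_accuracy - current_accuracy
    return margin, margin * n_inconclusive


margin, extra = extra_correct_calls(361, 0.60, 0.80)
print(f"Margin: {margin * 100:.0f} percentage points")
print(f"Extra correct calls: {extra:.0f}")
```

With these inputs the margin is 40 percentage points and the gain is about 144 calls, which the text rounds to roughly 150. If the share of wrong calls were instead below 20 percent, the same 80 percent reviewer accuracy would make outcomes worse, which is why the inputs need to be measured rather than guessed.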
It is, of course, also possible that this proposal wouldn’t work so well. If the new review standard requires longer reviews, inconclusive plays are hard to reliably assess, or most of them are found to be correct calls, then the gains won’t be in line with my guess above (or, if the drawback is speed, won’t be worth the additional delay). In that case, the current system can be left in place, and everyone can at least know that other options have been considered.
Ultimately, though, there’s no use substituting speculation for research. With results in hand (and ideally publicly released), it wouldn’t be too hard for the league to crunch some numbers and figure out the costs and benefits of a more nuanced replay system. It’s not a pressing issue, but it’s an area of the game that’s reasonably straightforward to improve, and unlike other pushes for umpiring transparency, it wouldn’t involve the criticism of individual umpires. As the league keeps trying to improve its umpiring, it’s an obvious place to start.
References & Resources
- Retrosheet’s Expanded Replay Usage data
- Baseball Savant’s MLB Instant Replay Database