“It is the mark of an educated mind to rest satisfied with the degree of precision which the nature of the subject admits and not to seek exactness where only an approximation is possible.” — Aristotle
Earlier this season on my blog I reviewed the book Three Nights in August: Strategy, Heartbreak and Joy Inside the Mind of a Manager by Buzz Bissinger (yes, I realize it was a bit tardy), which chronicles a three-game Cubs/Cardinals series played in August 2003 from Cardinals manager Tony LaRussa’s perspective. I had special interest in the series since I happened to be in St. Louis on business that week and attended the dramatic third game the book leads up to. I won’t spoil it for those of you who don’t remember or haven’t read the book.
One of the passages I found particularly interesting was related to how LaRussa uses individual batter/pitcher matchup data. His basic philosophy was explained in the following passage.
“La Russa pays special attention to the individual matchups, an essential ingredient of his approach to managing … The term bench player doesn’t really apply to the Cardinals, because LaRussa so frequently plugs utility players into the lineup based on little opportunities he unearths by sifting through the results of their previous experience with players on the opposing team. These individual matchups are so integral to his strategy that he copies them onto 5-by-7-inch preprinted cards that managers normally use to make out the game’s lineup. With ritualistic precision, he folds the cards down the middle 10 minutes before game time and then slips them into the back pocket of his uniform. During a game, he pulls them out continually, almost like worry beads, peering at them as if in search of evidence that everything is fine, that he is doing exactly what he needs to be doing. More practically, he refers to them when deciding who to bring on in relief or who may be the best matchup to pinch-hit.”
Bissinger notes that La Russa knows that matchups aren’t foolproof but still …
“There are some hitters who, never mind their mediocre batting averages, simply tag the living crap out of some pitchers. Conversely, there are pitchers, despite soggy ERAs, who simply do well against particular high-stroke hitters.”
After then digressing about the roles human nature and psychology play in these matchups, Bissinger re-emphasizes the roles they play in the mind of LaRussa.
“Of all the hours spent preparing before a game, many of them LaRussa spends searching for the explanations of these matchup numbers, a slide of seemingly buried narrative that during a season can single-handedly change the outcome of the four or five games that—in La Russa’s estimation—a manager can change.”
What I found interesting in this entire discussion was the omission of the three most important things that leap immediately to mind when I think about matchups—sample size, sample size and sample size.
And that got me to thinking how one might measure whether a particular matchup is statistically significant. In other words, when LaRussa looks at his index cards, how does he know whether the 6-for-26 performance of Aramis Ramirez against Chris Carpenter over the past three seasons is simply Carpenter getting a little lucky against a good hitter or whether Ramirez really has trouble picking up Carpenter’s sinker? Or when he’s picking a pinch hitter and sees 2 for 13 for John Mabry against Greg Maddux, is that enough to choose someone else?
Obviously, LaRussa has other information at his disposal with which to make his decisions, including an understanding of Carpenter’s repertoire and the voluminous charts that pitching coach Dave Duncan keeps, as mentioned by Bissinger and detailed by George Will in Men At Work: The Craft of Baseball. But those of us a little more on the outside might use other tools to try and answer the question.
Enter the Mathematicians
While pondering this question as the season wound down, I remembered that Ken Ross in his book, A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans, had written about binomial distributions and p-values and related them to batting average in his chapter on assessing streaks. A note on SABR-L from Mike Huber, an associate professor of mathematics at West Point, asking basically the same question I had been thinking about, then served as the impetus for this article.
If it’s been a while since you cracked open your Statistics 101 text from college, a p-value is the probability of an “observed event or any more extreme and surprising event.” For example, a p-value could be calculated that would indicate the probability that Cal Ripken would hit .340 or better in 1999, given his career .276 average prior to 1999. Ross does this in his book and calculates the p-value for such an event at .0052 or one-half of one percent. Obviously that’s a small probability, and to a statistician a p-value of less than around .05 or 5% is an indication that something more interesting is going on. In other words the event, in this case Ripken’s .340 average, is statistically significant. Of course there is nothing magical about .05 as the cutoff for significance, and in fact Ross notes that anything between .02 and .08 would cause him to look twice at the event.
There are two more pieces to the puzzle here, however, that need to be considered. First, in order to calculate the p-value, a statistician employs a model—a set of assumptions—that enables him to paint a mathematical picture of the situation he’s trying to study. In the case above Ross assumed that each Ripken at-bat approximated a Bernoulli trial (named after the Swiss mathematician Jakob Bernoulli, 1654-1705). Each Bernoulli trial makes the following assumptions:
“a) Each time there are two possible outcomes, traditionally called success and failure.
b) There is a fixed probability p of success each time.
c) The events are independent.”
As you can imagine from the assumptions this model is often referred to as a coin-tossing model, with the only difference being that when flipping coins we assume the fixed probability (p) is .50 or 50%.
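To make the coin-tossing picture concrete, here is a minimal Python sketch that treats each at-bat as a Bernoulli trial with a fixed chance of a hit. The function name `simulate_at_bats` and the seed are illustrative assumptions, not anything taken from Ross's book.

```python
import random

def simulate_at_bats(p, n, seed=None):
    """Count hits in n independent at-bats, each a hit with fixed probability p."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() < p)

# A hypothetical .276 hitter given a 332 at-bat season.
hits = simulate_at_bats(0.276, 332, seed=1)
print(hits, round(hits / 332, 3))
```

Running it with different seeds gives a feel for how far a season average can wander from .276 on luck alone.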
The outcomes of Bernoulli trials can then be analyzed using a binomial probability model or binomial distribution. The binomial distribution function is used to calculate the p-value given the number of trials, the number of successes and the probability of success.
Now of course each trial in baseball (an at-bat) does not satisfy the three criteria of Bernoulli trials. In particular there isn’t a fixed probability of success on each trial because weather, injuries, game situation, opposing pitcher and a host of other factors complicate things. The best we can do is to assign a probability, such as career batting average, as our best estimate of p. In addition, at-bats are not independent. As Bissinger notes in Three Nights, a batter’s mindset when facing a particular pitcher, because of his last at-bat against that pitcher or even his last at-bat in the current game, may have much to do with the outcome.
With that said, in the end statisticians employ models, albeit imperfect ones, in order to provide a basis for the study of real-world events. When they find low p-values, that’s an indication that the assumptions of the model may not hold for a particular set of trials. And those are the sets of trials that together make up a statistically significant event.
For the analysis in this article I’ll assume that Bernoulli trials and a binomial distribution are a good proxy for at-bats at the major league level, following in the footsteps of Ross and Jim Albert, co-author of Curve Ball: Baseball, Statistics, and the Role of Chance in the Game, who uses it as the model in his book Teaching Statistics Using Baseball.
The second piece of the puzzle brings us back to sample size. It is intuitive that in the case of Ripken’s 1999 season mentioned above, Cal would have a greater probability of hitting .340 or above in 50 at-bats than he would in 600 at-bats if his “true” average were below .340. The reason is that he’ll be more likely to get lucky in those 50 at-bats where seeing-eye singles, bloopers, bleeders, squibbers and Texas Leaguers will have a larger relative impact. The binomial distribution function takes sample sizes into account when calculating the p-value. In other words, with fewer trials the p-value, all other things being equal, goes up, and the chances of the event being statistically significant go down.
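The sample-size effect can be checked directly. The sketch below assumes the binomial model and a "true" average of .276, and compares the chance of hitting .340 or better over 50 at-bats with the chance over 600. The helper names are made up for illustration, and the sums are done in log space only to avoid tiny floating-point values.

```python
from math import exp, lgamma, log

def binom_pmf(n, k, p):
    """Probability of exactly k hits in n at-bats (computed in log space)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

def prob_at_least(n, k, p):
    """Probability of k or more hits in n at-bats."""
    return sum(binom_pmf(n, i, p) for i in range(k, n + 1))

TRUE_AVG = 0.276
results = {}
for at_bats in (50, 600):
    # Smallest whole number of hits that yields a .340 average (integer ceiling).
    hits_needed = (340 * at_bats + 999) // 1000
    results[at_bats] = prob_at_least(at_bats, hits_needed, TRUE_AVG)
    print(at_bats, hits_needed, results[at_bats])
```

Under these assumptions the 50 at-bat probability comes out hundreds of times larger than the 600 at-bat one, which is the intuition described above.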
Putting it all together, we can calculate a p-value and binomial distribution for Ripken’s 1999 season given that he had 113 hits (successes) in 332 at-bats (trials) given a fixed probability of .276 using Microsoft Excel’s BINOMDIST function. That function can be used to produce the following graph of the distribution.
Each point in the graph represents the probability of Ripken attaining that particular batting average given the assumptions of the model. As you’ll notice, the probability of Ripken hitting exactly .265 or .277 is only around 5%. However, the cumulative probability of him hitting over or under a certain average is represented by the area to the right or left of the average on the x-axis. The p-value that Ross calculated therefore represents the shaded area to the right of .340 in the graph below.
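For readers without Excel at hand, the same calculation can be sketched in Python. This is only an approximation of the spreadsheet approach described above, using the article's inputs (113 hits, 332 at-bats, p = .276). The function names are illustrative assumptions.

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k hits in n at-bats."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail_p_value(n, k, p):
    """Probability of k or more hits in n at-bats."""
    return sum(binom_pmf(n, i, p) for i in range(k, n + 1))

AT_BATS, HITS, CAREER_AVG = 332, 113, 0.276

# Height of the distribution at its most likely hit total.
peak = max(binom_pmf(AT_BATS, k, CAREER_AVG) for k in range(AT_BATS + 1))
# Area to the right of .340 (113 or more hits).
ripken_p = upper_tail_p_value(AT_BATS, HITS, CAREER_AVG)

print(round(peak, 3), round(ripken_p, 4))
```

The tail probability should land in the neighborhood of the .0052 Ross reports. Small differences are possible depending on how the career average is rounded.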
Getting back to the question at hand, I wanted to see how many and which batter/pitcher matchups would be considered statistically significant in order to get a feel for how seriously one should take the matchups that are often reported by announcers. To do so I examined play-by-play data for 2003 through 2005. From that data I found all the batter/pitcher matchups where the batter had 50 or more total at-bats in the three-year period and where the matchup yielded five or more at-bats. This left me with 30,481 individual matchups.
To use the binomial distribution we then need to calculate the probability of success (p) for each hitter. Although I could have used the batting average of the hitter over the three-year period, I chose instead to employ the “log5” formula Bill James introduced in his 1981 Baseball Abstract. That formula takes into consideration not only the batting average of the hitter (BAVG) but also the batting average against the pitcher (PAVG) and the league context (LgAVG) to calculate an Expected Average (ExAvg).
ExAvg = ((BAVG * PAVG) / LgAVG) / (((BAVG * PAVG) / LgAVG) + (((1 - BAVG) * (1 - PAVG)) / (1 - LgAVG)))
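The formula translates directly into a few lines of Python. This is a sketch: the function and argument names are my own, and the sample inputs are the Ramirez and Carpenter figures quoted later in the article (.296, .237 and a league average of .266).

```python
def log5_expected_avg(bavg, pavg, lg_avg):
    """Expected average for a batter/pitcher matchup using the log5 formula."""
    hit_term = (bavg * pavg) / lg_avg
    out_term = ((1 - bavg) * (1 - pavg)) / (1 - lg_avg)
    return hit_term / (hit_term + out_term)

# Aramis Ramirez (.296) against Chris Carpenter (.237 allowed), league .266.
ramirez_vs_carpenter = log5_expected_avg(0.296, 0.237, 0.266)
print(round(ramirez_vs_carpenter, 3))
```

With these rounded inputs the result is roughly .265, within a point of the .264 cited in the wrap-up. One useful sanity check is that a pitcher who allows exactly the league average leaves the batter's own average unchanged.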
Dan Levitt wrote a nice article several years ago showing that this formula does a good job of predicting the outcomes for actual batter/pitcher matchups.
With the probability of success calculated all that was left was to run the numbers.
I calculated the p-value for each matchup and found that 956 of the matchups had p-values less than .05. In other words 3.1% of the matchups over the last three years would be considered statistically significant under the standard test used by statisticians. If the threshold is raised to .08 the number of statistically significant matchups goes up to 1,728 or 5.7%.
You may be wondering why fewer than 5% of the matchups had p-values under .05 when we would have expected there to be 10% (5% on both ends of the distribution) since that’s what a p-value of .05 means. That question bothered me until John Walsh pointed out the discrete nature of the binomial model when there are few trials (at-bats). It turns out that much of the reason has to do with the fact that of the sample of 30,481 matchups, over two-thirds consist of nine at-bats or fewer and just 3.5% include more than 20 at-bats. When there are so few trials, the probability of obtaining a p-value of less than .05 is actually less than .05 because the distribution is not a smooth curve. For example, in five at-bats you can calculate the following p-values for each number of hits given a probability of success of .266:
| H | p-value |
| --- | --- |
| 0 | 0.2130 |
| 1 | 0.3860 |
| 2 | 0.2798 |
| 3 | 0.1014 |
| 4 | 0.0184 |
| 5 | 0.0013 |
Of these six possibilities, only where the batter gets four or five hits is the p-value less than .05. So given 1,000 matchups of five at-bats we would expect about one matchup of 5 for 5 (.13%) and 18 of 4 for 5 (1.8%) to be significant. In my study there were 7,328 matchups of five at-bats, so I would expect 10 matchups where the batter went 5 for 5 and 134 where the batter went 4 for 5. In reality I found five matchups of 5 for 5 and 145 of 4 for 5, six of which had p-values over .05 because the expected average was over .350. This tracks pretty well with what we’d expect and is an indication that the model works pretty well.
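The five at-bat figures are easy to reproduce. In the sketch below (helper names are illustrative) the listed values appear to match the probability of exactly H hits at p = .266. The code also adds up the chance of landing on a significant high-hit outcome and scales the 4 for 5 and 5 for 5 probabilities to the 7,328 five at-bat matchups in the study.

```python
from math import comb

def prob_exact_hits(at_bats, hits, p):
    """Probability of exactly `hits` hits in `at_bats` at-bats."""
    return comb(at_bats, hits) * p**hits * (1 - p)**(at_bats - hits)

P, AB = 0.266, 5
exact = [prob_exact_hits(AB, h, P) for h in range(AB + 1)]
for h, prob in enumerate(exact):
    print(h, round(prob, 4))

# Only 4 or 5 hits falls below the .05 cutoff, so the chance of a
# "significant" high-hit result in five at-bats is their combined probability.
chance_significant = exact[4] + exact[5]

# Expected counts among the 7,328 five at-bat matchups.
expected_4_for_5 = 7328 * exact[4]
expected_5_for_5 = 7328 * exact[5]
print(round(chance_significant, 4), round(expected_4_for_5), round(expected_5_for_5))
```

The combined probability is a little under 2%, well short of 5%, which is the discreteness effect Walsh pointed out.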
You can imagine, however, that in the real world the model may break down a bit at the extremes. We might postulate that this is due to the fact that both hitters and pitchers learn as they face one another repeatedly, which may well restrict the most extreme values on both ends. Also strategy (the reason LaRussa has his cards after all) dictates that extreme matchups be avoided both by the offense and defense through the use of relief specialists and pinch hitters.
Even at the .08 threshold that still means that on average less than one out of 17 of the matchups (given three years of data) written on LaRussa’s index cards each game are relevant in the sense that they may reveal information that the calculated expected average doesn’t (at least in terms of batting average). And as you can imagine, the problem only becomes worse when you consider that we’re using an expected average based on just three years’ worth of data, and that any batting average is merely an approximation of a hitter’s true ability (more on that later).
That said, let’s take a look at which matchups are considered the most statistically significant—in other words, those where we can make the argument that the model doesn’t hold and that there is something else going on that allows a particular pitcher to perform well against a particular hitter or vice versa.
First, we’ll take a look at the 25 most statistically significant matchups (lowest p-values) for “low-hit” matchups.
| Batter | Pitcher | AB | H | Avg | 3 Yr Avg | ExAvg | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Garret Anderson | Brian Anderson | 22 | 0 | 0.000 | 0.300 | 0.335 | 0.00013 |
| Bill Mueller | Mike Mussina | 23 | 0 | 0.000 | 0.303 | 0.301 | 0.00027 |
| Rondell White | Jake Westbrook | 19 | 0 | 0.000 | 0.289 | 0.288 | 0.00159 |
| Alfonso Soriano | John Lackey | 26 | 1 | 0.038 | 0.280 | 0.285 | 0.00188 |
| David Ortiz | Bartolo Colon | 18 | 0 | 0.000 | 0.297 | 0.285 | 0.00239 |
| Carlos Lee | Jeff Suppan | 17 | 0 | 0.000 | 0.287 | 0.291 | 0.00287 |
| Hideki Matsui | Aaron Sele | 14 | 0 | 0.000 | 0.297 | 0.336 | 0.00327 |
| Bobby Abreu | Mike Hampton | 28 | 2 | 0.071 | 0.296 | 0.303 | 0.00345 |
| Ivan Rodriguez | Jon Garland | 16 | 0 | 0.000 | 0.303 | 0.298 | 0.00351 |
| Tony Graffanino | Brian Anderson | 21 | 1 | 0.048 | 0.281 | 0.315 | 0.00382 |
| Mark Loretta | Kirk Rueter | 27 | 3 | 0.111 | 0.314 | 0.349 | 0.00532 |
| Travis Hafner | Bartolo Colon | 15 | 0 | 0.000 | 0.295 | 0.284 | 0.00670 |
| Edgar Renteria | Rodrigo Lopez | 13 | 0 | 0.000 | 0.297 | 0.311 | 0.00789 |
| Carlos Lee | Brian Anderson | 33 | 4 | 0.121 | 0.287 | 0.320 | 0.00796 |
| Rod Barajas | Bartolo Colon | 18 | 0 | 0.000 | 0.244 | 0.234 | 0.00833 |
| Jim Edmonds | Jason Jennings | 13 | 0 | 0.000 | 0.280 | 0.308 | 0.00838 |
| Mark Teixeira | Aaron Sele | 12 | 0 | 0.000 | 0.282 | 0.319 | 0.00989 |
| Adrian Beltre | Jason Schmidt | 18 | 0 | 0.000 | 0.277 | 0.224 | 0.01053 |
| Mark Loretta | Matt Kinney | 11 | 0 | 0.000 | 0.314 | 0.339 | 0.01060 |
| Matt Lawton | Mark Mulder | 15 | 0 | 0.000 | 0.262 | 0.261 | 0.01066 |
| Melvin Mora | Gustavo Chacin | 12 | 0 | 0.000 | 0.312 | 0.314 | 0.01087 |
| Jay Gibbons | Mark Hendrickson | 12 | 0 | 0.000 | 0.270 | 0.307 | 0.01230 |
| Scott Hatteberg | Brian Shouse | 15 | 0 | 0.000 | 0.265 | 0.254 | 0.01245 |
| Vernon Wells | Daniel Cabrera | 14 | 0 | 0.000 | 0.288 | 0.267 | 0.01298 |
| Aubrey Huff | Tim Wakefield | 26 | 2 | 0.077 | 0.290 | 0.275 | 0.01362 |
So given Garret Anderson’s .300 batting average over the past three years and his expected average against Brian Anderson of .335, his 0 for 22 registered a probability of just .013%. In other words, if the model (Bernoulli trials and the expected average) perfectly mimicked real life, the odds of Anderson going 0 for 22 would be around 1 in 7,700. These kinds of odds lead us to say that in the battle of the Andersons, Brian very likely possesses some ability to get Garret out (starting with his left-handedness of course). In other words the assumptions of our model probably don’t hold for this matchup.
As you can tell from the table, the higher the expected batting average, the fewer hitless at-bats it takes to make the list. Mark Loretta with his .339 expected average off of Matt Kinney makes the list with his 0 for 11, and the p-value associated with that matchup is lower than the 0 for 15 Matt Lawton recorded against Mark Mulder. The reason of course is that it is more unlikely for the .339 hitting Loretta to go 0 for 11 than it is for the .261 hitting Lawton. So while it seems paradoxical, generally speaking a manager might make the decision to pinch-hit for a good hitter based on the evidence of fewer at-bats than he would for an average or poor hitter.
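For a hitless matchup the binomial calculation collapses to a single term, the chance of making an out raised to the number of at-bats. The sketch below applies that to three matchups from the table, using the rounded expected averages as printed, so the results can differ from the table in the last digit or two. The function name is an assumption for illustration.

```python
def prob_hitless(at_bats, expected_avg):
    """Chance of going 0 for `at_bats` when each at-bat is a hit with probability `expected_avg`."""
    return (1 - expected_avg) ** at_bats

anderson = prob_hitless(22, 0.335)  # Garret Anderson vs. Brian Anderson
loretta = prob_hitless(11, 0.339)   # Mark Loretta vs. Matt Kinney
lawton = prob_hitless(15, 0.261)    # Matt Lawton vs. Mark Mulder

print(round(anderson, 5), round(loretta, 5), round(lawton, 5))
```

Even with four fewer at-bats, Loretta's hitless run comes out slightly less likely than Lawton's because his expected average is so much higher.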
You should also notice that almost all of these hitters have expected averages higher than the major league average (which was .266 over the three-year period) and in fact their cumulative average is .296. This is what one would expect since a higher average means that going hitless against a particular pitcher is less likely.
Of the 956 matchups that produced p-values less than .05, 204 of them were of the low-hit variety where the hitter recorded zero or only a few hits. And in looking at those 204, only one had seven at-bats and none had fewer. So as common sense would dictate, a 0 for 6 or 0 for 7 probably isn’t a big enough sample on which to base decisions. The 0 for 7 with the lowest p-value was Sean Casey versus Ben Hendrickson, where Casey was expected to hit a whopping .356 off Hendrickson, who gave up an average of .310 over the past three years. That matchup just slipped under the .05 threshold at .0458.
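The same one-term calculation shows why seven at-bats is about the floor for a significant hitless matchup. The sketch below (the function name and the .05 default are assumptions for illustration) finds the shortest 0-for-n stretch whose probability drops under the cutoff for a given expected average.

```python
def min_hitless_at_bats(expected_avg, alpha=0.05):
    """Smallest n for which going 0 for n has probability below alpha."""
    n = 1
    while (1 - expected_avg) ** n >= alpha:
        n += 1
    return n

casey = min_hitless_at_bats(0.356)        # Sean Casey's expected average vs. Hendrickson
league_avg = min_hitless_at_bats(0.266)   # a league-average expected average
print(casey, league_avg)
```

With an expected average of .356 the answer is seven at-bats, while a league-average expectation of .266 needs ten.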
Here are the significant low-hit matchups with the most at-bats:
| Batter | Pitcher | AB | H | Avg | 3 Yr Avg | ExAvg | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Hank Blalock | Joel Pineiro | 35 | 5 | 0.143 | 0.279 | 0.281 | 0.04571 |
| Carlos Lee | Brian Anderson | 33 | 4 | 0.121 | 0.287 | 0.320 | 0.00796 |
| Alex Gonzalez | Livan Hernandez | 33 | 3 | 0.091 | 0.249 | 0.245 | 0.02402 |
| Shawn Green | Kirk Rueter | 31 | 4 | 0.129 | 0.277 | 0.310 | 0.01824 |
| Alex Rodriguez | Sidney Ponson | 30 | 5 | 0.167 | 0.302 | 0.331 | 0.03757 |
| Joe Crede | Brian Anderson | 29 | 3 | 0.103 | 0.251 | 0.282 | 0.01979 |
| Bobby Abreu | Mike Hampton | 28 | 2 | 0.071 | 0.296 | 0.303 | 0.00345 |
| Chone Figgins | Barry Zito | 28 | 3 | 0.107 | 0.293 | 0.259 | 0.04445 |
| Mark Loretta | Kirk Rueter | 27 | 3 | 0.111 | 0.314 | 0.349 | 0.00532 |
| Mark Loretta | Jason Jennings | 27 | 4 | 0.148 | 0.314 | 0.343 | 0.02188 |
What’s more interesting, however, are those matchups that at first glance one might think are statistically significant but probably aren’t. The following list is a cluster of matchups with p-values around .20.
| Batter | Pitcher | AB | H | Avg | 3 Yr Avg | ExAvg | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Rocco Baldelli | Jon Lieber | 13 | 2 | 0.154 | 0.285 | 0.299 | 0.20406 |
| Royce Clayton | Paul Wilson | 14 | 2 | 0.143 | 0.260 | 0.280 | 0.20419 |
| Rich Aurilia | Brandon Webb | 11 | 1 | 0.091 | 0.269 | 0.247 | 0.20423 |
| Coco Crisp | Denny Bautista | 8 | 1 | 0.125 | 0.290 | 0.328 | 0.20425 |
| Frank Catalanotto | Pedro Martinez | 16 | 2 | 0.125 | 0.298 | 0.247 | 0.20464 |
| Julio Franco | Al Leiter | 14 | 2 | 0.143 | 0.295 | 0.279 | 0.20466 |
| Brian Schneider | Josh Beckett | 23 | 3 | 0.130 | 0.253 | 0.225 | 0.20469 |
| Carlos Guillen | Terry Mulholland | 11 | 2 | 0.182 | 0.305 | 0.348 | 0.20488 |
| Todd Helton | Tom Martin | 12 | 2 | 0.167 | 0.343 | 0.321 | 0.20492 |
| John Mabry | John Thomson | 10 | 1 | 0.100 | 0.258 | 0.268 | 0.20503 |
| Mike Matheny | Kris Benson | 16 | 2 | 0.125 | 0.247 | 0.247 | 0.20546 |
| Nomar Garciaparra | Brett Myers | 9 | 1 | 0.111 | 0.299 | 0.295 | 0.20548 |
| Jason Kendall | Russ Ortiz | 9 | 1 | 0.111 | 0.305 | 0.295 | 0.20568 |
| Todd Helton | Jesse Foppert | 8 | 1 | 0.125 | 0.343 | 0.327 | 0.20569 |
| Marcus Giles | Kevin Millwood | 9 | 1 | 0.111 | 0.305 | 0.295 | 0.20577 |
| Wes Helms | Kris Benson | 10 | 1 | 0.100 | 0.268 | 0.268 | 0.20582 |
| Jay Gibbons | Jake Westbrook | 10 | 1 | 0.100 | 0.270 | 0.268 | 0.20589 |
| Morgan Ensberg | Jerome Williams | 10 | 1 | 0.100 | 0.283 | 0.268 | 0.20595 |
What these matchups reveal is that the common sense notion that a 1 for 11 or a 2 for 16 is enough to conclude that a particular hitter has trouble with a particular pitcher is often flawed. The above list shows that these kinds of performances don’t necessarily indicate that the hitter will continue to perform worse than his expected average. For example, the expected average of Todd Helton versus Tom Martin is .321, but Helton hit just .167 in his 12 at-bats from 2003 through 2005. The p-value of .205 may lead us to conclude that this is not enough evidence to assume that Helton is not really a .321 hitter against Martin since the odds of Helton getting just two hits in those 12 at-bats is one in five (once again, if the model holds). In other words, matchups like this probably aren’t in and of themselves enough on which to base a pinch-hitting decision.
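The Helton example can be reproduced as a lower-tail calculation: the chance of two or fewer hits in 12 at-bats for a hitter expected to bat .321. This is a sketch with illustrative function names, using the rounded expected average from the table.

```python
from math import comb

def prob_at_most(at_bats, hits, p):
    """Probability of `hits` or fewer hits in `at_bats` at-bats."""
    return sum(
        comb(at_bats, k) * p**k * (1 - p)**(at_bats - k)
        for k in range(hits + 1)
    )

# Todd Helton vs. Tom Martin: 2 for 12 with an expected average of .321.
helton_p = prob_at_most(12, 2, 0.321)
print(round(helton_p, 3))
```

The result is about .21, or roughly one chance in five, in line with the table's .205.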
As an aside this tracks very well with the wisdom of Earl Weaver in his book Weaver on Strategy where he said when talking about matchups that “Most of the time, I think a player needs around 20 at-bats before I can get a reading on him against a certain pitcher.”
But that doesn’t mean that these p-values aren’t low enough on which to base decisions in a game. For example, let’s say La Russa is trying to decide between two pinch-hitters and that they are close in overall ability. If one of them shows a good matchup (say a 4 for 8) with a p-value of 0.2, he would be remiss to simply disregard the data and go on a hunch because the hitter could turn out to be a .200 hitter against this pitcher as indicated by his expected average. The lower p-value leads one to believe that there is a decent chance that the player in question really does hit better than his expected average against that particular pitcher.
On the other end of the spectrum here are the 25 most statistically significant matchups where the hitters were very successful.
| Batter | Pitcher | AB | H | Avg | 3 Yr Avg | ExAvg | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Larry Bigbie | Andy Pettitte | 14 | 11 | 0.786 | 0.276 | 0.256 | 0.00005 |
| Michael Young | Brandon Backe | 10 | 9 | 0.900 | 0.317 | 0.318 | 0.00024 |
| Marcus Giles | Jason Schmidt | 14 | 10 | 0.714 | 0.305 | 0.248 | 0.00032 |
| Preston Wilson | Jae Seo | 6 | 6 | 1.000 | 0.268 | 0.271 | 0.00040 |
| Preston Wilson | Byung-Hyun Kim | 10 | 8 | 0.800 | 0.268 | 0.254 | 0.00046 |
| Enrique Wilson | Pedro Martinez | 13 | 8 | 0.615 | 0.214 | 0.174 | 0.00046 |
| Jose Reyes | Jon Lieber | 13 | 10 | 0.769 | 0.277 | 0.292 | 0.00050 |
| Mark Grudzielanek | Tim Hudson | 6 | 6 | 1.000 | 0.304 | 0.286 | 0.00055 |
| Derrek Lee | Mark Mulder | 15 | 11 | 0.733 | 0.295 | 0.294 | 0.00056 |
| Matt Holliday | Woody Williams | 6 | 6 | 1.000 | 0.299 | 0.296 | 0.00067 |
| Aubrey Huff | Jon Lieber | 13 | 10 | 0.769 | 0.290 | 0.305 | 0.00076 |
| Todd Helton | Damian Moss | 7 | 7 | 1.000 | 0.343 | 0.367 | 0.00090 |
| Clint Barmes | Odalis Perez | 12 | 9 | 0.750 | 0.289 | 0.282 | 0.00102 |
| Reggie Sanders | David Weathers | 5 | 5 | 1.000 | 0.272 | 0.266 | 0.00133 |
| David Bell | Gary Majewski | 7 | 6 | 0.857 | 0.253 | 0.251 | 0.00137 |
| Charles Johnson | Jake Peavy | 6 | 5 | 0.833 | 0.230 | 0.197 | 0.00150 |
| David Dellucci | Kevin Brown | 14 | 9 | 0.643 | 0.242 | 0.241 | 0.00161 |
| Mark Kotsay | Jamie Moyer | 21 | 13 | 0.619 | 0.288 | 0.288 | 0.00162 |
| Adrian Beltre | Dontrelle Willis | 9 | 7 | 0.778 | 0.277 | 0.264 | 0.00192 |
| Brad Wilkerson | Mike Matthews | 7 | 6 | 0.857 | 0.257 | 0.268 | 0.00198 |
| Matt LeCroy | Nate Robertson | 15 | 10 | 0.667 | 0.273 | 0.280 | 0.00207 |
| Mark Sweeney | Adam Eaton | 11 | 8 | 0.727 | 0.277 | 0.271 | 0.00213 |
| Jermaine Dye | Jarrod Washburn | 24 | 13 | 0.542 | 0.253 | 0.252 | 0.00228 |
| Aaron Rowand | Tim Wakefield | 9 | 7 | 0.778 | 0.288 | 0.272 | 0.00232 |
| Jeff Cirillo | Javier Vazquez | 6 | 5 | 0.833 | 0.234 | 0.218 | 0.00243 |
If you look closely you’ll also notice that five of the 25 slots (six if you count Larry Bigbie who had 66 at-bats with the Rockies in 2005) are filled by players who have spent time with the Rockies. Given the way in which Coors Field inflates offense Matt Holliday’s 6 for 6 against Woody Williams and Clint Barmes going 9 for 12 off Odalis Perez are likely park influenced as well. As a result we could make adjustments to the model to take park factors into account.
You’ll also notice that hitters don’t need as many plate appearances to make this list. In other words, it requires relatively few consecutive hits to reach a statistically significant result. So while Weaver may generally need around 20 at-bats, a 5 for 6 or 6 for 7 is likely almost always enough to “get a reading.”
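To see how quickly a hot start becomes significant, the sketch below computes the upper-tail probability of a 5 for 6 and a 6 for 7 across a few expected averages. The function name and the chosen averages are assumptions for illustration, not figures from the study.

```python
from math import comb

def prob_at_least(at_bats, hits, p):
    """Probability of `hits` or more hits in `at_bats` at-bats."""
    return sum(
        comb(at_bats, k) * p**k * (1 - p)**(at_bats - k)
        for k in range(hits, at_bats + 1)
    )

high_hit = {}
for expected_avg in (0.220, 0.266, 0.300, 0.350):
    high_hit[expected_avg] = (
        prob_at_least(6, 5, expected_avg),  # 5 for 6
        prob_at_least(7, 6, expected_avg),  # 6 for 7
    )
    print(expected_avg, [round(v, 4) for v in high_hit[expected_avg]])
```

Under the binomial model both lines stay below .05 even at an expected average of .350.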
Overall 752 of the 956 significant matchups were of the high-hit variety with the following having the most at-bats:
| Batter | Pitcher | AB | H | Avg | 3 Yr Avg | ExAvg | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Derek Jeter | Rodrigo Lopez | 40 | 19 | 0.475 | 0.307 | 0.321 | 0.03009 |
| Jack Wilson | Ben Sheets | 32 | 13 | 0.406 | 0.275 | 0.253 | 0.04100 |
| Todd Helton | Odalis Perez | 31 | 16 | 0.516 | 0.343 | 0.334 | 0.02790 |
| Eric Chavez | Jamie Moyer | 30 | 13 | 0.433 | 0.275 | 0.276 | 0.04598 |
| Jimmy Rollins | Carl Pavano | 29 | 13 | 0.448 | 0.281 | 0.285 | 0.04421 |
| Bobby Kielty | Mark Buehrle | 27 | 11 | 0.407 | 0.244 | 0.248 | 0.04958 |
| Todd Walker | Roger Clemens | 27 | 11 | 0.407 | 0.287 | 0.239 | 0.03964 |
| Eric Byrnes | Mark Buehrle | 27 | 12 | 0.444 | 0.260 | 0.264 | 0.03282 |
| Trot Nixon | Roy Halladay | 27 | 13 | 0.481 | 0.295 | 0.275 | 0.01751 |
| Johnny Damon | Bartolo Colon | 26 | 12 | 0.462 | 0.298 | 0.286 | 0.04313 |
Wrapping It Up
So is Ramirez’s performance against Carpenter significant? Ramirez hit .296 over the past three years, while Carpenter allowed a batting average of .237. Given the major league average, Ramirez would have been expected to hit .264. He actually hit .231 (6-26), so the p-value is a healthy .448, which means that Ramirez may in fact be a .264 hitter against Carpenter. The fact that Ramirez hit about 30 points lower than he “should” have could very well be chalked up to chance.
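Putting the two pieces together for this matchup, the sketch below first computes the log5 expected average and then the chance of six or fewer hits in 26 at-bats. The function names are illustrative and the inputs are the rounded figures quoted above, so the output may differ slightly from the .448 in the text.

```python
from math import comb

def log5_expected_avg(bavg, pavg, lg_avg):
    """Expected average for a batter/pitcher matchup using the log5 formula."""
    hit_term = (bavg * pavg) / lg_avg
    out_term = ((1 - bavg) * (1 - pavg)) / (1 - lg_avg)
    return hit_term / (hit_term + out_term)

def prob_at_most(at_bats, hits, p):
    """Probability of `hits` or fewer hits in `at_bats` at-bats."""
    return sum(
        comb(at_bats, k) * p**k * (1 - p)**(at_bats - k)
        for k in range(hits + 1)
    )

expected = log5_expected_avg(0.296, 0.237, 0.266)  # Ramirez vs. Carpenter
ramirez_p = prob_at_most(26, 6, expected)          # 6 for 26 or worse
print(round(expected, 3), round(ramirez_p, 3))
```

A p-value this close to one-half says a 6 for 26 is almost exactly what the model would expect from a hitter of that expected average.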
While all of this is interesting, there are many issues that serve to cloud the picture which I haven’t explored, and which, if you’re still reading this article, you’re probably more suited to pursue than I. In addition to augmenting the model to take into account park effects, some hitters hit better against fly ball versus ground ball pitchers and vice versa, so their probability of success should go up or down depending on the pitcher’s profile. Of course the same story can be said of platoon effects.
Further, as hinted at previously, batting average is highly variable from season to season, a fact Albert demonstrated in a 2004 article titled “A Batting Average: Does It Represent Ability or Luck?”. In that article Albert concludes that measures such as strikeout rate, walk rate, home run rate and on-base percentage are all much more strongly correlated from year to year than batting average on balls in play (removing the effect of strikeouts) as well as batting average itself. In short, as much as 50% of the difference in batting average between players can be attributed to luck while 50% can be attributed to differences in their hitting ability. As a result, a measure like slugging percentage or OPS would probably be a better candidate for this kind of study, although it would require using a different kind of model.
References & Resources