Batters and BABIP

by Chris Dutton
December 2, 2008

What we did

Batting average on balls in play—the rate at which batted balls other than home runs become hits—is commonly used as a measure of pitching performance. However, precious little work has been done to explore BABIP from the hitter’s perspective. While luck is bound to play a large role in determining whether a ball in play will become a hit or an out, there are certainly some quantifiable aspects of hitting ability that give a batter at least some control of the outcome.

Some people like to add .120 to a batter’s Line Drive Percentage to predict his BABIP (a guideline originally suggested by Dave Studeman). But one would expect that BABIP depends on more than just the ability to hit line drives. Speed, for instance, clearly seems to play a significant role. And what about the ability to control the strike zone, make consistent contact and hit the ball to all fields?

For example, if Jacoby Ellsbury hits a ground ball in the hole between short and third, he has a higher chance of getting a hit than if Bengie Molina hits the exact same ball in the exact same place. Anecdotally, this is how Ichiro manages to get so many hits every year. And fans of the Red Sox, Yankees and Rays can tell you that David Ortiz, Jason Giambi and Carlos Pena have been robbed of many a base hit because of the extreme defensive shifts used against them, whereas Dustin Pedroia, Derek Jeter and BJ Upton have gotten more hits because of their batting eye and their ability to use the whole field. Surely, these factors contribute to whether or not a batted ball becomes a hit.

We endeavored to take a more scientific look at batted ball data to develop a better method of finding a hitter’s expected BABIP. Using Baseball Prospectus data from 2002-2008, we calculated a range of variables that we considered to be the primary factors in determining BABIP:

Variable	Description
BABIP	Batting average on balls in play, calculated as non-homerun hits divided by balls in play ((h-hr)/(pa-so-bb-hr)).
Hitter_Eye	A measure of plate discipline and knowledge of the strike zone, calculated as (BB rate/SO rate).
Pitches_perEBH	Pitches per extra base hit, which is a measure of how often a hitter makes solid contact (pitches/(doub+trip+hr)).
LD_per	Line drive percentage, as defined by MLB Advanced Media and provided by Baseball Prospectus.
FB_GB_ratio	Fly ball/ground ball ratio, using percentages provided by Baseball Prospectus.
Speed Score	A comprehensive measure of speed, developed by Bill James. The speed score is the average of five individual formulas based on stolen base percentage, stolen base attempts, triples, runs per time on base and double plays.
Contact_Rate	A measure of the ability to make contact and avoid striking out, simply calculated as ((ab-so)/ab).
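Spray	Measure of how well a hitter distributes balls in play to the entire field. Calculated as |LF% - RF%|, the absolute difference between the share of balls hit to left field and the share hit to right field, so a lower value means a more even distribution.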
Pitches	A hitter’s average number of pitches per plate appearance, to account for patience and selectiveness at the plate.
Park	A vector of binary stadium variables, to account for the influence of park effects on BABIP.
Year	A vector of year variables from 2002 through 2007, to account for potential time effects.
Lefty	A binary variable equal to “1” if the hitter is a lefty, “0” otherwise.
Switch	A binary variable equal to “1” if the player is a switch hitter, “0” otherwise.
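
As a rough illustration, the formula-based variables in the table can be computed from ordinary counting stats. The sketch below is ours, not the original research code: the function names and the sample stat line are hypothetical, and Hitter_Eye assumes both rates share the same denominator (plate appearances), in which case the ratio reduces to BB/SO.

```python
def babip(h, hr, pa, so, bb):
    """Non-homerun hits divided by balls in play: (h-hr)/(pa-so-bb-hr)."""
    return (h - hr) / (pa - so - bb - hr)

def hitter_eye(bb, so, pa):
    """BB rate over SO rate; assumes both are per plate appearance."""
    return (bb / pa) / (so / pa)

def pitches_per_ebh(pitches, doubles, triples, hr):
    """Pitches seen per extra-base hit."""
    return pitches / (doubles + triples + hr)

def contact_rate(ab, so):
    """Share of at-bats that do not end in a strikeout: (ab-so)/ab."""
    return (ab - so) / ab

def spray(lf_pct, rf_pct):
    """Absolute gap between left-field and right-field shares."""
    return abs(lf_pct - rf_pct)

# Hypothetical season line, invented purely for illustration.
season = dict(pa=650, ab=580, h=170, hr=20, so=100, bb=60,
              doubles=35, triples=5, pitches=2500)

print(f"BABIP          {babip(season['h'], season['hr'], season['pa'], season['so'], season['bb']):.3f}")
print(f"Hitter_Eye     {hitter_eye(season['bb'], season['so'], season['pa']):.2f}")
print(f"Pitches_perEBH {pitches_per_ebh(season['pitches'], season['doubles'], season['triples'], season['hr']):.1f}")
print(f"Contact_Rate   {contact_rate(season['ab'], season['so']):.3f}")
print(f"Spray          {spray(0.40, 0.25):.2f}")
```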

Using this dataset, we designed a regression model to determine the relationship between each factor and a hitter’s BABIP. Essentially, the model takes seven years worth of data and compresses it into a single formula that inputs the variables above and spits out a predicted BABIP. Using this, we can compare players’ actual and predicted BABIP to identify instances in which a player significantly outperformed or underperformed his expectations. Furthermore, we can use the model to strip luck from the equation and calculate a “luck-neutral” measure of BABIP.
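
For readers who want to see the mechanics, here is a minimal sketch of fitting such a model by ordinary least squares. It is not the authors' code or data: it uses only two of the predictors (line drive rate and speed score), the data are synthetic, and the "true" coefficients used to generate them are invented for the demonstration.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Predicted BABIP for one hitter-season."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def r_squared(actual, fitted):
    """Share of the variance in actual values explained by the fitted values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Synthetic hitter-seasons: (line drive rate, speed score) -> BABIP plus noise.
random.seed(0)
rows, y = [], []
for _ in range(200):
    ld = random.uniform(0.14, 0.26)
    speed = random.uniform(2.0, 8.0)
    rows.append((ld, speed))
    y.append(0.20 + 0.40 * ld + 0.005 * speed + random.gauss(0, 0.02))

coefs = fit_ols(rows, y)
fitted = [predict(coefs, r) for r in rows]
r2 = r_squared(y, fitted)
print("coefficients:", [round(c, 4) for c in coefs])
print("R-squared:", round(r2, 3))
```

The noise term plays the role of luck here: even with the right inputs, the fit leaves a large share of the variance unexplained.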

Our regression model yields an R-squared value of .348, and all non-vector explanatory variables are significant at the 1 percent level. This suggests that the factors included are all highly significant, and jointly explain roughly 35 percent of the variance in a hitter’s BABIP. As an additional test of accuracy, we find a robust 59 percent correlation between actual and predicted BABIP for all players in our sample.
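
The R-squared and correlation figures above are two views of the same fit: for a least-squares regression with an intercept, the correlation between actual and predicted values equals the square root of R-squared. A quick check, offered as an illustration rather than as part of the original analysis:

```python
import math

# R-squared reported for the full model in the text.
r2_full = 0.348

# For least squares with an intercept, corr(actual, predicted) = sqrt(R^2).
corr_full = math.sqrt(r2_full)

print(f"implied correlation: {corr_full:.2f}")
```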

Given the tremendous uncertainty regarding the outcome of balls in play, these results are extremely promising. By contrast, commonly used models based on line drive percentage alone explain only about 3 percent of the variance in BABIP when applied to the same dataset, and yield a mere 18 percent correlation between predicted and actual values.

As mentioned above, all of our key independent variables are statistically significant at the 1 percent level. That is to say, if these variables had no real relationship with BABIP, effects this large would be very unlikely to show up by chance alone. Our regression results show positive effects for hitter’s eye, line drive percentage, speed score and pitches per plate appearance, all of which conform to common sense. On the other hand, we find negative coefficients on pitches per extra-base hit, fly ball/ground ball ratio, spray and contact rate.

One might expect a higher contact rate to lead to a higher BABIP, but the opposite actually seems to be the case. This is likely caused by the correlation between strikeouts and power, since players who swing hard tend to either miss entirely or crush the ball for hits. If this theory is reflected in our data, it makes sense that we would expect a player with a lower contact rate to generate a higher predicted BABIP. This is consistent with Studeman’s follow-up work on BABIP.

What does it mean?

Okay, now you know what we did. Let’s discuss what it means.

We’ve developed a new and better way of finding a batter’s expected BABIP. We will call our model’s predicted BABIP “xBABIP,” in contrast to the old way of estimating expected BABIP, which was LD% + .120. We will refer to that old estimate as “old-xBABIP.”
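
To make the two yardsticks concrete, here is a small sketch of how a "luck" figure is read off against each of them. The function names are ours, and the example uses Brian Schneider's 2008 line from the tables later in this article. His line drive rate is not listed there, so the value below is an assumption back-calculated from his old-xBABIP of .355.

```python
LD_CONSTANT = 0.120  # the constant from the LD% + .120 guideline

def old_xbabip(ld_pct):
    """Expected BABIP under the old guideline: line drive rate plus .120."""
    return ld_pct + LD_CONSTANT

def luck(babip, expected):
    """Actual minus expected BABIP; positive reads as lucky, negative as unlucky."""
    return babip - expected

# Brian Schneider, 2008 (BABIP and xBABIP as listed in the tables in this article).
babip_2008 = 0.275
xbabip_2008 = 0.289
ld_pct = 0.235  # assumption: back-calculated from the listed old-xBABIP of .355

print(f"old-xBABIP:          {old_xbabip(ld_pct):.3f}")
print(f"luck vs. old-xBABIP: {luck(babip_2008, old_xbabip(ld_pct)):+.3f}")
print(f"luck vs. xBABIP:     {luck(babip_2008, xbabip_2008):+.3f}")
```

Against the old guideline he looks badly unlucky, while against xBABIP he looks only slightly so.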

The idea is to separate skill from variance. We’ve isolated a batter’s skill at getting hits on balls in play; therefore, we can assume that most deviation in BABIP from our model’s predicted BABIP is likely due to random fluctuation, and therefore unlikely to be repeated.

We can actually test this theory by looking to the past. Let’s examine the players whose actual BABIPs in 2007 differed most from their xBABIPs (the expected BABIP as predicted by our model), and then look at what happened in 2008. Our hypothesis is that these players shouldn’t consistently under- or over-perform their xBABIP.

Let’s start with players who were “unlucky” in 2007.

  YEAR    NAME             BABIP xBABIP  Diff  YEAR  NAME            BABIP xBABIP  Diff
  2007  Ramon Vazquez      .258    .322 -.063  2008  Ramon Vazquez    .342   .322  .020
  2007  John Buck          .233    .283 -.049  2008  John Buck        .269   .282 -.013
  2007  Bobby Crosby       .253    .303 -.050  2008  Bobby Crosby     .275   .276 -.001
  2007  Julio Lugo         .258    .309 -.051  2008  Julio Lugo       .312   .284  .029
  2007  Ray Durham         .231    .276 -.044  2008  Ray Durham       .342   .313  .029
  2007  Lyle Overbay       .270    .321 -.051  2008  Lyle Overbay     .311   .310  .001
  2007  Rickie Weeks       .270    .321 -.050  2008  Rickie Weeks     .266   .294 -.028
  2007  Dioner Navarro     .243    .286 -.043  2008  Dioner Navarro   .313   .303  .010
  2007  Brad Wilkerson     .269    .316 -.046  2008  Brad Wilkerson   .267   .279 -.011
  2007  Jay Payton         .261    .299 -.038  2008  Jay Payton       .266   .304 -.038
  2007  Adam Lind          .265    .303 -.038  2008  Adam Lind        .313   .302  .011
  2007  Ian Kinsler        .267    .305 -.038  2008  Ian Kinsler      .325   .295  .030
  2007  Nick Punto         .251    .285 -.034  2008  Nick Punto       .331   .304  .027
  2007  Dan Uggla          .268    .304 -.036  2008  Dan Uggla        .313   .294  .018

Wow, that’s pretty compelling evidence for the model. We didn’t cherry-pick these, either: these were the “unluckiest” hitters of 2007 who also had enough plate appearances to qualify for our model in 2008. Only Rickie Weeks and Jay Payton saw their actual BABIP remain well below their xBABIP in 2008, while everyone else had a 2008 BABIP that was either very close to their xBABIP or above it. Had we seen these numbers after 2007, we might have been able to predict the rise of Vazquez, Navarro, Lind, Kinsler and Uggla, all of whom seemingly “came out of nowhere” in 2008.
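
The same point can be summarized with a little arithmetic on the Diff columns of the table above. This is only a sketch for checking the table by hand; the variable names are ours and the values are copied as printed, so they carry the table's rounding.

```python
# Diff columns (BABIP minus xBABIP) copied from the table above, in row order.
diff_2007 = [-.063, -.049, -.050, -.051, -.044, -.051, -.050,
             -.043, -.046, -.038, -.038, -.038, -.034, -.036]
diff_2008 = [.020, -.013, -.001, .029, .029, .001, -.028,
             .010, -.011, -.038, .011, .030, .027, .018]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

mean_2007 = mean(diff_2007)
mean_2008 = mean(diff_2008)

# Players whose 2008 BABIP was still below xBABIP, by any margin.
still_below = sum(1 for d in diff_2008 if d < 0)

print(f"mean Diff 2007: {mean_2007:+.3f}")
print(f"mean Diff 2008: {mean_2008:+.3f}")
print(f"still below xBABIP in 2008: {still_below} of {len(diff_2008)}")
```

As a group, these 14 hitters went from roughly 45 points below their xBABIP in 2007 to roughly 6 points above it in 2008, which is what we would expect if most of the 2007 shortfall was luck.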

And what about hitters who were particularly lucky in 2007?

  YEAR    NAME             BABIP  xBABIP Diff  YEAR   NAME           BABIP xBABIP  Diff
  2007  Matt Kemp          .411    .301  .110  2008  Matt Kemp        .359   .312  .047
  2007  Ichiro Suzuki      .384    .317  .067  2008  Ichiro Suzuki    .330   .307  .023
  2007  Willy Taveras      .355    .293  .062  2008  Willy Taveras    .282   .292 -.010
  2007  Magglio Ordonez    .379    .315  .064  2008  Magglio Ordonez  .331   .303  .028
  2007  Howie Kendrick     .374    .314  .060  2008  Howie Kendrick   .351   .316  .035
  2007  Jayson Werth       .380    .322  .058  2008  Jayson Werth     .319   .314  .005
  2007  Mark Reynolds      .368    .313  .055  2008  Mark Reynolds    .319   .304  .014
  2007  Edgar Renteria     .373    .319  .053  2008  Edgar Renteria   .289   .301 -.012
  2007  Mike Lowell        .335    .288  .047  2008  Mike Lowell      .278   .282 -.003
  2007  Ryan Braun         .353    .304  .050  2008  Ryan Braun       .301   .287  .014
  2007  David Ortiz        .352    .306  .046  2008  David Ortiz      .269   .302 -.033
  2007  Jose Vidro         .333    .290  .043  2008  Jose Vidro       .242   .290 -.048
  2007  B.J. Upton         .387    .338  .048  2008  B.J. Upton       .340   .340  .000
  2007  Luis Castillo      .318    .284  .034  2008  Luis Castillo    .258   .245  .013

Again good results, although more mixed. Kemp, Ichiro and Kendrick again significantly beat their xBABIP in 2008. Interestingly, Ichiro and Kendrick are both known to be unique hitters. Does Matt Kemp do anything differently than most other hitters?

But nearly all of the “lucky” players in 2007 regressed in 2008. The model predicted the downfall of Renteria, Taveras, Vidro (although he was also quite unlucky in 08) and Castillo. It correctly predicted a return-to-earth for Upton, Reynolds, Ortiz, Braun and Lowell.

Next, let’s look at hitters for whom xBABIP disagreed strongly with old-xBABIP. Here are the top cases where old-xBABIP overrated players in 2008:

  YEAR  NAME                BABIP        xBABIP   old-xBABIP
  2008  Brian Schneider     .275         .289         .355
  2008  Ryan Ludwick        .333         .325         .404
  2008  Kevin Millar        .244         .257         .311
  2008  Jesus Flores        .309         .293         .353
  2008  Omar Infante        .323         .294         .356
  2008  Joey Gathright      .278         .156         .215
  2008  Jose Lopez          .302         .286         .346
  2008  Khalil Greene       .251         .276         .336
  2008  Cesar Izturis       .271         .286         .344
  2008  Todd Helton         .296         .312         .369
  2008  John Bowker         .296         .308         .364
  2008  Damion Easley       .278         .262         .317
  2008  Paul Konerko        .243         .280         .330
  2008  Clint Barmes        .322         .308         .360
  2008  Freddy Sanchez      .282         .304         .356
  2008  Jack Wilson         .284         .299         .348
  2008  Omar Vizquel        .239         .277         .326
  2008  Dioner Navarro      .313         .303         .352
  2008  Xavier Nady         .327         .321         .370

For these players, the old guideline would lead you to believe that the players had been rather unlucky this season. However, our new model shows that these players were far less unlucky than previously thought. In other words, simply using line-drive percentage to predict BABIP overrated these players.

And the players that were most underrated by the old model:

  YEAR  NAME                BABIP        xBABIP   old-xBABIP
  2008  Gary Matthews Jr.   .289         .307         .252
  2008  Hunter Pence        .298         .290         .236
  2008  Jeff Mathis         .231         .269         .217
  2008  Alexi Casilla       .288         .281         .235
  2008  Fred Lewis          .365         .336         .293
  2008  Carlos Gomez        .324         .301         .260
  2008  Delmon Young        .334         .306         .268
  2008  Nick Punto          .331         .304         .267
  2008  Jacoby Ellsbury     .305         .326         .290
  2008  Lance Berkman       .336         .309         .273
  2008  Rickie Weeks        .266         .294         .260
  2008  Denard Span         .328         .338         .306
  2008  Michael Bourn       .283         .277         .246
  2008  Yunel Escobar       .303         .296         .265
  2008  Erick Aybar         .297         .304         .274
  2008  Brendan Harris      .312         .297         .273
  2008  Jason Varitek       .270         .295         .272
  2008  Coco Crisp          .308         .321         .298
  2008  Howie Kendrick      .351         .316         .294

For the most part, our model finds these players’ actual BABIPs to be more in line with expectations than the old model does. In other words, old-xBABIP may think that Alexi Casilla got lucky, but our model suggests he hit in line with expectations. Simply using line-drive percentage to predict BABIP underrated these players.

Finally, let’s take a look at the players who were the most lucky and unlucky this season. We’d expect that many of these players will regress in 2009—not necessarily all are going to, as some are simply going to get lucky or unlucky again. However, we can be confident that most of these players will experience regression in ’09.

Let’s start with 2008’s luckiest hitters:

  YEAR  NAME                BABIP   xBABIP   Diff
  2008  Joey Gathright       .278    .156    .122
  2008  Chipper Jones        .382    .325    .058
  2008  Matt Kemp            .359    .312    .047
  2008  Ryan Theriot         .335    .291    .044
  2008  Felipe Lopez         .324    .287    .037
  2008  Milton Bradley       .375    .334    .041
  2008  Aaron Miles          .337    .301    .037
  2008  Yadier Molina        .307    .274    .033
  2008  Shin-soo Choo        .359    .320    .039
  2008  Geovany Soto         .331    .295    .036
  2008  Mike Aviles          .355    .317    .038
  2008  Reed Johnson         .338    .302    .036
  2008  Jason Bay            .318    .285    .033
  2008  Chone Figgins        .329    .295    .034
  2008  Chase Headley        .356    .319    .036
  2008  Howie Kendrick       .351    .316    .035
  2008  Edgar V Gonzalez     .335    .302    .033
  2008  Ryan Doumit          .328    .297    .031
  2008  Manny Ramirez        .360    .326    .034
  2008  Aaron Rowand         .318    .288    .029

Unsurprisingly, this list includes a lot of 2008’s surprises—Bradley, Miles, Aviles, Doumit, Choo, Lopez. Interestingly, Gathright’s xBABIP of .156 was nearly 90 points lower than the next-closest person (and remember, we do take speed into account in the model). Maybe Geovany Soto isn’t quite this good. Perhaps Manny Ramirez and Milton Bradley will disappoint whoever signs them. The Cardinals’ middle infielders aren’t as good as they seemed.

And 2008’s unluckiest hitters:

  YEAR  NAME                BABIP   xBABIP   Diff
  2008  Brandon Inge         .229    .292   -.063
  2008  Corey Patterson      .210    .262   -.051
  2008  Carlos Ruiz          .230    .282   -.052
  2008  Willy Aybar          .261    .314   -.054
  2008  Jason Giambi         .234    .282   -.048
  2008  Nick Swisher         .245    .294   -.049
  2008  Jose Vidro           .242    .290   -.048
  2008  Kenji Johjima        .226    .266   -.040
  2008  Austin Kearns        .242    .284   -.042
  2008  Jeff Mathis          .231    .269   -.038
  2008  Omar Vizquel         .239    .277   -.038
  2008  Adrian Beltre        .275    .319   -.044
  2008  Mike Jacobs          .259    .300   -.040
  2008  Paul Konerko         .243    .280   -.038
  2008  Brandon Boggs        .296    .342   -.046
  2008  Jim Edmonds          .246    .283   -.037
  2008  Eric Hinske          .267    .306   -.038
  2008  Willie Harris        .268    .306   -.038
  2008  Jay Payton           .266    .304   -.038
  2008  Gabe Gross           .272    .308   -.036

Some team is going to get a steal in Jason Giambi. Willy Aybar is deserving of full-time action. Nick Swisher and Austin Kearns are a lot better than they showed in 08. Would you believe that Brandon Boggs had the highest xBABIP in 2008 of any player in our database? Jim Edmonds may not be done quite yet. Adrian Beltre is very underrated.

While our model cannot explain all of the variation in BABIP, we believe that it is an improvement over current explanations of BABIP, as it takes into account many factors that influence a hitter’s BABIP. By finding players who over- and under-performed their expected BABIP, we can further isolate skill from luck, and infer that players such as Mike Aviles are likely to regress and players such as Nick Swisher are likely to improve.

Here’s a download of our results in an Excel file.

References & Resources
We owe a tremendous amount of thanks to Leanne, Dave, Jeremy, Steven and Kevin, who actively conducted the research with us, as part of Baseball Analysis at Tufts’ (BAT) Research Committee. The Committee, headed by Dutton, met once a week throughout the 2007/2008 academic year to discuss and conduct research, as well as analyze the results.

BAT, founded by Bendix and Matt Gallagher in 2005, is the first baseball analysis club on a college campus. It has hosted such speakers as Bill James, Alan Schwarz, Keith Law, sportswriters from the Boston Globe and more. It continues to host various speakers and events, as well as provide a forum for intelligent baseball discussion and research on the Tufts campus. For more information, please contact Peter Bendix or Chris Dutton.
