# Batters and BABIP

### What we did

Batting average on balls in play—the rate at which batted balls other than home runs become hits—is commonly used as a measure of pitching performance. However, precious little work has been done to explore BABIP from the hitter’s perspective. While luck is bound to play a large role in determining whether a ball in play will become a hit or an out, there are certainly some quantifiable aspects of hitting ability that give a batter at least some control of the outcome.

Some people like to add .120 to a batter’s Line Drive Percentage to predict his BABIP (a guideline originally suggested by Dave Studeman). But one would expect that BABIP depends on more than just the ability to hit line drives. Speed, for instance, clearly seems to play a significant role. And what about the ability to control the strike zone, make consistent contact and hit the ball to all fields?

For example, if Jacoby Ellsbury hits a ground ball in the hole between short and third, he has a higher chance of getting a hit than if Bengie Molina hits the exact same ball in the exact same place. Anecdotally, this is how Ichiro manages to get so many hits every year. And fans of the Red Sox, Yankees and Rays can tell you that David Ortiz, Jason Giambi and Carlos Pena have been robbed of many a base hit because of the extreme defensive shifts used against them, whereas Dustin Pedroia, Derek Jeter and BJ Upton have gotten more hits because of their batting eye and their ability to use the whole field. Surely, these factors contribute to whether or not a batted ball becomes a hit.

We endeavored to take a more scientific look at batted ball data to develop a better method of finding a hitter’s expected BABIP. Using Baseball Prospectus data from 2002-2008, we calculated a range of variables that we considered to be the primary factors in determining BABIP:

| Variable | Description |
| --- | --- |
| BABIP | Batting average on balls in play, calculated as non-home-run hits divided by balls in play: (h-hr)/(pa-so-bb-hr). |
| Hitter_Eye | A measure of plate discipline and knowledge of the strike zone, calculated as (BB rate/SO rate). |
| Pitches_perEBH | Pitches per extra-base hit, which is a measure of how often a hitter makes solid contact: pitches/(doub+trip+hr). |
| LD_per | Line drive percentage, as defined by MLB Advanced Media and provided by Baseball Prospectus. |
| FB_GB_ratio | Fly ball/ground ball ratio, using percentages provided by Baseball Prospectus. |
| Speed Score | A comprehensive measure of speed, developed by Bill James. The speed score is the average of five individual formulas based on stolen base percentage, stolen base attempts, triples, runs per time on base and double plays. |
| Contact_Rate | A measure of the ability to make contact and avoid striking out, simply calculated as (ab-so)/ab. |
| Spray | A measure of how well a hitter distributes balls in play to the entire field. Calculated as the absolute value of 1(LF%) + -1(RF%), i.e. the absolute difference between LF% and RF%. |
| Pitches | A hitter’s average number of pitches per plate appearance, to account for patience and selectiveness at the plate. |
| Park | A vector of binary stadium variables, to account for the influence of park effects on BABIP. |
| Year | A vector of year variables from 2002 through 2007, to account for potential time effects. |
| Lefty | A binary variable equal to “1” if the hitter is a lefty, “0” otherwise. |
| Switch | A binary variable equal to “1” if the player is a switch hitter, “0” otherwise. |
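The formula-based variables above can be written out directly. What follows is only a minimal sketch in Python: the `SeasonLine` container, its field names and the sample season are hypothetical, and treating BB rate and SO rate as per-plate-appearance rates is an assumption, since the definitions above do not spell out the denominator.

```python
from dataclasses import dataclass


@dataclass
class SeasonLine:
    """One hitter-season of counting stats (hypothetical field names)."""
    pa: int        # plate appearances
    ab: int        # at-bats
    h: int         # hits
    doubles: int
    triples: int
    hr: int
    bb: int        # walks
    so: int        # strikeouts
    pitches: int   # total pitches seen
    lf_pct: float  # share of balls in play hit to left field
    rf_pct: float  # share of balls in play hit to right field


def babip(s: SeasonLine) -> float:
    # Non-home-run hits divided by balls in play.
    return (s.h - s.hr) / (s.pa - s.so - s.bb - s.hr)


def hitter_eye(s: SeasonLine) -> float:
    # BB rate / SO rate, assuming both rates are per plate appearance.
    return (s.bb / s.pa) / (s.so / s.pa)


def pitches_per_ebh(s: SeasonLine) -> float:
    return s.pitches / (s.doubles + s.triples + s.hr)


def contact_rate(s: SeasonLine) -> float:
    return (s.ab - s.so) / s.ab


def spray(s: SeasonLine) -> float:
    # |1(LF%) + -1(RF%)|: zero means an even left/right split.
    return abs(s.lf_pct - s.rf_pct)


def pitches_per_pa(s: SeasonLine) -> float:
    return s.pitches / s.pa


# A made-up season line, for illustration only.
line = SeasonLine(pa=600, ab=540, h=160, doubles=30, triples=5, hr=20,
                  bb=50, so=100, pitches=2280, lf_pct=0.40, rf_pct=0.25)
print(f"BABIP {babip(line):.3f}, contact rate {contact_rate(line):.3f}")
```

Line drive percentage, fly ball/ground ball ratio and speed score are left out because they come from the data provider or from Bill James’ formulas rather than from a one-line calculation.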

Using this dataset, we designed a regression model to determine the relationship between each factor and a hitter’s BABIP. Essentially, the model takes seven years’ worth of data and compresses it into a single formula that takes the variables above as inputs and returns a predicted BABIP. With this, we can compare players’ actual and predicted BABIP to identify instances in which a player significantly outperformed or underperformed his expectations. Furthermore, we can use the model to strip luck from the equation and calculate a “luck-neutral” measure of BABIP.
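To make the idea concrete, here is a rough sketch of that workflow using ordinary least squares in NumPy. It is not the actual model: the data are synthetic, only two of the explanatory variables are included, and the coefficients used to generate the data are made up purely for illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to get a predicted BABIP per row."""
    design = np.column_stack([np.ones(len(X)), X])
    return design @ coef


# Synthetic player-seasons (assumption: stand-ins for the real dataset).
rng = np.random.default_rng(0)
n = 500
ld_pct = rng.normal(0.19, 0.03, n)   # line drive percentage
speed = rng.normal(5.0, 1.5, n)      # speed score
X = np.column_stack([ld_pct, speed])

# Made-up "true" relationship plus noise that plays the role of luck.
babip_actual = 0.22 + 0.30 * ld_pct + 0.005 * speed + rng.normal(0, 0.03, n)

coef = fit_ols(X, babip_actual)
xbabip = predict(coef, X)        # predicted, "luck-neutral" BABIP
luck = babip_actual - xbabip     # positive: outperformed expectations

print("intercept, LD%, speed coefficients:", np.round(coef, 3))
```

The residual `luck` is the quantity of interest: it is what remains of a player’s BABIP after the measured skills have been accounted for.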

Our regression model yields an R-squared value of .348, and all non-vector explanatory variables are significant at the 1 percent level. This suggests that the factors included are all highly significant, and jointly explain roughly 35 percent of the variance in a hitter’s BABIP. As an additional test of accuracy, we find a robust correlation of .59 between actual and predicted BABIP for all players in our sample.

Given the tremendous uncertainty regarding the outcome of balls in play, these results are extremely promising. By contrast, commonly used models based on line drive percentage alone explain only about 3 percent of the variance in BABIP when applied to the same dataset, and yield a correlation of just .18 between predicted and actual values.
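The variance-explained and correlation figures are two views of the same thing. For a least-squares fit with an intercept, the in-sample correlation between actual and predicted values is the square root of R-squared, and adding a constant such as .120 to line drive percentage does not change its correlation with BABIP. A quick check of the arithmetic (the helper names are ours, not part of any library):

```python
import math


def implied_correlation(r_squared: float) -> float:
    """Correlation between actual and fitted values implied by R-squared."""
    return math.sqrt(r_squared)


def implied_r_squared(correlation: float) -> float:
    """Share of variance explained implied by a correlation."""
    return correlation ** 2


print(round(implied_correlation(0.348), 2))  # 0.59, the full model
print(round(implied_r_squared(0.18), 2))     # 0.03, line drives alone
```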

As mentioned above, all of our key independent variables are statistically significant at the 1 percent level. That is to say, if any one of these variables truly had no relationship with BABIP, there would be less than a 1 percent chance of seeing an estimated effect this large through random variation alone. Our regression results show positive effects for hitter’s eye, line drive percentage, speed score and pitches per plate appearance, all of which conform to common sense. On the other hand, we find negative coefficients on pitches per extra-base hit, fly ball/ground ball ratio, spray and contact rate.

One might expect a higher contact rate to lead to a higher BABIP, but the opposite actually seems to be the case. This is likely caused by the correlation between strikeouts and power, since players who swing hard tend to either miss entirely or crush the ball for hits. If this theory is reflected in our data, it makes sense that we would expect a player with a lower contact rate to generate a higher predicted BABIP. This is consistent with Studeman’s follow-up work on BABIP.

### What does it mean?

Okay, now you know what we did. Let’s discuss what it means.

We’ve developed a new and better way of finding a batter’s expected BABIP. We will call our model’s predicted BABIP “xBABIP.” The old way of estimating expected BABIP, LD% + .120, we will call “old-xBABIP.”

The idea is to separate skill from variance. To the extent that the model captures a batter’s skill at getting hits on balls in play, most of the gap between a batter’s actual BABIP and his xBABIP is likely due to random fluctuation, and is therefore unlikely to be repeated.

We can test this theory by looking to the past. Let’s examine the players whose actual 2007 BABIPs differed most from their xBABIPs, and then look at what happened in 2008. If the gaps are mostly luck, these players shouldn’t consistently under- or over-perform their xBABIP.

```
YEAR  NAME             BABIP  xBABIP   Diff  YEAR  NAME             BABIP  xBABIP   Diff
2007  Ramon Vazquez     .258    .322  -.063  2008  Ramon Vazquez     .342    .322   .020
2007  John Buck         .233    .283  -.049  2008  John Buck         .269    .282  -.013
2007  Bobby Crosby      .253    .303  -.050  2008  Bobby Crosby      .275    .276  -.001
2007  Julio Lugo        .258    .309  -.051  2008  Julio Lugo        .312    .284   .029
2007  Ray Durham        .231    .276  -.044  2008  Ray Durham        .342    .313   .029
2007  Lyle Overbay      .270    .321  -.051  2008  Lyle Overbay      .311    .310   .001
2007  Rickie Weeks      .270    .321  -.050  2008  Rickie Weeks      .266    .294  -.028
2007  Dioner Navarro    .243    .286  -.043  2008  Dioner Navarro    .313    .303   .010
2007  Brad Wilkerson    .269    .316  -.046  2008  Brad Wilkerson    .267    .279  -.011
2007  Jay Payton        .261    .299  -.038  2008  Jay Payton        .266    .304  -.038
2007  Adam Lind         .265    .303  -.038  2008  Adam Lind         .313    .302   .011
2007  Ian Kinsler       .267    .305  -.038  2008  Ian Kinsler       .325    .295   .030
2007  Nick Punto        .251    .285  -.034  2008  Nick Punto        .331    .304   .027
2007  Dan Uggla         .268    .304  -.036  2008  Dan Uggla         .313    .294   .018
```

Wow, that’s pretty compelling evidence for the model. We didn’t cherry-pick these, either: these were the “unluckiest” hitters of 2007 who also had enough plate appearances to qualify for our model in 2008. Only Rickie Weeks and Jay Payton remained well below their xBABIP in 2008, while everyone else had a 2008 BABIP that was either very close to his xBABIP or above it. Had we seen these numbers after 2007, we might have been able to predict the rise of Vazquez, Navarro, Lind, Kinsler and Uggla, all of whom seemingly “came out of nowhere” in 2008.
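The same check can be run mechanically on the Diff columns from the table of 2007’s unluckiest hitters. The snippet below simply counts how many of the 14 gaps shrank in 2008; the 20-point cutoff for “well below” is an arbitrary threshold chosen for this illustration, not part of the model.

```python
# (name, 2007 Diff, 2008 Diff), copied from the table of 2007's unluckiest hitters.
unlucky_2007 = [
    ("Ramon Vazquez", -0.063, 0.020),
    ("John Buck", -0.049, -0.013),
    ("Bobby Crosby", -0.050, -0.001),
    ("Julio Lugo", -0.051, 0.029),
    ("Ray Durham", -0.044, 0.029),
    ("Lyle Overbay", -0.051, 0.001),
    ("Rickie Weeks", -0.050, -0.028),
    ("Dioner Navarro", -0.043, 0.010),
    ("Brad Wilkerson", -0.046, -0.011),
    ("Jay Payton", -0.038, -0.038),
    ("Adam Lind", -0.038, 0.011),
    ("Ian Kinsler", -0.038, 0.030),
    ("Nick Punto", -0.034, 0.027),
    ("Dan Uggla", -0.036, 0.018),
]

WELL_BELOW = -0.020  # illustrative cutoff: 20 or more points under xBABIP

# Players whose gap between actual and expected BABIP got smaller in 2008.
shrank = [name for name, d07, d08 in unlucky_2007 if abs(d08) < abs(d07)]

# Players who were still well short of their xBABIP in 2008.
well_below = [name for name, _, d08 in unlucky_2007 if d08 <= WELL_BELOW]

print(f"{len(shrank)} of {len(unlucky_2007)} gaps shrank in 2008")
print("Still well below xBABIP:", ", ".join(well_below))
```

Thirteen of the 14 gaps shrank, with Payton’s unchanged, which is what we would expect if most of the 2007 shortfall was luck rather than a skill the model misses.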

And what about hitters who were particularly lucky in 2007?

```
YEAR  NAME             BABIP  xBABIP   Diff  YEAR  NAME             BABIP  xBABIP   Diff
2007  Matt Kemp         .411    .301   .110  2008  Matt Kemp         .359    .312   .047
2007  Ichiro Suzuki     .384    .317   .067  2008  Ichiro Suzuki     .330    .307   .023
2007  Willy Taveras     .355    .293   .062  2008  Willy Taveras     .282    .292  -.010
2007  Magglio Ordonez   .379    .315   .064  2008  Magglio Ordonez   .331    .303   .028
2007  Howie Kendrick    .374    .314   .060  2008  Howie Kendrick    .351    .316   .035
2007  Jayson Werth      .380    .322   .058  2008  Jayson Werth      .319    .314   .005
2007  Mark Reynolds     .368    .313   .055  2008  Mark Reynolds     .319    .304   .014
2007  Edgar Renteria    .373    .319   .053  2008  Edgar Renteria    .289    .301  -.012
2007  Mike Lowell       .335    .288   .047  2008  Mike Lowell       .278    .282  -.003
2007  Ryan Braun        .353    .304   .050  2008  Ryan Braun        .301    .287   .014
2007  David Ortiz       .352    .306   .046  2008  David Ortiz       .269    .302  -.033
2007  Jose Vidro        .333    .290   .043  2008  Jose Vidro        .242    .290  -.048
2007  B.J. Upton        .387    .338   .048  2008  B.J. Upton        .340    .340   .000
2007  Luis Castillo     .318    .284   .034  2008  Luis Castillo     .258    .245   .013
```

Again, the results are good, although more mixed. Kemp, Ichiro and Kendrick again significantly beat their xBABIP in 2008. Interestingly, Ichiro and Kendrick are both known to be unique hitters. Does Matt Kemp do anything differently than most other hitters?

But nearly all of the “lucky” players in 2007 regressed in 2008. The model predicted the downfall of Renteria, Taveras, Vidro (although he was also quite unlucky in 2008) and Castillo. It correctly predicted a return to earth for Upton, Reynolds, Ortiz, Braun and Lowell.

Next, let’s look at hitters for whom xBABIP disagreed strongly with old-xBABIP. Here are the top cases where old-xBABIP overrated players in 2008:

```
YEAR  NAME                BABIP        xBABIP       old-xBABIP
2008  Brian Schneider     .275         .289         .355
2008  Ryan Ludwick        .333         .325         .404
2008  Kevin Millar        .244         .257         .311
2008  Jesus Flores        .309         .293         .353
2008  Omar Infante        .323         .294         .356
2008  Joey Gathright      .278         .156         .215
2008  Jose Lopez          .302         .286         .346
2008  Khalil Greene       .251         .276         .336
2008  Cesar Izturis       .271         .286         .344
2008  Todd Helton         .296         .312         .369
2008  John Bowker         .296         .308         .364
2008  Damion Easley       .278         .262         .317
2008  Paul Konerko        .243         .280         .330
2008  Clint Barmes        .322         .308         .360
2008  Freddy Sanchez      .282         .304         .356
2008  Jack Wilson         .284         .299         .348
2008  Omar Vizquel        .239         .277         .326
2008  Dioner Navarro      .313         .303         .352
2008  Xavier Nady         .327         .321         .370
```

For these players, the old guideline would lead you to believe that the players had been rather unlucky this season. However, our new model shows that these players were far less unlucky than previously thought. In other words, simply using line-drive percentage to predict BABIP overrated these players.
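To see how differently the two models read the same season, we can compute “luck” (actual minus expected BABIP) under each one. This sketch uses the first three rows of the table above; the function names are ours, and the line drive percentage in the last line is simply backed out of Schneider’s old-xBABIP for illustration.

```python
OLD_MODEL_CONSTANT = 0.120  # the LD% + .120 guideline


def old_xbabip(ld_pct: float) -> float:
    """Expected BABIP under the old line-drive-only guideline."""
    return ld_pct + OLD_MODEL_CONSTANT


def luck(actual: float, expected: float) -> float:
    """Actual minus expected BABIP, rounded to three decimals."""
    return round(actual - expected, 3)


# (name, BABIP, xBABIP, old-xBABIP), copied from the table above.
rows = [
    ("Brian Schneider", 0.275, 0.289, 0.355),
    ("Ryan Ludwick", 0.333, 0.325, 0.404),
    ("Kevin Millar", 0.244, 0.257, 0.311),
]

for name, actual, new, old in rows:
    print(f"{name:<16} old model: {luck(actual, old):+.3f}   "
          f"new model: {luck(actual, new):+.3f}")

# A .235 line drive rate reproduces Schneider's old-xBABIP of .355.
print(round(old_xbabip(0.235), 3))
```

Under the old guideline Schneider looks about 80 points unlucky; under the regression model the gap is 14 points.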

And the players that were most underrated by the old model:

```
YEAR  NAME                BABIP        xBABIP       old-xBABIP
2008  Gary Matthews Jr.   .289         .307         .252
2008  Hunter Pence        .298         .290         .236
2008  Jeff Mathis         .231         .269         .217
2008  Alexi Casilla       .288         .281         .235
2008  Fred Lewis          .365         .336         .293
2008  Carlos Gomez        .324         .301         .260
2008  Delmon Young        .334         .306         .268
2008  Nick Punto          .331         .304         .267
2008  Jacoby Ellsbury     .305         .326         .290
2008  Lance Berkman       .336         .309         .273
2008  Rickie Weeks        .266         .294         .260
2008  Denard Span         .328         .338         .306
2008  Michael Bourn       .283         .277         .246
2008  Yunel Escobar       .303         .296         .265
2008  Erick Aybar         .297         .304         .274
2008  Brendan Harris      .312         .297         .273
2008  Jason Varitek       .270         .295         .272
2008  Coco Crisp          .308         .321         .298
2008  Howie Kendrick      .351         .316         .294
```

For the most part, these players’ actual BABIPs were closer to our model’s xBABIP than to old-xBABIP. In other words, old-xBABIP may think that Alexi Casilla got lucky, but our model suggests he hit in line with expectations. Simply using line-drive percentage to predict BABIP underrated these players.

Finally, let’s take a look at the players who were the luckiest and unluckiest this season. We’d expect many of these players to regress in 2009. Not all of them will, as some are simply going to get lucky or unlucky again. However, we can be confident that most of these players will experience regression in ’09.

```
YEAR  NAME                 BABIP   xBABIP  Diff
2008  Joey Gathright       .278    .156    .122
2008  Chipper Jones        .382    .325    .058
2008  Matt Kemp            .359    .312    .047
2008  Ryan Theriot         .335    .291    .044
2008  Felipe Lopez         .324    .287    .037
2008  Milton Bradley       .375    .334    .041
2008  Aaron Miles          .337    .301    .037
2008  Yadier Molina        .307    .274    .033
2008  Shin-soo Choo        .359    .320    .039
2008  Geovany Soto         .331    .295    .036
2008  Mike Aviles          .355    .317    .038
2008  Reed Johnson         .338    .302    .036
2008  Jason Bay            .318    .285    .033
2008  Chone Figgins        .329    .295    .034
2008  Chase Headley        .356    .319    .036
2008  Howie Kendrick       .351    .316    .035
2008  Edgar V Gonzalez     .335    .302    .033
2008  Ryan Doumit          .328    .297    .031
2008  Manny Ramirez        .360    .326    .034
2008  Aaron Rowand         .318    .288    .029
```

Unsurprisingly, this list includes a lot of 2008’s surprises: Bradley, Miles, Aviles, Doumit, Choo and Lopez. Interestingly, Gathright’s xBABIP of .156 was nearly 90 points lower than the next-lowest player’s (and remember, we do take speed into account in the model). Maybe Geovany Soto isn’t quite this good. Perhaps Manny Ramirez and Milton Bradley will disappoint whoever signs them. The Cardinals’ middle infielders aren’t as good as they seemed.

And 2008’s unluckiest hitters:

```
YEAR  NAME                 BABIP   xBABIP  Diff
2008  Brandon Inge         .229    .292   -.063
2008  Corey Patterson      .210    .262   -.051
2008  Carlos Ruiz          .230    .282   -.052
2008  Willy Aybar          .261    .314   -.054
2008  Jason Giambi         .234    .282   -.048
2008  Nick Swisher         .245    .294   -.049
2008  Jose Vidro           .242    .290   -.048
2008  Kenji Johjima        .226    .266   -.040
2008  Austin Kearns        .242    .284   -.042
2008  Jeff Mathis          .231    .269   -.038
2008  Omar Vizquel         .239    .277   -.038
2008  Adrian Beltre        .275    .319   -.044
2008  Mike Jacobs          .259    .300   -.040
2008  Paul Konerko         .243    .280   -.038
2008  Brandon Boggs        .296    .342   -.046
2008  Jim Edmonds          .246    .283   -.037
2008  Eric Hinske          .267    .306   -.038
2008  Willie Harris        .268    .306   -.038
2008  Jay Payton           .266    .304   -.038
2008  Gabe Gross           .272    .308   -.036
```

Some team is going to get a steal in Jason Giambi. Willy Aybar is deserving of full-time action. Nick Swisher and Austin Kearns are a lot better than they showed in 2008. Would you believe that Brandon Boggs had the highest xBABIP in 2008 of any player in our database? Jim Edmonds may not be done quite yet. Adrian Beltre is very underrated.

While our model cannot explain all of the variation in BABIP, we believe that it is an improvement over current explanations of BABIP, as it takes into account many factors that influence a hitter’s BABIP. By finding players who over- and under-performed their expected BABIP, we can further isolate skill from luck, and infer that players such as Mike Aviles are likely to regress and players such as Nick Swisher are likely to improve.

### References & Resources

We owe a tremendous amount of thanks to Leanne, Dave, Jeremy, Steven and Kevin, who actively conducted the research with us as part of the Research Committee of Baseball Analysis at Tufts (BAT). The Committee, headed by Dutton, met once a week throughout the 2007/2008 academic year to discuss and conduct research, as well as analyze the results.

BAT, founded by Bendix and Matt Gallagher in 2005, is the first baseball analysis club on a college campus. It has hosted such speakers as Bill James, Alan Schwarz, Keith Law, sportswriters from the Boston Globe and more. It continues to host various speakers and events, as well as provide a forum for intelligent baseball discussion and research on the Tufts campus. For more information, please contact Peter Bendix or Chris Dutton.
