We’ve all seen it: a deep fly ball that seems destined for the bleacher seats, only to end up in the center fielder’s glove. Or the other way around, where a high fly ball keeps carrying because of the wind. Let’s face it: home runs are not created equal, and at times they are not a good display of a player’s power or batting skills. We equate overpowering shots to right-center field by Prince Fielder with balls that graze the more-than-generous right field wall of Yankee Stadium. What I mean is, more variables than pure distance determine whether or not a fly ball becomes a home run.
With this in mind, we can run a regression model to estimate the probability that a fly ball will turn out to be a home run. I received a large data set (many thanks to Greg Rybarczyk at Hit Tracker) that spans the 2006-2008 seasons of three players (Adam Dunn, Manny Ramirez and Jason Bay). The data set includes observed and calculated measurements (similar to Hit Tracker’s own data, e.g., True Distance and Elevation Angle) on every long fly ball the players hit, totaling a tad over 700 observations. Also included are the ballpark, the date and time, and the outcome of the play (single, double, home run, out, etc.).
As you can see from the graph above, the outcome of the play isn’t so clear when we know only the ball’s elevation angle and the distance it traveled. The outcomes are scattered throughout, so we cannot conclude any real association. I superimposed two boxes to show how similar balls can have different outcomes. In the case of the right-side box, a slightly different elevation angle could mean the difference between a home run and a fly out.
In this next plot we’re seeing the outcomes of the play split into two events: a home run or a non-home run (the latter coded as zero and shown as the orange points). Both smoothed curves have the same shape, with the home run curve reaching farther distances on average. However, the smoothed curves hide how much the blue and orange points are still intermixed. Knowing only the ball’s angle and distance, the likelihood of a home run is hard to pin down.
For those who are statistically inclined, I used a multiple logistic regression model to look for a pattern that predicts home runs. This model essentially spits out the probability of a ball becoming a home run, given a set of variables. The equation used for this type of modeling is:
Prob(HR) = 1 / (1 + exp(-z))
where z = b0 + b1*x1 + ... + bk*xk is a linear combination of the predictor variables, and the coefficients b0 through bk are estimated from the data (see below).
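To make the formula concrete, here is a minimal Python sketch of the logistic function. The function name and the sample values of z are my own illustrations, not output from the fitted model; the point is only to show how a linear score z gets squeezed into a probability between 0 and 1.

```python
import math


def logistic(z: float) -> float:
    """Map a linear score z (log-odds) to a probability in [0, 1]."""
    # Branch on the sign of z so exp() never receives a large positive
    # argument, which would overflow for extreme scores.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative scores only (not taken from the fitted model).
for z in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"z = {z:+.1f} -> Prob(HR) = {logistic(z):.3f}")
```

A score of zero corresponds to a coin flip, and the curve flattens out quickly in both directions, so scores far from zero produce probabilities very close to 0 or 1.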
The variables I used include the true distance the ball reached, time in the air, speed off the bat (SOB), elevation angle, horizontal angle (imagine a baseball field where a larger angle equates to left field, etc.) and apex of the ball (the highest vertical point the ball reached). After trimming the model to keep only the statistically significant variables, the 95 percent confidence intervals for the final coefficients looked like this:
                     2.5 %        97.5 %
(Intercept)   -69.55676316  -48.86556510
True.Dist.      0.08764763    0.12669753
Time           -9.12581512   -6.25056613
Elev Angle      0.85324327    1.24073619
SOB             0.16324949    0.33018947
Horiz Angle    -0.03296355   -0.00181532
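The intervals in the table are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate each endpoint, which gives an interval for the odds ratio: the factor by which the odds of a home run change for a one-unit increase in that variable, holding the others fixed. The sketch below does that in Python for the reported intervals. The dictionary and function names are my own, and the units of each variable are whatever the data set used, which I am not assuming here.

```python
import math

# Confidence intervals exactly as reported in the table (log-odds scale).
CI_LOG_ODDS = {
    "True.Dist.": (0.08764763, 0.12669753),
    "Time": (-9.12581512, -6.25056613),
    "Elev Angle": (0.85324327, 1.24073619),
    "SOB": (0.16324949, 0.33018947),
    "Horiz Angle": (-0.03296355, -0.00181532),
}


def odds_ratio_interval(lo: float, hi: float) -> tuple[float, float]:
    """Convert a log-odds interval into an odds-ratio interval."""
    return math.exp(lo), math.exp(hi)


ODDS_RATIOS = {
    name: odds_ratio_interval(lo, hi) for name, (lo, hi) in CI_LOG_ODDS.items()
}

for name, (lo, hi) in ODDS_RATIOS.items():
    # An interval entirely above 1 raises the odds; entirely below 1 lowers them.
    if lo > 1.0:
        direction = "raises"
    elif hi < 1.0:
        direction = "lowers"
    else:
        direction = "unclear effect on"
    print(f"{name:12s} odds ratio {lo:.4g} to {hi:.4g} ({direction} HR odds)")
```

Read this way, each additional unit of true distance multiplies the odds of a home run by roughly 1.09 to 1.14, while every interval that sits entirely below zero in the table turns into an odds ratio below 1.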
From this we’re seeing that Distance, Elevation Angle and Speed off the Bat are all positively associated with home runs, while hang time of the ball and horizontal angle are negatively associated, in each case holding the other variables fixed. This is not breaking news to baseball fans; any physicist could probably come to the same conclusions. It may be a better use of time to focus on the less tangible effects on fly balls, such as temperature, wind, and altitude above sea level (i.e., the Coors Field effect). The next regression I use considers all the fly ball variables plus these uncontrolled park variables. Surprisingly, running this regression shows all variables as statistically significant. Better yet, the AIC (which balances how well the model fits against how many variables it uses; lower is better) from this model is lower than the original. Remarkable stuff!
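For readers curious about how an AIC comparison works, here is a small Python sketch. AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The outcomes and fitted probabilities below are toy numbers I made up for illustration, not values from the fly ball data, and the parameter counts are likewise hypothetical.

```python
import math


def bernoulli_log_likelihood(outcomes, probs):
    """Log-likelihood of 0/1 outcomes given fitted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probs)
    )


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Toy data (made up): 1 = home run, 0 = not a home run.
outcomes = [1, 0, 1, 1, 0, 0]
probs_small = [0.6, 0.4, 0.6, 0.6, 0.4, 0.4]  # hypothetical 2-parameter model
probs_large = [0.9, 0.1, 0.8, 0.9, 0.2, 0.1]  # hypothetical 4-parameter model

aic_small = aic(bernoulli_log_likelihood(outcomes, probs_small), n_params=2)
aic_large = aic(bernoulli_log_likelihood(outcomes, probs_large), n_params=4)

print(f"AIC, smaller model: {aic_small:.2f}")
print(f"AIC, larger model:  {aic_large:.2f}")
```

The larger model pays a bigger penalty for its extra parameters, so it only ends up with the lower AIC when its fit improves enough to outweigh that penalty.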
Now what about using these findings to compare hitters? Maybe instead of looking at home run totals, we can look at each fly ball independently and project its final outcome. As you can see from the graph above, I plotted the fitted values from the model split by each hitter. Most of the fitted values sit near the extremes of 0 and 1, which is typical of a logistic model. As for comparing hitters, it looks like Adam Dunn has the most fly balls predicted to become home runs, while Jason Bay is on the other side of the fence.
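One simple way to boil the fitted values down to a single number per hitter is to average them. The Python sketch below shows the arithmetic; the hitter labels and fitted values are placeholders I invented for illustration, not the model's actual output.

```python
from collections import defaultdict


def mean_fitted_by_hitter(rows):
    """Average fitted HR probabilities per hitter.

    rows: iterable of (hitter, fitted_probability) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hitter, prob in rows:
        totals[hitter] += prob
        counts[hitter] += 1
    return {hitter: totals[hitter] / counts[hitter] for hitter in totals}


# Placeholder fitted values (made up), one row per fly ball.
toy_rows = [
    ("Hitter A", 0.98),
    ("Hitter A", 0.02),
    ("Hitter A", 0.95),
    ("Hitter B", 0.05),
    ("Hitter B", 0.90),
]

means = mean_fitted_by_hitter(toy_rows)
for hitter, mean_prob in means.items():
    print(f"{hitter}: mean predicted HR probability {mean_prob:.1%}")
```

Because most fitted values sit near 0 or 1, this average behaves much like the share of a hitter's fly balls that the model expects to leave the park.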
In case you were wondering, here is each player’s mean HR prediction for fly balls: Adam Dunn (~50.4 percent), Manny Ramirez (~39.2 percent) and Jason Bay (~34.0 percent). No surprises there really. The order of these hitters’ HR predictions coincides with their career HR/FB rates and our general notion of their hitting styles. Dunn swings for the fences or strikes out otherwise, while Manny and Bay make more use of the entire field (though that may be an optimistic statement about Manny’s capabilities nowadays). While Hit Tracker does an excellent job of telling us how far home runs really went, and what park and weather factors affected the ball’s flight, we don’t really have an idea of what effect those factors had on non-home run fly balls. Though I am speculating, maybe research on this topic will increase once the data from Field F/X (previewed in The Hardball Times Baseball Annual 2011) is published. Aladdin was right: a whole new world is out there…
References & Resources
Many thanks go out to Greg Rybarczyk for letting me look at this data set.