Toward a Probability Distribution Over Batted-Ball Trajectories

This heatmap shows the joint probability density function of launch angle and exit velocity. (via Scott Powers)

Editor’s Note: This piece was initially given as a presentation at the marvelous 2016 Saberseminar.

If you haven’t noticed, the first tip of the Statcast iceberg has been made available to the public at Baseball Savant. You can download pitch-by-pitch data in .csv format, including pitch metrics like spin rate and spin axis. But this article is about the two types of batted ball data available there: exit velocity and launch angle.

Exit velocity is the speed of the ball off the bat and it corresponds to distance from the origin in the polar coordinate plot above. Launch angle is the vertical angle of the trajectory of the baseball after contact. The above figure is a heatmap showing the joint probability density function of launch angle and exit velocity, across all recorded batted balls this regular season through July. Red and orange represent trajectories that are more common; blue and green represent trajectories that are less common. Actually, the pic above is slightly cropped. Let’s see the full version:
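
To make the polar-coordinate reading concrete, here is a minimal sketch of how a single batted ball might be placed on such a plot. The function name is hypothetical, and it assumes the launch angle is measured from the horizontal axis, which the article does not state explicitly.

```python
import math

def trajectory_to_plot_xy(exit_velocity_mph, launch_angle_deg):
    """Map (exit velocity, launch angle) to Cartesian plot coordinates.

    Assumption: exit velocity is the radius and launch angle is measured
    counterclockwise from the positive horizontal axis.
    """
    theta = math.radians(launch_angle_deg)
    x = exit_velocity_mph * math.cos(theta)
    y = exit_velocity_mph * math.sin(theta)
    return x, y

# A 100 mph ball hit perfectly level sits 100 units to the right of the origin.
level_x, level_y = trajectory_to_plot_xy(100.0, 0.0)
# A 100 mph ball hit straight up sits 100 units above the origin.
up_x, up_y = trajectory_to_plot_xy(100.0, 90.0)
```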

[Figure: full, uncropped heatmap of the joint density of launch angle and exit velocity]

While the sabermetric community has made progress on estimating the run value of a batted ball as a function of its trajectory, I have not found anything written in the public sphere about the probability distribution of trajectory as a random variable. The goal of this article is to predict, for any batter-pitcher matchup, the distribution of possible batted ball trajectories.

The value of this endeavor is twofold. First, such a model could be used to evaluate batters and pitchers on the basis of batted balls, while simultaneously controlling for sample size, park effects and opponent quality. Second, and more importantly, the predicted trajectory distribution could inform fielder positioning.

Because the publicly available Statcast data are still relatively new, many of the results below are of an exploratory nature. This is a first attempt toward a probability distribution over batted ball trajectories, and hopefully more refined work will follow.

The Strategy

As a batter, you can’t hit the ball hard (high exit velocity) without making solid contact. Launch angle is, in some sense, a measure of quality of contact. Angles of 50 or -30 degrees, for example, signal poor contact. The angle says something about what the speed off the bat is likely to be, so the distributions of launch angle and exit velocity must be modeled jointly, not separately.

The strategy is first to predict the distribution of launch angle for the batter-pitcher matchup and second to predict the conditional distribution of exit velocity given the launch angle. The joint distribution, then, is just the product of these two distributions.

Focusing first on launch angle, the figures below show that this variable adheres very closely to a normal distribution, across all batted balls in the dataset. In fact, I have found that distributions of launch angle on a batter-by-batter or pitcher-by-pitcher basis are also consistent with a normal distribution, as are residuals after controlling for batter, pitcher and other factors.

[Figures: histogram of launch angle; normal Q-Q plot of launch angle]

My conclusion is that a normal distribution is appropriate to model the randomness in launch angle. I find this to be a very pleasant surprise because I would have expected the cosine or some other transformation of the angle to follow a normal distribution more closely.
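
For readers who want to repeat this kind of check, the sketch below computes the points of a normal Q-Q plot using only the Python standard library. The data are simulated, not Statcast, and the mean of 11 and standard deviation of 24 degrees are placeholder values chosen for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i + 0.5) / n avoid the infinite 0 and 1 quantiles.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

random.seed(0)
# Simulated launch angles (placeholder parameters, not real data).
angles = [random.gauss(11.0, 24.0) for _ in range(5000)]
points = qq_points(angles)
```

If the sample is close to normal, the pairs fall near the line where theoretical and observed quantiles are equal; systematic curvature in the tails would suggest trying a transformation.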

Given that it is a normal distribution we are dealing with, the task is to estimate the mean (center) and variance (spread) given the batter and pitcher involved. Assuming these two parameters can be estimated by a linear combination of variables (an assumption I will validate later), this amounts to generalized least squares with unknown variance.

Rather than writing down and optimizing a likelihood function, here I will go with the simpler feasible generalized least squares from the econometrics literature, which in this case is a three-step procedure:

  1. Use ordinary least squares to estimate the expected angle for each batted ball, leading directly to the residual for each ball, which is the difference between observed and expected angle.
  2. Using regularization in the form of a ridge penalty, regress the squared residuals on the same variables as in Step 1. This gives an estimate of the variance of the angle of each batted ball.
  3. Solve the generalized least squares problem regressing angle on the same variables as in Steps 1 and 2, using the estimated variances from Step 2. As in Step 2, apply regularization with a ridge penalty.
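
The three steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code behind the article: the function names are hypothetical, column 0 of the design matrix is assumed to be an unpenalized intercept, the weights are rescaled to have mean 1 so the ridge penalty stays on a comparable scale, and predicted variances are floored because a linear model for the variance can go negative.

```python
import numpy as np

def ridge_solve(X, y, penalty, weights=None):
    """Weighted ridge fit; column 0 of X is assumed to be an unpenalized intercept."""
    w = np.ones(len(y)) if weights is None else weights / weights.mean()
    P = penalty * np.eye(X.shape[1])
    P[0, 0] = 0.0                       # leave the intercept unpenalized
    Xw = X * w[:, None]                 # each row of X scaled by its weight
    return np.linalg.solve(X.T @ Xw + P, Xw.T @ y)

def fgls(X, y, penalty=1.0):
    """Feasible generalized least squares in three steps."""
    # Step 1: ordinary least squares for the mean, then residuals.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    # Step 2: ridge regression of squared residuals on the same variables.
    gamma = ridge_solve(X, resid ** 2, penalty)
    floor = 0.01 * np.mean(resid ** 2)  # guard against non-positive variances
    var_hat = np.maximum(X @ gamma, floor)
    # Step 3: ridge regression of the response, weighted by 1 / variance.
    beta = ridge_solve(X, y, penalty, weights=1.0 / var_hat)
    return beta, gamma

# Simulated check: two groups with different means and spreads.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=4000)
X = np.column_stack([np.ones(4000), group])
sd = np.where(group == 1, 30.0, 20.0)
y = 10.0 + 5.0 * group + rng.normal(0.0, sd)
beta, gamma = fgls(X, y)
```

In this simulation `beta` should land near the true mean coefficients (10 and 5), and `gamma` should imply variances near 400 and 900 for the two groups.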

Specifically, the variables on which I regress are the identity of the batter, the identity of the pitcher, the identity of the ballpark (for park effects), an indicator of whether the batter is on the home team and an indicator of whether the batter has opposite handedness relative to the pitcher.
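
One way to turn these variables into regression inputs is to give each batter, pitcher and ballpark its own 0/1 column. The sketch below does this with plain Python lists. The field names (`batter`, `pitcher`, `park`, `batter_home`, `bats`, `throws`) and the two records are invented for illustration and are not the Baseball Savant column names.

```python
def build_design_matrix(batted_balls):
    """One-hot encode batter, pitcher and park; append the two indicator columns."""
    columns = ["intercept"]
    for key in ("batter", "pitcher", "park"):
        columns += sorted({f"{key}={ball[key]}" for ball in batted_balls})
    columns += ["batter_home", "opposite_hand"]
    index = {name: j for j, name in enumerate(columns)}

    rows = []
    for ball in batted_balls:
        row = [0.0] * len(columns)
        row[index["intercept"]] = 1.0
        for key in ("batter", "pitcher", "park"):
            row[index[f"{key}={ball[key]}"]] = 1.0
        row[index["batter_home"]] = float(ball["batter_home"])
        # Opposite handedness: batter's side differs from pitcher's throwing arm.
        row[index["opposite_hand"]] = float(ball["bats"] != ball["throws"])
        rows.append(row)
    return columns, rows

# Two made-up batted balls with placeholder identities.
balls = [
    {"batter": "batter_A", "pitcher": "pitcher_X", "park": "park_1",
     "batter_home": True, "bats": "L", "throws": "R"},
    {"batter": "batter_B", "pitcher": "pitcher_X", "park": "park_2",
     "batter_home": False, "bats": "R", "throws": "R"},
]
columns, rows = build_design_matrix(balls)
```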

For a batter-pitcher matchup, the model from Step 2 predicts the variance of launch angle, whose square root is the standard deviation, and the model from Step 3 predicts the mean launch angle. The next section presents the results of fitting the model from Step 2.

Launch Angle Standard Deviation

Carlos Gonzalez of the Rockies has an average launch angle of about 10 degrees. Coincidentally, 10 degrees is my average launch angle when I go to my local batting cage. But that’s because half the time I hit the ball up at 50 degrees and the other half of the time I hit the ball down at -30 degrees. This illustrates the importance of variation in launch angle.
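
The batting-cage example is easy to work through numerically. Treating it as an even split between the two angles:

```python
from statistics import mean, pstdev

# Half the balls at 50 degrees, half at -30 degrees.
cage_angles = [50, -30]
cage_mean = mean(cage_angles)   # 10 degrees, the same mean as in the example
cage_sd = pstdev(cage_angles)   # 40.0 degrees of standard deviation
```

The mean matches, but a standard deviation of 40 degrees is far wider than the 22 to 27 degree range estimated for the players in the table below.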

HIGHEST AND LOWEST LAUNCH ANGLE VARIATIONS
Highest 5           Angle S.D. (degrees)    Lowest 5            Angle S.D. (degrees)
Todd Frazier        26.7                    Joey Votto          21.9
Maikel Franco       26.6                    Jon Jay             22.1
Kevin Plawecki      26.6                    Nick Castellanos    22.3
Steven Wright (P)   26.4                    Starlin Castro      22.3
Kevin Kiermaier     26.4                    DJ LeMahieu         22.3
(P) = pitcher

The table above gives estimated launch angle standard deviation for each batter (against an average pitcher) and pitcher (against an average batter). The top five and bottom five are shown. Joey Votto has the lowest variation in launch angle, and his former Reds teammate, Todd Frazier, has the highest.

Steven Wright is highlighted because he is the only pitcher to appear in this table. It makes sense that a knuckleballer would be the top pitcher in terms of standard deviation in batted balls against.

Mean Launch Angle

This section presents the results of fitting the model in Step 3 of feasible generalized least squares, described above. The table below gives the top five and bottom five batters and pitchers by expected launch angle when facing an average pitcher or batter, respectively.

HIGHEST & LOWEST EXPECTED LAUNCH ANGLE WHEN FACING AN AVG. OPPONENT
Highest 5             Mean Angle (degrees)    Lowest 5               Mean Angle (degrees)
Ryan Buchter (P)      23.4                    Christian Yelich       3.9
Zach McAllister (P)   22.1                    Cameron Maybin         4.0
Koji Uehara (P)       21.0                    Jeremy Jeffress (P)    4.2
Bryan Holaday         21.0                    Jeurys Familia (P)     4.2
Nolan Arenado         20.7                    Marcus Stroman (P)     4.3
(P) = pitcher

All pitchers are highlighted, and we observe that there are more pitchers in this table than in the previous one. Intuitively, this makes sense because pitchers have more control over the trajectories of batted balls against them based on what types of pitches they throw and where they locate them.

A key assumption that I have made in this model is additivity. Under this assumption, for example, a batter whose average launch angle is 5 degrees above average and a pitcher whose average launch angle allowed is 5 degrees above average would be expected to produce a launch angle 10 degrees above average if they faced each other. We can check the validity of this assumption by plotting residuals from the model against predictions. If the assumption is wrong, we would expect an upward or downward trend.

[Figure: mean residual by predicted launch angle, zoomed]

The figure above shows the results of aggregating all batted balls by predicted launch angle and averaging the residuals, with 95 percent confidence intervals for the mean. We see evidence of a slight upward trend between predictions and residuals, suggesting slight sub-additivity. But it is a small effect, so I am content with concluding that the additivity assumption is not terribly off-base.
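
The aggregation behind a diagnostic like this can be sketched as follows. The one-degree bin width, the normal-approximation interval (mean plus or minus 1.96 standard errors) and the function name are assumptions for illustration; the article does not say how the bins or intervals were formed.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def binned_residuals(predicted, observed, bin_width=1.0):
    """Average residuals (observed - predicted) within bins of predicted value.

    Returns a list of (bin_left_edge, mean_residual, ci_low, ci_high) tuples.
    """
    bins = defaultdict(list)
    for p, o in zip(predicted, observed):
        left_edge = math.floor(p / bin_width) * bin_width
        bins[left_edge].append(o - p)

    summary = []
    for left_edge in sorted(bins):
        resid = bins[left_edge]
        if len(resid) < 2:
            continue  # a standard error needs at least two observations
        center = mean(resid)
        half_width = 1.96 * stdev(resid) / math.sqrt(len(resid))
        summary.append((left_edge, center, center - half_width, center + half_width))
    return summary

# Tiny made-up example: two bins of two batted balls each.
summary = binned_residuals([10.0, 10.5, 11.0, 11.5], [12.0, 8.5, 14.0, 11.5])
```

A flat sequence of bin means near zero supports additivity; a steady rise or fall across bins is the trend described above.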

The additivity assumption is appealing because we know from The Book that fly ball hitters struggle against fly ball pitchers, and ground ball hitters struggle against ground ball pitchers. This is consistent with the hypothesis that facing an opponent who tends to produce the same type of trajectory will lead to even more extreme trajectories.

Mean Exit Velocity

Now that we’ve gotten launch angle out of the way, let’s move on to exit velocity.

The figure below shows for each launch angle the average transformed exit velocity, with 95 percent confidence intervals. The black curve has a cosine shape and fits the data very well between roughly -35 and 45 degrees, accounting for 87 percent of all batted balls. The fit is poor outside of this region, but the standard errors are much higher there, too.

[Figure: average transformed exit velocity by launch angle, with fitted cosine curve]

Based on the above, to account for launch angle when modeling exit velocity, we’ll include an additional linear term in our model for the cosine of the difference between the launch angle and 10 degrees. Otherwise, we fit the same ridge regression as we did for angle mean and variance in the previous two sections, only now we use exit velocity as our response variable.
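
The extra regression term is simple to compute. In this sketch the function and constant names are hypothetical; only the 10-degree offset comes from the text.

```python
import math

PEAK_ANGLE_DEG = 10.0  # offset used in the article's cosine term

def cosine_feature(launch_angle_deg, peak=PEAK_ANGLE_DEG):
    """Cosine of the difference between launch angle and the peak angle."""
    return math.cos(math.radians(launch_angle_deg - peak))

at_peak = cosine_feature(10.0)     # largest possible value, 1.0
at_low = cosine_feature(-35.0)     # 45 degrees below the peak
at_high = cosine_feature(55.0)     # 45 degrees above the peak, same value
```

Because the feature is symmetric around 10 degrees, a positive coefficient on it means expected exit velocity falls off equally for balls hit the same distance above or below that angle.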

EXPECTED EXIT VELOCITY AGAINST AVERAGE COMPETITION*
Highest 5            Mean Speed (mph)    Lowest 5          Mean Speed (mph)
Giancarlo Stanton    99.0                Billy Burns       87.5
Mark Trumbo          98.9                Billy Hamilton    88.2
Nelson Cruz          98.5                Dee Gordon        88.5
Matt Holliday        98.0                Jose Iglesias     88.7
Ryan Zimmerman       97.6                Jarrod Dyson      88.9
* Conditional on a 10 degree launch angle

The table above shows some of the results of fitting this model. For each player, the table reports expected exit velocity against average opposition conditional on a 10 degree launch angle. This is a better measure of the “power” tool than average exit velocity, because it controls for the “contact” tool.

Giancarlo Stanton tops the list, which suggests that we must be doing something right. And as a whole, the players appearing on both sides of the table match intuition. No pitchers appear in the table, and this also makes sense because pitchers should exhibit less control over the pure power of the swing than batters do.

Putting It All Together

Now we have, for a given batter-pitcher matchup, estimators for the distribution of the launch angle and the conditional distribution of the exit velocity given the launch angle. The joint distribution, then, is just the product of these two distributions. Below are two extreme examples of estimated trajectory distributions for batter-pitcher matchups.
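
As a rough illustration of that product, the sketch below multiplies a normal density for launch angle by a conditional density for exit velocity. Two things here are assumptions rather than the article's specification: the conditional distribution is taken to be normal with constant spread, and every numeric parameter is a placeholder, not a fitted value.

```python
import math
from statistics import NormalDist

def joint_density(angle, velo, angle_mean, angle_sd,
                  velo_intercept, velo_slope, velo_sd):
    """f(angle, velo) = f(angle) * f(velo | angle)."""
    f_angle = NormalDist(angle_mean, angle_sd).pdf(angle)
    # Conditional mean of exit velocity depends on the cosine term in launch angle.
    cond_mean = velo_intercept + velo_slope * math.cos(math.radians(angle - 10.0))
    f_velo_given_angle = NormalDist(cond_mean, velo_sd).pdf(velo)
    return f_angle * f_velo_given_angle

# Sanity check with placeholder parameters: summing the density over a
# one-degree by one-mph grid should give a total close to 1.
mass = sum(
    joint_density(a, v, angle_mean=10.0, angle_sd=24.0,
                  velo_intercept=60.0, velo_slope=30.0, velo_sd=10.0)
    for a in range(-90, 111)
    for v in range(0, 161)
)
```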

[Figure: predicted trajectory distributions for Arenado vs. Young and Yelich vs. Stroman, with EwOBAcon]

Nolan Arenado is a fly ball hitter, and Chris Young is a fly ball pitcher. As expected, the predicted trajectory distribution from their faceoff puts a greater weight on fly balls. The same is true for Christian Yelich batting against Marcus Stroman, but for ground balls.

We didn’t need all of these Statcast data to tell us that ground ball hitters facing ground ball pitchers should hit ground balls. But we’ve quantitatively estimated the likelihood of each trajectory. Since we know how to estimate the expected wOBA for each trajectory, this means we can quantify the expected wOBA for each matchup. More correctly, we quantify the expected wOBAcon, or wOBA on contact.
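
In other words, the expected wOBAcon is the run-value function averaged over the trajectory distribution. A minimal sketch of that average on a grid follows; the uniform density and the two-level wOBA function are toy stand-ins invented for illustration, not estimates.

```python
def expected_value(density, value, angles, velocities, d_angle, d_velo):
    """Riemann-sum approximation of E[value] under a joint density on a regular grid."""
    total = sum(value(a, v) * density(a, v) for a in angles for v in velocities)
    return total * d_angle * d_velo

# Toy inputs: a uniform density on a 20-degree by 40-mph box...
def toy_density(angle, velo):
    return 1.0 / (20.0 * 40.0)

# ...and a made-up run value that depends only on launch angle.
def toy_woba(angle, velo):
    return 0.400 if angle > 10.0 else 0.300

angles = [a + 0.5 for a in range(0, 20)]          # cell midpoints 0.5 ... 19.5
velocities = [v + 0.5 for v in range(80, 120)]    # cell midpoints 80.5 ... 119.5
ewobacon = expected_value(toy_density, toy_woba, angles, velocities, 1.0, 1.0)
```

With half the mass on each side of 10 degrees, the toy result is the midpoint of the two values, .350. Swapping in a fitted density and a real run-value function would give the matchup-level figure.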

The predicted trajectory distributions above include EwOBAcon (expected wOBAcon). The league average wOBAcon this season is around .370, so the Arenado-Young (.340) and Yelich-Stroman (.296) matchups both lead to below-average wOBAcon. In case you don’t want to be limited to these two examples, I built a Shiny app so that you can explore the results of this model for the batter-pitcher matchup of your choice.

Next Steps

For this work to be of any use in fielder positioning, we need to include batted ball direction, or horizontal angle, in the predicted trajectory distributions. Currently, the data from Baseball Savant do not include this, but they do include (x, y) coordinates of batted ball location for most balls. This should be good enough to start. Furthermore, to get a complete picture for player evaluation, non-batted ball events need to be considered. Right now, swings and misses are ignored because there is no sensible way to define exit velocity or launch angle for these swings.

One troubling assumption is that the optimal launch angle (in terms of maximizing exit velocity) is about 10 degrees. This is probably true for an average batter whose swing plane is 10 degrees, but a batter whose swing plane is 25 degrees probably has an optimal launch angle closer to 25 degrees.

Note that I have presented you with this model, but I have not yet given you a reason to think it is any good. Model validation is an important next step. Finally, incorporating pitch types and locations into the model could improve the precision.

Scott Powers is a PhD student in statistics at Stanford University. Find much of his sabermetric work here, and follow him on Twitter @saberpowers.
Comments
Peter Jensen
7 years ago

Scott – Congratulations on a well researched and well written article. I think the relatively large number of hit balls that are missing Statcast data may affect the actual numbers and player rankings that you have calculated. I don’t see any mention in the article of what time period you used, but data were missing for about 14.4% of the hit balls in 2015 and about 12.8% in 2016 through 8/3. The missing data are not evenly distributed across launch angles: around 50% of balls classified as popups are missing data, as are 17% of ground balls, while line drives and fly balls are missing only 2 to 3%.

It would be helpful to know whether there are specific vertical angles that Trackman has difficulty tracking, especially for ground balls. It would also help to know whether the percentages of missed data are roughly the same for all venues.

Scott Powers
7 years ago
Reply to  Peter Jensen

Thanks Peter. The time period I used is 2016 through 7/31. I agree that missing data affect the results. Before acting on these results, this problem is one that needs to be explored further. Maybe the missing data could be partially imputed using the trajectory classification and batted ball location.

Eli Barnett
7 years ago

Very interesting to see a parametric take on this – earlier this summer, when I had a lot more spare time, I did a bit of exploratory analysis of the statcast data and had wondered if something like this were possible, but I had been only toying with nonparametric methods and had no real idea how I might go about doing it.

It would be nice to see some cross-validation to test the predictiveness of statcast-trajectory-based stats like these. It seems to me the real question is “over what sample size are these stats more predictive of near-future performance than career averages?”, which is a pretty deep question and will probably require a few more years of statcast data to really chew through.

Jeffrey Cisyk
6 years ago

Scott, you mentioned the use of x, y coordinates of the batted ball – I believe this is coded as hc_x and hc_y when using the statcast search. Do you know the location of home plate and other bases when using their coordinate system? Thanks.