Where, oh where, has that grounder gone?
by Colin Wyers
December 18, 2009
Previously we've discussed some ways that observer location could affect the scoring of air balls. It's a tough topic to study; there are only so many ballparks available to study, which means one has to make do with a smaller sample than one might like.
What we can say for right now is that there appears to be a relationship (slight, but statistically significant) between viewing height and line drive rate. Further study, along with a better model for line drive rate, could show that the relationship is stronger or weaker than this preliminary work suggests. I do want to emphasize that the relationship we are seeing in the data is what we should expect to see based upon what we know about optics, although depending on how you feel about the matter you could consider that either another data point in support of the idea or something that's encouraging confirmation bias on my part.
So let's look at ground balls for a moment. A ground ball is much simpler to study. We don't care too much about the trajectory of the ground ball. We're simply interested in knowing the direction it's headed, typically expressed in the angle it travels from home plate. So how well do we know that?
What we think we know
First, just to be sure we're on the same page, let's define a ground ball as (practically) any ball that lands before it reaches the outfield, or at least a ball that we would expect to land before reaching the outfield. (The exception comes when someone drops or misses an infield fly ball; say he loses a popup in the sun.) There may be a few cases where a low line drive and a grounder can be confused, but they're subtle and rare and probably don't affect our analysis much. Suffice it to say that everyone probably has the same conception of when a batted ball is a ground ball.
The next question is, where does it go? Typically, the field is divided into "zones" that each ball is assigned to, like so:
That chart is based upon the one for Project Scoresheet, the largest source for hit location data available freely and the source for Retrosheet's play-by-play data from 1989 to 1999. Those are the data we'll concern ourselves with for the time being.
The process of collecting that hit location data is simple: In its most basic form, a human being sits there and makes a judgment call. This is true for almost any source of hit location data we have available. The question isn't whether humans make errors; we know they do. The question is about bias: Do those errors occur in ways that don't wash out of our sample with enough repetitions?
Looking for park effects
The most obvious control over where a ground ball goes is the handedness of the batter: A left-handed batter will tend to hit the ball toward the first base side of the field, while a right-handed batter will tend to hit the ball toward the third base side.
We already have some reason to believe that pitchers have little to no control over where their ground balls end up. It doesn't even appear that they have a significant impact on which side of the field a grounder is hit to, once you account for the handedness of the batter. (Of course pitcher handedness is a crude proxy for batter handedness in the aggregate, so if that's all you have it's better than nothing.)
So here's what we'll do. Let's take the average rate that a grounder is hit in each zone, broken down by batter handedness. Using that, we can calculate an "expected" rate of balls in zone for each park from '89 to '99. Then, let's compare that with the actual and see what we get.
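A minimal sketch of that calculation, for readers who want to see the steps spelled out. The records, park names, and function names below are all made up for illustration; only the zone labels 34 and 56 come from the scoring chart, and this is not the code used for the study.

```python
from collections import Counter, defaultdict

# Toy records of (park, batter_hand, zone); every value here is invented.
EVENTS = (
    [("A", "L", "34")] * 3 + [("A", "L", "56")] * 1
    + [("A", "R", "34")] * 2 + [("A", "R", "56")] * 2
    + [("B", "L", "34")] * 1 + [("B", "L", "56")] * 1
    + [("B", "R", "56")] * 4
)


def league_rates_by_hand(events):
    """Share of grounders landing in each zone, split by batter handedness."""
    zone_counts = defaultdict(Counter)
    for _park, hand, zone in events:
        zone_counts[hand][zone] += 1
    return {
        hand: {zone: n / sum(counts.values()) for zone, n in counts.items()}
        for hand, counts in zone_counts.items()
    }


def expected_rates(events, park):
    """Blend the league rates by the park's own mix of left- and right-handed batters."""
    league = league_rates_by_hand(events)
    hands = Counter(hand for p, hand, _zone in events if p == park)
    total = sum(hands.values())
    zones = sorted({zone for rates in league.values() for zone in rates})
    return {
        zone: sum(n * league[hand].get(zone, 0.0) for hand, n in hands.items()) / total
        for zone in zones
    }


def observed_rates(events, park):
    """Share of the park's own grounders recorded in each zone."""
    zones = Counter(zone for p, _hand, zone in events if p == park)
    total = sum(zones.values())
    return {zone: n / total for zone, n in zones.items()}


expected_a = expected_rates(EVENTS, "A")
observed_a = observed_rates(EVENTS, "A")
for zone in expected_a:
    print(zone, round(expected_a[zone], 3), round(observed_a.get(zone, 0.0), 3))
```

The gap between the two printed columns for a park is the quantity of interest: how far the recorded distribution sits from what the batter mix alone would predict.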
Looking at the root mean square error between expected and observed rates in each zone, we get an average of .020, with an average of about 2,000 balls in play at each park. Looking at expected random variance, we see that it should explain an average error of only .008. In other words, over an extended period (sometimes 10 years), we see a difference in the distribution of ground balls between zones that we can't easily explain through random chance.
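To make the comparison concrete, here is a rough sketch of the two quantities. It assumes the expected random error is the usual binomial standard error of a rate, which is an assumption about the form of the calculation rather than something stated above. The 15 percent zone rate and the two-zone rates are hypothetical inputs chosen for illustration.

```python
import math


def rmse(expected, observed):
    """Root mean square error across zones between expected and observed rates."""
    squared = [(observed.get(zone, 0.0) - rate) ** 2 for zone, rate in expected.items()]
    return math.sqrt(sum(squared) / len(squared))


def binomial_se(rate, n_bip):
    """Standard error of a rate from sampling noise alone (binomial assumption)."""
    return math.sqrt(rate * (1.0 - rate) / n_bip)


# Hypothetical inputs: a zone that receives 15% of grounders, in a park with 2,000 BIP.
noise = binomial_se(0.15, 2000)

# Hypothetical expected and observed rates for a two-zone toy park.
toy_error = rmse({"34": 0.50, "56": 0.50}, {"34": 0.53, "56": 0.47})

print(round(noise, 4), round(toy_error, 4))
```

With inputs of that size, sampling noise alone accounts for an error of a bit under .01, so an observed error of .020 would be roughly two and a half times what chance predicts under this assumption.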
What does it mean?
As far as I can tell, there are these potential explanations for this difference:
- Differences in the field of play. For instance, a "crowned" field will see more ground balls at the sides than up the middle (relative to the average) because the slope of the field will tend to push grounders toward the foul lines.
- Individual scorer bias. For instance, some scorers may have been more likely than others to score a ball in the 'tweener zones, like 56 and 34, if it passed between the two fielders.
- Observer position bias. Similar to the problem with line drives, the viewpoint of the scorer could impact what zones he assigns a ball to. So in some parks, a scorer might be more likely to score a ball as being toward the line, or toward the center.
- Particular hitting styles. Any one individual hitter shouldn't have a significant impact on our total figure. But if a team has a certain hitting philosophy and is able to impart it to all of its players efficiently, that could move the needle a little.
Given the data at hand, it's hard to say for sure how much each effect acts upon what we see in the data (and on the field of play). I think at the very least it's safe to say there is a significant difference in the recorded distribution of ground balls in play from park to park, one that does not seem to be attributable to random error. What we don't know yet is how much of that is attributable to observer error and how much of that is attributable to actual differences in batted ball distribution.
References and Resources
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".
A bit more of a technical explanation: First I figured the rate of balls in each zone per total balls in play for each park. Then I calculated the average balls in play rate in each zone, based upon batter handedness. Using that, I computed an "expected" average based on the number of right- and left-handed batters in each park. Then I calculated the RMSE between the expected average and the observed. To figure out the expected random error for each zone, I calculated it as:
using the number of BIP in each park, so that parks with a smaller number of BIP (parks that were in service for fewer seasons, mostly) had a higher expected random variance. Then I took the weighted average of each.
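For reference, one standard way to write the expected random error of an observed rate is the binomial standard error. The form below is an assumption about the calculation, and the symbols are illustrative: p_z is the expected rate for zone z, n_k is the number of BIP in park k, and weighting the average by BIP is likewise assumed.

```latex
\sigma_{z,k} = \sqrt{\frac{p_z \,(1 - p_z)}{n_k}},
\qquad
\bar{\sigma}_z = \frac{\sum_k n_k \, \sigma_{z,k}}{\sum_k n_k}
```

Under this form, a park with fewer BIP has a larger expected error for every zone, since n_k sits in the denominator.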
Colin Wyers knows exactly how much of a nerd he is. He is very interested in hearing about any other concerns you may have; you can reach him by e-mail, and he will try his best to respond in a timely fashion. He also blogs at Statistically Speaking.