A common assertion when projecting baseball players, at least position players, is that once a batter has had a certain amount of playing time, we generally have a good sense of what he will do in the future. Of course, even when a forecaster accurately estimates a player’s true talent, performance will still vary around that estimate due to chance. This variation is usually modeled with a normal distribution. If baseball performance is normally distributed, then about 68 percent of players will perform within one standard deviation of their mean, 95 percent within two standard deviations, and 99.7 percent within three.
Is baseball performance actually distributed normally, though? I decided to look into one statistic, home runs, to see what kind of distribution emerged. Here is how I did it. I looked at all hitters who had at least 300 plate appearances in a season from 2001 to 2007 and recorded their plate appearances and home runs. I then took each batter’s Marcel projection, using projected home runs per plate appearance as his true talent home run rate. Next, I divided each year’s sample home run total by the projected home run total to come up with a correction factor for selective sampling issues, and multiplied each batter’s true talent home run rate by that factor to get a corrected projected home run rate. Multiplying each batter’s corrected projected home run rate by his actual plate appearances gave that player’s predicted home run total. From there, I calculated each hitter’s binomial standard deviation (http://www.saliu.com/standard_deviation.html#Math). Finally, I subtracted each player’s corrected projected home run rate from his actual home run rate and divided the difference by his binomial standard deviation.
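The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study: the `BatterSeason` fields and function names are hypothetical, the projected home run total is assumed to be each batter's projected rate times his actual plate appearances, and the binomial standard deviation is assumed to be expressed on the rate scale so that it matches the rate difference it divides. The two batter-seasons are made up.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class BatterSeason:
    """One batter-season; field names are hypothetical."""
    pa: int               # actual plate appearances
    hr: int               # actual home runs
    proj_hr_rate: float   # Marcel projected home runs per plate appearance


def correction_factor(seasons):
    """Actual HR total divided by projected HR total for one year's sample.

    Assumption: the projected total uses each batter's actual plate appearances.
    """
    actual_total = sum(s.hr for s in seasons)
    projected_total = sum(s.proj_hr_rate * s.pa for s in seasons)
    return actual_total / projected_total


def z_score(season, factor):
    """Standard deviations between actual and corrected projected HR rate."""
    p = season.proj_hr_rate * factor        # corrected projected HR rate
    sd = sqrt(p * (1.0 - p) / season.pa)    # binomial SD on the rate scale (assumed)
    return (season.hr / season.pa - p) / sd


# Made-up batter-seasons for a single year, for illustration only.
sample_year = [
    BatterSeason(pa=600, hr=30, proj_hr_rate=0.040),
    BatterSeason(pa=500, hr=15, proj_hr_rate=0.040),
]
factor = correction_factor(sample_year)
for s in sample_year:
    print(round(z_score(s, factor), 2))
```

Working on the count scale instead, (actual home runs minus predicted home runs) divided by the square root of n·p·(1 − p), gives the same number. The value returned by `z_score` is the final quotient described above.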
This result equals the number of standard deviations by which a player deviated from his projected mean. I only included players who had a reliability of 0.8 or higher. Reliability is part of the Marcel projection and reflects how much a batter’s projected performance was regressed to the mean. A batter with a reliability of .8 has his performance regressed 20 percent to the mean, a batter with a reliability of .85 has his performance regressed 15 percent to the mean, etc. The batters included in the study are therefore players who have had regular playing time, the sort of players forecasters generally feel they have a good read on. I also only included batters who had played their past three years and their sample year in the same ballpark, because Marcel does not adjust for park factors. After these restrictions, along with the plate appearance requirement, there were 542 batters in the study. Here are the results, plotted on a histogram:
The x-axis shows the number of standard deviations from the projected mean. The y-axis shows how frequently each deviation occurred. The histogram is a pretty decent approximation of a bell curve, with one exception. There were 18 batters who varied more than three standard deviations from the mean. Under a normal distribution, we would expect about two batters to vary by more than three standard deviations given our sample size. In fact, there were two batters who varied by over five standard deviations. Under a normal distribution, five sigma events are very rare (a one in 3,488,555 chance of landing that far above the mean), yet there were two in our sample. It’s interesting to note that both those batters, Adrian Beltre and Javy Lopez, were in their walk years. For those interested, Barry Bonds’ 73-home run season was 4.59 standard deviations above his mean. For us to expect 18 batters to fall outside of three standard deviations under a normal distribution, we would need a sample of roughly 6,000 batters (using the rounded 99.7 percent figure). If we include every position player with at least one plate appearance from 2001 to 2007, we only come up with 4,328 batters.
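For readers who want to check the tail arithmetic, here is a minimal sketch using only Python's standard library. The function and variable names are my own. It uses the exact normal tail probabilities rather than the rounded 68-95-99.7 rule, so the expected count and required sample size come out slightly different from the round numbers quoted above.

```python
from math import erfc, sqrt


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return erfc(k / sqrt(2.0))


def one_sided_tail(k):
    """P(Z > k) for a standard normal Z."""
    return 0.5 * erfc(k / sqrt(2.0))


N_BATTERS = 542        # sample size reported in the study
OBSERVED_3SIGMA = 18   # batters beyond three standard deviations

expected_3sigma = N_BATTERS * two_sided_tail(3)       # about 1.5 batters
sample_for_18 = OBSERVED_3SIGMA / two_sided_tail(3)   # roughly 6,700 batters
odds_5sigma_above = 1.0 / one_sided_tail(5)           # roughly one in 3.5 million

print(expected_3sigma, sample_for_18, odds_5sigma_above)
```

The one-sided tail is used for the five sigma figure because both of the five sigma seasons in the sample were above the projection.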
Given these results, there are still a few things to note about this study. A player’s true talent is not static over a season but changes over time. Marcel projects a player’s true talent for a season based on his past performance, but that does not mean his true talent can’t change. Also, there is a selection bias in this study. For a player to achieve 300 plate appearances in a season, he is usually doing something right. For one thing, he has stayed healthy enough to accumulate 300 plate appearances. For another, by setting a playing time minimum we are likely to see more players who had better-than-expected seasons than players with worse-than-expected seasons. Note that despite this selection bias, there were many more hitters who performed below the mean (294) than above it (248). In a normal distribution, we would expect equal numbers. However, there were 16 batters who performed over three standard deviations above their mean but only two who performed over three standard deviations below it. This is likely explained by the selection bias. If a player is performing much, much worse than expected, there is a good chance he won’t make it to 300 plate appearances.
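One rough way to gauge how unusual the 294-to-248 split would be under a symmetric distribution is an exact sign test. This check is my own addition rather than part of the original study. It assumes each batter is independently equally likely to land above or below his projection, and that no batter landed exactly on it. The function name is hypothetical.

```python
from math import comb

N = 542        # batters in the study
BELOW = 294    # batters who finished below their projected mean


def two_sided_sign_test(n, k):
    """Exact two-sided p-value for a split at least as lopsided as k of n,
    assuming each outcome (above or below) has probability 1/2."""
    extreme = max(k, n - k)
    upper_tail = sum(comb(n, i) for i in range(extreme, n + 1))
    return min(1.0, 2 * upper_tail / 2 ** n)


p_value = two_sided_sign_test(N, BELOW)
print(p_value)
```

A small p-value would mean a split this lopsided is unlikely to arise by chance from a symmetric distribution. The test says nothing about the selection bias discussed above, which pushes in the opposite direction.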
Now, let’s go back to the original question. Does variation in a hitter’s home run output follow a normal distribution? Even given the selection bias, I would have to say no. For one thing, the distribution wasn’t close to being symmetrical. For another, the number of hitters with three sigma+ events was well above what one would expect. While other components of performance may follow a normal distribution, I do not think home runs are one of them. This would suggest that there is more home run upside than one would normally expect from players who have established a track record. Other components of baseball performance need to be examined in the future as well, as this could lead to a better understanding of forecasting and risk.
What do these results mean? For one thing, when it comes to home runs at least, I think this suggests that chance plays a bigger role than many people think. This makes some sense, as a few feet or a strong wind can turn a home run into a double or an out, and vice versa. Perhaps if we looked at extra base hit rate or isolated power instead of home runs we wouldn’t see so many three sigma+ events. Additionally, while many forecasters and writers will look for trendy, young breakout picks at the beginning of each year, sometimes established veterans have the biggest breakouts. Also, there may be more downside than many expect, at least when it comes to home runs. Given the number of hitters who were over three standard deviations above their mean, I’d imagine there would be more hitters three standard deviations below their mean as well. It’s just that many of these hitters don’t make it to 300 plate appearances. So while many will jump on the bandwagon of the next young big thing, there appears to be hidden upside, and possibly downside, among veteran hitters. Predicting who these veterans are, if they are predictable, appears to be a difficult task.