It is difficult to understand why statisticians commonly limit their inquiries to averages, and do not revel in more comprehensive views.
- Francis Galton, who invented correlation, discovered regression to the mean and founded eugenics. But that’s another story.
Last week, I published an article that highlighted the White Sox’s unique distribution of runs scored per game this year. The article received some attention and generated some e-mails, in which several people questioned whether I had learned anything really useful. In retrospect, I probably gave the complicated subject too little explanation, so I intend to make it up to you today.
I collected all the game results from 2000 through 2004: a total of 12,142 games, or 24,284 different instances of runs scored. (God bless Retrosheet.) I then added up the number of games in which teams scored a specific number of runs. Here’s the result:
There was an average of 4.82 runs per game scored during these five years, but you can see that the most frequent numbers of runs scored were three (13.4%), four (12.9%) and two (12.1%). In other words, the most common numbers of runs scored were all less than the overall average.
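For anyone who wants to try the tally themselves, here is a minimal Python sketch of the counting step. The list of scores is a small made-up sample, not the Retrosheet data, and the function name is my own invention for illustration.

```python
from collections import Counter

def run_distribution(runs_per_game):
    """Return {runs: share of games} for a list of single-game run totals."""
    counts = Counter(runs_per_game)
    total = len(runs_per_game)
    return {runs: counts[runs] / total for runs in sorted(counts)}

# Hypothetical sample of single-game run totals (NOT the 2000-2004 data).
sample = [0, 1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 9, 12]

distribution = run_distribution(sample)
average = sum(sample) / len(sample)
# The single most frequent run total in the sample.
most_common = max(distribution, key=distribution.get)

print(f"average: {average:.2f}, most common: {most_common}")
```

Even in this toy sample, a couple of big games pull the average above the most common score, which is the same shape the real data shows.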
There’s a simple explanation for this: teams can’t score fewer than zero runs, but they can score as many as the opposition will allow. In other words, there is a lower limit on runs scored, but no upper limit. And this has some interesting implications.
For instance, if your league averaged five runs a game, and your team scored exactly five runs in every game, it would typically post a winning percentage of about .600 instead of .500, even though it had scored only the average number of runs. That is the power of looking at distributions instead of averages.
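The reasoning can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the opponent distribution is a toy one I made up so that it averages exactly five runs, and games tied at five are simply split evenly. The exact result depends on the shape of the distribution, so the toy number will not match the real one.

```python
def fixed_score_win_pct(runs_scored, opponent_distribution):
    """Expected winning percentage for a team that always scores `runs_scored`.

    opponent_distribution maps runs allowed -> share of games.
    Assumption: games in which the opponent scores the same number are split evenly.
    """
    wins = sum(
        share
        for runs, share in opponent_distribution.items()
        if runs < runs_scored
    )
    ties = opponent_distribution.get(runs_scored, 0.0)
    return wins + 0.5 * ties

# Hypothetical right-skewed distribution of opponent runs (NOT real data).
toy_distribution = {
    0: 0.05, 1: 0.07, 2: 0.13, 3: 0.14, 4: 0.13, 5: 0.11, 6: 0.09,
    7: 0.07, 8: 0.06, 9: 0.04, 10: 0.06, 13: 0.03, 16: 0.02,
}

toy_average = sum(runs * share for runs, share in toy_distribution.items())
toy_win_pct = fixed_score_win_pct(5, toy_distribution)

print(f"opponent average: {toy_average:.2f}, win pct at 5 runs: {toy_win_pct:.3f}")
```

Because the long right tail drags the average up, more than half of the opponent's games fall at or below five runs, so the steady five-run team wins more than half the time.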
To further illustrate the point, here is a table of the winning percentage of teams that scored exactly the given number of runs in a game from 2000 through 2004, along with the incremental impact of each additional run on winning percentage:
RS    Win%    Diff
 0    .000
 1    .077    .077
 2    .208    .131
 3    .339    .131
 4    .471    .132
 5    .593    .122
 6    .686    .092
 7    .776    .090
 8    .840    .064
 9    .874    .034
10    .921    .047
11    .939    .018
12    .963    .025
13    .987    .024
14    .978   -.009
15    .976   -.001
16    .983    .007
17   1.000    .017
In terms of winning ballgames, the second through the fifth runs have the most impact, followed by the sixth and seventh runs, and then the first run.
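The Diff column is just each winning percentage minus the one before it. Here is a short Python sketch that recomputes it from the rounded Win% values in the table and ranks the runs by impact. Because it starts from rounded figures, a few results differ from the printed Diff column by .001.

```python
# Winning percentage by runs scored, 2000-2004, copied from the table above.
# The list index is the number of runs scored (0 through 17).
win_pct = [
    .000, .077, .208, .339, .471, .593, .686, .776, .840,
    .874, .921, .939, .963, .987, .978, .976, .983, 1.000,
]

# Marginal value of the Nth run: Win%(N) - Win%(N - 1).
marginal = {
    n: round(win_pct[n] - win_pct[n - 1], 3)
    for n in range(1, len(win_pct))
}

# Runs ordered from most valuable to least valuable.
ranked = sorted(marginal, key=marginal.get, reverse=True)

for n in ranked[:7]:
    print(f"run {n}: +{marginal[n]:.3f}")
```

The top seven come out as runs two through five, then six and seven, then the first run.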
Of course, the distribution of runs scored (and their impact) changes as the average number of runs scored changes. When Bill James first published this kind of data nearly 20 years ago (in the New York Mets section of the 1986 Baseball Abstract) he found that the first run had about the same impact as the next four in the National League, which averaged 4.07 runs per game, but the first run was worth less in the American League, which averaged 4.56 runs per game.
Let’s return to the last five years and look at the difference in distribution between the American League (which averaged 4.97 runs/game) and the National (4.68):
The number of times teams scored zero through four runs was higher in the lower-scoring National, and the number of times a team scored six or more runs was higher in the American League. The curve moves to the left as the average decreases, but not uniformly. More like a wave breaking against the “zero” barrier.
The distinction is kind of subtle, so let me give you a more dramatic example: the 2003 Dodgers, who were the lowest scoring team of the last five years and one of the worst offensive teams of all time:
There are more ups and downs on this graph, because the number of games is smaller. But you can clearly see the pattern. The Dodgers scored zero to four runs more often than average and more than seven runs less often than average. Of course, the opposite is true of high-scoring teams. Let’s graph the highest-scoring team of the last five years, the 2000 White Sox:
The White Sox really hit their stride at six runs and more, which brings up an issue. The most important runs in a game are numbers two through seven, if your pitching staff allows runs according to a “normal” distribution. And low-scoring teams tend to score relatively more runs in that range. So low-scoring teams actually yield more wins per run from their offense than high-scoring teams do.
The impact isn’t quite as strong as these graphs imply, because teams that score eight or more runs have already scored runs two through seven, but the 10th, 11th and 12th runs aren’t adding as much as the first run scored by a low-scoring team.
You can calculate how much an offense contributed by multiplying the number of games in which it scored each run total by the average winning percentage for that many runs scored. For instance, the 2000 Brewers scored four runs in a game 33 times, 12 more than the overall average of 21. Teams that scored four runs had a .471 winning percentage, which means that the Brewers’ offense contributed about 5.7 wins (12 times .471) by scoring four runs that many more times. If you do that for every run total and add them up, you get a composite picture of what the offense contributed to the team’s winning percentage.
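As a rough Python sketch of that calculation (the function and variable names are mine, purely for illustration), you take the difference between the team's game count and the average game count at each run total, weight it by the winning percentage for that total, and add everything up. Only the four-run bucket for the Brewers is filled in, because that is the only one quoted here.

```python
def wins_above_average(team_games, average_games, win_pct):
    """Sum of (team games - average games) * win pct over every run total.

    team_games and average_games map runs scored -> number of games;
    win_pct maps runs scored -> winning percentage at that run total.
    """
    return sum(
        (team_games.get(runs, 0) - average_games.get(runs, 0)) * win_pct[runs]
        for runs in win_pct
    )

# The one bucket quoted in the text: the 2000 Brewers at four runs.
brewers_four_runs = wins_above_average(
    team_games={4: 33},
    average_games={4: 21},
    win_pct={4: 0.471},
)

print(f"{brewers_four_runs:.2f} wins from the four-run bucket")
```

With full distributions, the extra games at one run total are offset by fewer games at others, so the sum comes out as a net figure for the whole offense.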
In fact, the offense that contributed the most wins over the average distribution belonged to the 2003 Dodgers: about 5.5 wins more than you would expect given their average number of runs scored. The offense that contributed the fewest wins compared to the average distribution was that of the 2000 Rockies, whose run distribution looked like this:
Low-scoring teams are shut out more often too. But teams that aren’t shut out often and that score two through seven runs a game more often than average make the biggest contribution of all. This just doesn’t happen very often, maybe once every few seasons. The 2003 Indians, for instance, scored only 4.3 runs a game, but they had a great spike at three runs a game:
And that’s why I was impressed with the Sox’s run-scoring distribution so far this year.
I haven’t even touched on the implications of all this for pitching staffs, runs allowed or the Pythagorean formula. Maybe next time…
References & Resources
Jon Daly covered the 2004 distribution of both runs scored and allowed in this offseason article.
There is a discussion thread about this article at Baseball Think Factory, where it’s been appropriately pointed out that I didn’t include park factors in my analysis of the Dodgers and Rockies. I purposely did that because I didn’t want to make my analysis more complicated, but I should have mentioned it.