A couple of weeks ago we offered several ways to try to quantify the excitement of baseball games using Win Expectancy and Leverage Index. At the end of this article, we’ll present the ultimate list of the greatest baseball games since 1974. That will end any debate on the subject.
(You know I’m not serious, don’t you?)
A quick summary of part one
We’ve identified several variables, each measuring an unknown portion of one (or more) of three latent dimensions: equilibrium, rally and late game importance. In most cases there is some overlap in what the variables measure; for example, here is how a couple of proposed variables were introduced in part one:
To get a sense of the importance of the last phases of a game, we can look at when the moment with the highest Leverage Index occurs. We can indicate the moment as a percentage of game played.
Running a correlation between the Leverage Index and the percentage of game played, we can gauge the increasing tension of tight games, as opposed to the fading interest of a lopsided contest.
The two are obviously highly correlated, since a game with a growing Leverage Index has by definition the highest Leverage Index occurring late in the game. The correlation between the two variables is .85, meaning that 72 percent (.85 squared) of the variation of one variable can be explained by the variation of the other one. Thus the loss of information if just one of the two is used would be minimal; however, it would be ideal to be able to keep the non-overlapping part of information while removing the “double counting.” More on this in a few paragraphs.
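As a rough sketch of the two variables quoted above, the snippet below computes a game's crescendo and the time of its highest Leverage Index from a play-by-play list. The function names and the sample Leverage Index values are made up for illustration; only the .85 correlation comes from the actual data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def crescendo(leverage):
    """Correlation between Leverage Index and fraction of game played."""
    n = len(leverage)
    played = [(i + 1) / n for i in range(n)]
    return pearson(leverage, played)

def time_of_highest_li(leverage):
    """Fraction of the game played when the highest Leverage Index occurs."""
    peak = max(range(len(leverage)), key=lambda i: leverage[i])
    return (peak + 1) / len(leverage)

# Hypothetical Leverage Index values, one per play (not real games).
tight_game = [0.9, 1.0, 1.1, 1.3, 1.2, 1.8, 2.4, 3.1]
blowout = [0.9, 1.2, 0.8, 0.5, 0.3, 0.2, 0.1, 0.1]

print(crescendo(tight_game), time_of_highest_li(tight_game))
print(crescendo(blowout), time_of_highest_li(blowout))

# Share of variance two variables have in common when r = .85.
shared = 0.85 ** 2
print(shared)
```

A tight game should show a positive crescendo and a late peak, a blowout the opposite, which is why the two variables overlap so much.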
A few more ways to skin this cat
Commenting on part one, Paul suggested counting the number of times the Win Expectancy crosses the 50 percent line “as a way to try to find back-and-forth type games.” This 1982 epic duel between the Dodgers and the Cubs has the highest count with 29; the Tigers-Twins tiebreaker of 2009, which gave Paul the idea, has 16 crossings.
Image courtesy of FanGraphs.
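Paul’s crossing count is simple to sketch. The version below assumes a list holding one team’s Win Expectancy after each play; skipping values that sit exactly on the line is my own assumption, and the sample numbers are invented.

```python
def count_crossings(win_expectancy, line=0.5):
    """Count how many times a Win Expectancy series switches sides of `line`.

    Values exactly on the line are skipped (an assumption), so a series
    that touches 50 percent and turns back is not counted as a crossing.
    """
    crossings = 0
    previous_side = None
    for we in win_expectancy:
        if we == line:
            continue
        side = we > line
        if previous_side is not None and side != previous_side:
            crossings += 1
        previous_side = side
    return crossings

# Hypothetical home-team Win Expectancy after each play.
home_we = [0.50, 0.55, 0.46, 0.52, 0.61, 0.48, 0.35, 0.58, 1.00]
print(count_crossings(home_we))
```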
I also came up with another couple of candidate variables.
One looks at the moment when the winning team cashed the game. I decided to identify that moment as the one when the Win Expectancy for the winning team goes over 90 percent for good.
For example, in this interleague game between the White Sox and the Reds, Cincinnati’s chances of winning the game went from 75.8 percent to 93.1 percent when Ramon Hernandez lined out to shortstop and Jerry Hairston was doubled off second base; that was the 66th play of the game, and the Win Expectancy for the Reds never fell below the 90 percent line for the rest of the game (which lasted 75 plays). In such a case, the moment when the winning team cashed the game is marked at 88 percent.
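Here is a minimal sketch of how that moment could be located, walking backward from the final play. The function name is hypothetical, and apart from the 75.8 and 93.1 percent figures and the play counts quoted above, the Win Expectancy values are placeholders.

```python
def time_game_cashed(winner_we, threshold=0.90):
    """Fraction of the game played when the winner's WE tops `threshold` for good.

    winner_we[i] is the winning team's Win Expectancy after play i + 1.
    """
    n = len(winner_we)
    cashed_play = n  # fallback: the game is decided on the last play
    # Walk backward for as long as the WE stays above the threshold.
    for play in range(n, 0, -1):
        if winner_we[play - 1] > threshold:
            cashed_play = play
        else:
            break
    return cashed_play / n

# 75 plays: placeholder values through play 64, 75.8 percent after play 65,
# 93.1 percent from play 66 on, and 100 percent after the final play.
reds_we = [0.60] * 64 + [0.758] + [0.931] * 9 + [1.0]
print(time_game_cashed(reds_we))
```

With 66 of 75 plays, the sketch lands on the same 88 percent mark.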
The other variable added since part one looks at the moment when the losing team has its highest Win Expectancy. Combined with the actual maximum Win Expectancy value for the loser (as seen in part one), it should give better information about the rally and its timing. Here in the first game of a doubleheader between the Tigers and the Indians (1980), Detroit got its highest Win Expectancy (98.3 percent) when it had the game all but sealed (up by two, bases empty, two outs) in the bottom of the ninth, when 96 percent of the regulation game had been played. Gary Gray had other plans that day: He homered two batters later to tie the game and drove in the winning run in the 13th.
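The pair of loser-side variables can be sketched in a few lines. How extra-inning games are scaled is not spelled out here, so the optional `denominator` argument is purely an assumption of mine, as are the function name and the sample values.

```python
def loser_peak(loser_we, denominator=None):
    """Return (highest WE of the losing team, fraction of game played at that point).

    loser_we[i] is the losing team's Win Expectancy after play i + 1.
    `denominator` is the number of plays used to scale the timing; it
    defaults to the length of the game (an assumption).
    """
    peak_we = max(loser_we)
    peak_play = loser_we.index(peak_we) + 1  # first play reaching the peak
    total = denominator if denominator is not None else len(loser_we)
    return peak_we, peak_play / total

# Hypothetical loser Win Expectancy: a big lead that slips away.
sample = [0.50, 0.60, 0.983, 0.50, 0.0]
print(loser_peak(sample))
```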
Combining and not double counting
Factor analysis is an advanced statistical technique that can be very summarily described as follows. It combines a large number of more or less correlated variables into a smaller number of uncorrelated ones. In going from many dimensions to a handful (usually two or three), the method aims to lose as little information as possible. Finally, the resulting variables, computed as combinations of the original ones and called factors, should identify latent traits of the observed phenomenon.
It looks like the right tool for the problem at hand. The ideal result for the exciting games problem would be to reduce the 12 variables we have used to three factors identifying equilibrium, rally and late game importance.
Luckily, that’s sort of what happens when performing a factor analysis on the complete data set of regular season games played since 1974. Three factors explain 76 percent of the variability captured by the 12 original variables. (Warning: it’s not a given that the original variables explain the whole phenomenon of game excitement; on the contrary, in a subjective field like this one, we can be sure that’s not the case.)
Looking at the correlation between each factor and each original variable helps to understand what the factors measure.
(Correlation values between -.3 and .3 are not reported.)
Factor1 Factor2 Factor3
crescendo 0.86 0.32
90th pctl LI 0.65 0.47 0.43
mean WE swing 0.66 0.50 0.50
time decisive play 0.58 0.44 0.37
time game cashed 0.79 0.36
highest LI 0.66 0.48
time highest LI 0.83
highest loser WE 0.72
top play WPA 0.31 0.78
time highest WE loser 0.35 0.64 0.37
distance 50-50 WE -0.57 -0.70
crosses 50-50 WE 0.37 0.65
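To give a feel for where such correlations come from, here is a deliberately simplified sketch. It is not the factor analysis used for this article: it extracts only the first principal component of a small correlation matrix by power iteration, a simpler stand-in for the same idea. The three columns of data are invented, and all function names are hypothetical.

```python
from math import sqrt

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mean = sum(col) / n
        sd = sqrt(sum((v - mean) ** 2 for v in col) / n)
        standardized.append([(v - mean) / sd for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
            for ci in standardized]

def first_component(corr, iterations=500):
    """Leading eigenvalue and unit eigenvector of `corr` via power iteration."""
    k = len(corr)
    vec = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    product = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
    eigenvalue = sum(v * p for v, p in zip(vec, product))
    return eigenvalue, vec

# Invented values for six games (not real data).
crescendo_values = [0.1, 0.4, 0.5, 0.7, 0.8, 0.9]
time_highest_li = [0.2, 0.5, 0.4, 0.8, 0.9, 1.0]
crossings = [3, 1, 6, 2, 5, 4]

corr = correlation_matrix([crescendo_values, time_highest_li, crossings])
eigenvalue, vec = first_component(corr)

# Share of total variance captured by the component.
share = eigenvalue / len(corr)
# Correlation of each original variable with the component.
loadings = [v * sqrt(eigenvalue) for v in vec]
print(share, loadings)
```

In this toy example the two timing variables load heavily on the component while the crossing count does not, the same kind of pattern used to name the factors.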
Equilibrium, rally and late game importance—once more, with feeling
The first factor is highly correlated with the game crescendo (the correlation between time of the game and Leverage Index), the moment when the highest Leverage Index occurs (time highest LI), and when the winning team put its hands on the game for good (time game cashed). It appears this factor defines the importance of the final part of the game.
The game with the highest score on this factor is the Mets-Cardinals marathon of last April, with all the important plays occurring past the ninth inning. Coincidentally, the second game ranked by factor one was played in 1974 by the same two teams.
For an example of a non-infinite game scoring high on this measurement, look at this Rangers-White Sox game from 1988. All the red bars (indicating Leverage Index of five or higher) are at the end of the contest.
Image courtesy of FanGraphs.
The second factor is correlated with the Win Probability Added by the biggest play of the game (top play WPA), the highest Win Expectancy reached by the losing team (highest loser WE) and the moment when the losing team reached its highest Win Expectancy (time highest WE loser). Summing up, this factor captures the rally component of games.
The game scoring highest on this factor is a Padres-Dodgers affair from 1977: The home team actually rallied from just a one-run deficit, but that happened with two outs in the bottom of the 10th. The Expos were the leading actors in the match ranked second by the rally factor, at the expense of the Padres: they came back, again in the bottom of the 10th, from a 7-4 deficit.
Finally, the third factor is related to the closeness to the 50-50 Win Expectancy line (distance 50-50 WE) and the measure proposed in the comments by Paul; i.e., the number of times the Win Expectancy crosses the same 50-50 line (crosses 50-50 WE). Thus, factor number three is the one identifying equilibrium.
At the top of the ranking from factor three we have a 13-inning affair between the Twins and the White Sox, back in 1982. The game was never in the hands of either team, the only big break being a two-out wild pitch by Jeff Little which brought Carlton Fisk home with the go-ahead run in the eighth (quickly answered by Randy Johnson’s homer in the top of the ninth).
It’s time to rank every regular season game played since 1974 according to the factor analysis; to crown the most entertaining game ever, the factors should be further combined into a single index. That could be done in several ways, each one having its own merits and faults. One would be to assign subjective weights to each factor. I could lean toward ranking equilibrium higher, while someone else would prefer rally games.
To provide the rankings that follow, I simply summed the ranking of a game according to each factor and sorted the games according to the sum obtained.
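The rank-sum idea can be sketched as follows. The games and factor scores are invented, the function names are hypothetical, and the sketch assumes that a higher score on every factor means a more exciting game.

```python
def rank_positions(scores):
    """Map each game to its rank on one factor (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {game: position for position, game in enumerate(ordered, start=1)}

def combined_ranking(factor_scores):
    """Order games by the sum of their per-factor ranks, smallest sum first."""
    totals = {}
    for scores in factor_scores:
        for game, rank in rank_positions(scores).items():
            totals[game] = totals.get(game, 0) + rank
    return sorted(totals, key=totals.get)

# Invented factor scores for three hypothetical games.
late_importance = {"A": 2.1, "B": 0.3, "C": 1.0}
rally = {"A": 0.5, "B": 1.9, "C": 1.2}
equilibrium = {"A": 1.5, "B": -0.2, "C": 1.4}

print(combined_ranking([late_importance, rally, equilibrium]))
```

Summing ranks rather than raw scores gives each factor the same weight, whatever its scale; it also means a game needs to do well on all three to reach the top.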
So here are the top 10 games since 1974:
1. Brewers @ White Sox – May 8, 1984.
2. D’Backs @ Giants – May 29, 2001.
3. Padres @ Expos – May 21, 1977.
4. Mariners @ Angels – April 13, 1982.
5. Cubs @ Phillies – Sept. 29, 1980.
6. Brewers @ Expos – April 24, 2002.
7. Orioles @ Red Sox – Oct. 3, 1976.
8. Padres @ Dodgers – Sept. 13, 1982.
9. Reds @ Braves – July 18, 2007.
10. Expos @ Astros – July 7, 1985.
And here are the best since 2006, all available thanks to the MLB.tv archives (no links to results, in case you are planning to watch them):
1. Reds @ Braves – July 18, 2007.
2. Rockies @ Padres – April 17, 2008.
3. Reds @ Padres – May 25, 2008.
4. Dodgers @ Padres – April 29, 2007.
5. Athletics @ Blue Jays – April 10, 2008.
6. Pirates @ Cubs – May 8, 2008.
7. Phillies @ Nationals – Sept. 27, 2006.
8. Dodgers @ Cardinals – July 29, 2009.
9. Cardinals @ Red Sox – June 22, 2008.
10. Red Sox @ White Sox – July 9, 2006.
There are a couple of natural directions this series can take.
One is taking into account the importance of the game in the context of the season, to rank games that are not only exciting but also meaningful; to do so, the Championship Leverage Index tool could be borrowed from Sky Andrecheck. Unfortunately, running the code to assign each game its Championship Leverage Index requires a couple of hours per season, so I won’t go after it in the immediate future. What you’ll likely get in the coming weeks is the ranking of postseason games and series; thus we’ll come full circle in revisiting Dennis Boznango’s articles from 2005 (part 1 – part 2).
References & Resources
Keep an eye on THT Live during the next few days. A gift for you is coming soon!