Elo vs. Regression to the Mean: A Theoretical Comparison

One of these ranges of estimates is wider than the other. (via Adam Dorhauer)

Let’s say a team has won 15 of its first 25 games. How many of its remaining games would we expect the team to win?

This question is central to projecting team (and player) talent levels. We’d expect the percentage to be less than 60 percent, because there are a lot fewer teams that end up winning 97 games than start off winning 15 of 25, which means most teams that start off at that pace don’t keep it up. But just how much do we expect a team playing .600 ball to drop off?

The simplest solution is to use regression to the mean, which is one of the fundamental concepts behind most baseball projection systems. The way it works is basically this: You take the team’s observed record and add a certain number of games of league-average performance. That gives you an estimate of how you would expect the team to perform going forward.

In the case of major league teams, the regression constant—the number of games of league average you add—happens to be around 70 games, meaning you’d add 35 wins and 70 games to a team’s observed record. For a 15-10 team, that means about (15+35)/(25+70) = .526 over 95 total games. If a team is 60-40, that becomes (60+35)/(100+70) = .559. The more games a team keeps up the .600 pace, the more likely it becomes that the team actually is close to a .600 team going forward.
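As a minimal sketch, here is that calculation in Python (the function and variable names are mine; the 70-game constant is the one cited above):

```python
# A minimal sketch of regression to the mean for a team's record.
# The ~70-game regression constant is the value cited in the article.

def regressed_estimate(wins, games, constant=70, mean=0.5):
    """Blend an observed record with `constant` games of league-average play."""
    return (wins + mean * constant) / (games + constant)

print(round(regressed_estimate(15, 25), 3))   # 0.526 for a 15-10 team
print(round(regressed_estimate(60, 100), 3))  # 0.559 for a 60-40 team
```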

Another way to project future performance is with Elo ratings, a system initially developed by Arpad Elo for chess, which has since been applied to many other games and sports. While not as common in baseball circles as regression to the mean, Elo recently has gained prominence in mainstream team sports as the basis for FiveThirtyEight.com’s team ratings.

Elo aims to accomplish a similar goal to regression to the mean—estimating a team’s strength and projecting its likelihood of winning future games—but it does so via a completely different mathematical approach. Elo works by assigning each team a rating (an arbitrary, usually four-digit, integer value, such as 1500) and having teams trade rating points based on the results of each game. The difference between two teams’ ratings represents the probability of one team beating the other and is used to determine how many points the winner takes from the loser, so that a higher-rated team will take fewer points from beating a lower-rated team than the reverse.

The idea is that as long as a team plays at the same strength as its rating, each loss will cost enough points to offset any gains from wins, and vice versa. If a team is better or worse than its rating indicates, it will win or lose more games than Elo expects and continue gaining or losing rating points until its rating accurately reflects its true strength.

The number of points exchanged in each game is given by the following formula:

ΔElo = K*(Score-Expected)

Score is the result of the game (one for a win, zero for a loss), Expected is the expected win percentage between the two opponents based on their ratings before the game, and K is a factor chosen to determine how quickly the ratings change. Nate Silver (editor-in-chief of FiveThirtyEight) suggested in his original Baseball Prospectus article on the subject that a K factor of four is ideal for major league baseball, which FiveThirtyEight continues to use. (Silver describes the K factor as the “velocity” of the rating changes in the Baseball Prospectus article rather than calling it the K factor.)
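A single-game update looks like this as a Python sketch (helper names are mine; the expected-score formula is the standard Elo logistic curve on a 400-point scale):

```python
# A minimal sketch of one Elo update, using K = 4 as suggested for MLB.

def expected_score(rating, opp_rating):
    """Standard Elo expected score: win probability implied by the rating gap."""
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

def elo_update(rating, opp_rating, score, k=4):
    """Return the new rating after one game (score: 1 for a win, 0 for a loss)."""
    return rating + k * (score - expected_score(rating, opp_rating))

# An evenly matched win is worth K/2 = 2 points:
print(elo_update(1500, 1500, 1))  # 1502.0
```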

Conceptually and mathematically, Elo and regression estimates take fairly different approaches, so it’s not at all obvious how they compare at first glance. Elo ratings can be converted to an expected winning percentage against a .500 team. (For example, if an average team is rated 1500, a 1550-rated team will have an expected .571 winning percentage against an average team.) So we at least can express the two estimates in the same terms, but even then it’s not easy to see how the two relate because of the different steps each takes to get to its result.
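Using the expected_score helper from the sketch above, that conversion is a one-liner:

```python
# Expected winning percentage of a 1550-rated team against a 1500 (average) team.
print(round(expected_score(1550, 1500), 3))  # 0.571
```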

A Simulated Comparison

To get a better grasp of the comparison, let’s look at an example.

I simulated 10,000 games for a team with a 95-win talent level against a league-average opponent and calculated its expected winning percentage using Elo (the blue line in the graph) and regression to the mean (the dark red line in the graph) after each game. I also graphed the team’s cumulative win-loss record (bright red):

[Figure: 10,000-game simulation showing the Elo forecast (blue), regression to the mean (dark red), and the cumulative win-loss record (bright red)]
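A rough sketch of that simulation, reusing the helper functions from the earlier code sketches (the structure and seed are mine; a 95-win talent level is about a .586 winning percentage):

```python
import random

random.seed(1)
talent = 95 / 162                 # ~.586 true winning percentage
rating = 1500.0                   # team starts rated as league average
wins = 0

for game in range(1, 10_001):
    score = 1 if random.random() < talent else 0
    wins += score
    rating = elo_update(rating, 1500.0, score)      # league-average opponent
    elo_forecast = expected_score(rating, 1500.0)   # Elo's forecast vs. a .500 team
    reg_forecast = regressed_estimate(wins, game)   # regression's forecast
```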

Immediately, we see that Elo is a much more volatile estimate of talent than regression, at least over the long term. The regressed estimate settles into a relatively tight range and tracks the long-term observed performance more and more closely, sort of like a Randy Johnson who starts off wildly hurling 100-mph fastballs in the general direction of home plate but then learns to throw strikes as his career matures.

Elo, on the other hand, follows more of a Nolan Ryan mold, continuing to bounce around based on the short-term run of wins and losses without ever really settling down. After a few hundred games, it’s already roughly as precise as it’s ever going to get.

It’s worth pointing out that 162 games isn’t nearly long enough for these long-term differences to manifest fully, so this isn’t necessarily going to be the biggest concern if we’re using only a single season of data. Still, it’s important we understand why the two methods behave so differently in the long term to better understand how they compare in general, and it is something that will come into play if you are putting together long-term ratings.

If we go back to the formula for calculating changes in Elo rating, recall that the K factor was important in determining how quickly the ratings adjust to new information. Specifically, the number of points you gain or lose from a game depends on the rating of your opponent, but the difference between winning and losing a game will always equal the value of the K factor.

For example, using a K factor of four, if a team plays an opponent with the same rating as itself, the team would gain two points for a win and lose two points for a loss. If it plays a stronger opponent, the team might gain 2.5 points for a win and lose only 1.5 points for a loss, but whatever those figures are, they always will add up to four, or whatever value you choose for the K factor.

Four points of Elo is equivalent to a difference of about .00576 in a team’s forecast winning percentage, or about 0.93 wins per 162 games, at least for teams around .500. The further a team gets from .500, the less difference four points makes in the estimated winning percentage, but within the range major league teams occupy, the difference between a win and a loss is generally in the range of 0.90-0.93 wins per 162 games in the team’s projected strength.
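Those figures fall out of the slope of the Elo expectation curve; the derivative of the expected score with respect to the rating gap is ln(10)/400 × p(1−p). A quick check:

```python
import math

k = 4
# Slope of the expected-score curve at a .500 matchup (p = 0.5).
slope = math.log(10) / 400 * 0.5 * 0.5
print(round(k * slope, 5))        # ~0.00576 difference per game decided
print(round(k * slope * 162, 2))  # ~0.93 wins per 162 games
```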

This depends only on the K factor (and, to a much smaller extent, how far the team’s rating is from the league average) and not how many games have been played. Whether it is Opening Day or the final game of the season, a win or a loss always will have the same level of impact.

This is different from regression, which starts with an uncertain, highly regressed forecast and then gets more and more precise as the number of games increases and the impact of each game decreases. The basis of regression to the mean is that you know a lot more about a team after 162 games than after one, so each subsequent game adds less new information to what you already had and thus moves your projection by a smaller amount.

Because of this, regression will be quicker to adjust to results at the start of the season, and as the season goes on, it will slow down as it becomes more confident with more data. The point at which regression goes from giving more to less weight to each game than Elo is right around 104 games, which means, compared to regression, Elo will never act like it has more or less than 104 games to work with when determining how much to move its estimate.

(If you’re curious how to get the 104-game value, the difference between a win and a loss in a team’s projected winning percentage using regression to the mean is 1/(n+C), where n is the number of games and C is the regression constant. To find the point where regression gives a game the same impact as Elo in our example, you would solve 1/(n+70) = .00576 for n, which is ~104.)

General Principle No. 1

Using a K factor of four and a regression constant of 70, Elo gives every game approximately the same impact on its estimate of team strength as the 104th game using regression to the mean. Before 104 games, regression gives each game more impact than Elo; afterward, regression gives each game less impact.

Weighting Past Results

Given that, it might be tempting to think Elo is analogous to a regression estimate that is limited to using only the most recent 104 games of data. We can check this assumption by recalculating the regressed estimate in our sim using only the previous 104 games. The blue line in the following graph shows this revised estimate, while the red line is the normal regression estimate from the previous graph:

[Figure: regression estimate using only the previous 104 games (blue) vs. the full regression estimate (red)]

This looks significantly more like the Elo graph than the normal regression graph. Elo does, in fact, behave similarly to a regression estimate with a limited sample size: At some point, Elo hits a wall where it can’t really get any more precise no matter how much data you throw at it.

However, while it is true that Elo is effectively limiting its sample size, just taking the last 104 games is not a very good approximation of that limit. The above graph does somewhat resemble the Elo graph, but it is even more volatile, indicating that we’ve restricted our sample size too much.

Part of the reason just using the previous 104 games doesn’t mirror Elo well is that this approach doesn’t accurately reflect how Elo weights past performance. In this example, any game outside the previous 104 has zero impact on the current estimate. Elo has no such cliff where a game is suddenly dropped from affecting the rating.

If two teams have the exact same performance over the past 104 games, but one was higher rated at the start of that 104-game stretch, that team will still be higher rated, just not by nearly as much. So the impact of those previous games is still there, just muted. Each game’s effect on the current rating gradually lessens until at some point, it becomes negligible, but it is still technically there.

A better approximation of how Elo ratings work would be to apply an exponential decay to each game in the sample, so that each game further back gets less and less weight. If you start with the most recent game getting full weight, and each previous game getting 103/104 of the weight of the following game, then you end up with a geometric series where the total combined weight of all previous games will never exceed 104.

This still gives us an effective sample weight of 104 games, but it reduces the volatility of the resulting estimates to something much closer to what we see in the Elo graph because it still includes information from a broader sample of games.
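A sketch of that decayed estimate, built on the same idea as the regressed_estimate function above (the 103/104 decay caps the total weight at 104 games; the loop structure is mine):

```python
def weighted_estimate(results, decay=103/104, constant=70, mean=0.5):
    """Regress an exponentially decayed record toward the mean.
    `results` holds 1s (wins) and 0s (losses), oldest game first."""
    weighted_wins = weighted_games = 0.0
    for score in results:
        # Each pass discounts everything older, so the newest game gets full weight.
        weighted_wins = weighted_wins * decay + score
        weighted_games = weighted_games * decay + 1
    return (weighted_wins + mean * constant) / (weighted_games + constant)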

General Principle No. 2

Elo weights past performance similarly to a regression estimate that applies an exponential decay factor to reduce the weight of each previous game.

Amount To Weight

Before I show the graph demonstrating this, there’s a somewhat technical mathematical wrinkle that needs to be ironed out. When we recalculate the regressed estimates of team strength by weighting the past results, it actually changes the amount we need to regress those results. If we reduce the combined weight of all previous games to 104 and then still try to add 70 games of league average to regress the results, we’ll end up over-regressing, because the number of games of regression has to be weighted, as well.

This is not quite as simple as it sounds, and it’s not entirely straightforward to calculate the revised regression constant. The most intuitive way to weight the regression constant would be simply to treat the 70 games of regression as a 70-game sample and apply the same exponential decay factor as you do to the observed results, which gives you a total weight of 51, but that is not correct either.

It turns out the correct amount of regression to use is roughly half the original regression constant (actually n/(2n-1), where n is the maximum effective weight of the sample based on the decay factor, which is close to 1/2 as long as n isn’t too small). In this case, that would be 35. There’s not enough space in this article to show why, and calculating the revised constant for a regression weighted for recency is important enough to warrant its own separate article, but I will give some evidence that it works later in the article.

Because we have to reduce the number of games worth of league average to use in the regressed estimate, each observed game now has a bigger impact. To match the impact of each game back up with what we get from Elo, we have to offset the loss of 35 games from the regression constant by adding another 35 games to the observed data. In other words, we need to increase our effective weight from 104 games to 139 games, which is equivalent to setting the decay factor to 138/139.
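In code, that just means calling the weighted_estimate sketch above with the revised parameters, and it’s easy to confirm the 139-game cap on the total weight:

```python
decay = 138 / 139
# The geometric series of weights converges to an effective sample of 139 games:
print(round(sum(decay ** i for i in range(100_000)), 1))  # 139.0

# Reusing weighted_estimate with the revised decay and regression constant:
results = [1] * 15 + [0] * 10   # e.g., a 15-10 start, oldest game first
print(weighted_estimate(results, decay=decay, constant=35))
```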

Once we do that, we get the following results for our weighted regression model:
[Figure: weighted regression estimate (decay 138/139, regression constant 35) compared to the Elo estimate]

This is nearly identical to the Elo ratings. It’s not mathematically equivalent, but the correlation between the two is ~0.98, so the weighted regression gives us a very good idea of how Elo behaves in the long term.

General Principle No. 3

Using a K factor of four and an un-weighted regression constant of 70, Elo weights past performance similarly to regression to the mean when you apply an exponential decay to past results using a decay factor of 138/139, which also reduces the regression constant to 35.

Equivalent Sample Sizes

Again, it might be tempting to conclude that, over the long term, Elo is analogous to a regressed estimate that is limited to using 139 games. However, this applies only to the effective weight of a weighted regression. As we saw earlier, this is not the same as taking 139 un-weighted games. There is one more step to calculating the equivalent sample size for an un-weighted regression.

Recall that the weighted regression uses a regression constant of 35 instead of 70. This means that, with an effective sample weight of 139 games, the weighted regression estimate of a team’s expected winning percentage is roughly 80 percent observed data and 20 percent regression. And the point where a normal, un-weighted regression is 80 percent observed data and 20 percent regression is not at 139 games, but 278 (which is twice 139, since the regression constant for an un-weighted regression is approximately twice the weighted regression constant).

This is the point where Elo hits its wall: Over the long term, Elo maxes out its ability to pinpoint a team’s talent level and continues to behave like a regression estimate that is stuck using only 278 games of data.

To justify this statement, I ran another simulation, this time using 10,000 different teams randomly sampled from a major league-like distribution, each playing 1,000 games against a league-average opponent. I again calculated the expected winning percentage for each team using Elo, regression to the mean, and weighted regression to the mean, and then calculated the RMSE for each method after each game:
[Figure: RMSE after each game for Elo, regression to the mean, and the weighted regressions]
The weighted regression with an effective weight of 139 games is roughly as accurate as Elo throughout the sim, and both max out at about the same accuracy as a normal regressed estimate at 278 games, which confirms what we’d expect from our previous calculations. (I’ve also included a weighted regression estimate using a decay rate of 103/104 and a regression constant of 70, confirming that this is indeed a worse fit for Elo than the revised weighting.)

This gives us a framework to approach the Elo-regression comparison by describing Elo in terms of the math of regression. More specifically, it tells us how Elo weights and discards information as our data updates. This is important because it gives us some insight into what each model’s strengths and weaknesses are.

General Principle No. 4

Over the long term, regression to the mean continues to grow more precise the more games you observe, but Elo doesn’t. Using a K factor of four and a regression constant of 70, Elo’s precision peaks and levels out at approximately the same level of precision that regression has after 278 games.

Changes In Talent Levels

The above graphs seem to indicate that regression is much better at estimating a team’s chances of winning, but it’s important to note that these are based on examples where we assume each team has a constant talent level. If teams never get any better or worse, then it would definitely be better not to discard any information from past results. Players and teams obviously do evolve, though, so giving more weight to recent performance makes sense.

Most projection systems weight performance for recency for this reason. Marcel, for example, gives the past three seasons weights of 100 percent, 80 percent, and 60 percent to account for possible changes in talent level. This applies the weights on a seasonal level for simplicity, but it’s really just a shortcut for applying a daily decay similar to what Elo does.

So the way Elo weights past performance is sensible. The main advantage regression has over Elo with regard to weighting isn’t that weighting for recency is less accurate, but that it is much simpler to control the amount of weighting with a regression model than with Elo.

When we wanted to go from a total effective weight of 104 games to 139 games, all we had to do was change the decay factor from 103/104 to 138/139. If weighting is not a serious concern, you have the option of not weighting or weighting by season totals to greatly simplify the necessary calculations and data collection. You can easily test different amounts of weighting and fine tune the decay factor to best fit the results.

With Elo, the weighting is controlled by the K factor, and the way the K factor affects weighting is more opaque and difficult to adjust. Additionally, the K factor also controls how quickly the ratings respond to an initial burst of new data, so finding the best fit for the K factor can be a balance between responding to gradual talent changes and getting the rating to the right level in the first place, and thus may not always give the ideal amount of weighting.

A weighted regression also has the advantage that you can apply weights more dynamically. Elo’s weighting considers the recency of data based on how many games ago it occurred. Weighted regression allows you to define recency in other ways, such as the number of days ago, as Sal Baxamusa’s article did. This is especially relevant if you include multiple years of data into your ratings. Elo will treat 50 games ago as 50 games ago, whether that goes back to the previous year or not. Weighted regression can distinguish between those cases if you apply weights based on the date and not the number of games.

One attempt to address this shortcoming of Elo is the Glicko rating system, which is a modified rating system that introduces a variable parameter, RD, that adjusts the volatility of the ratings based on the amount and the recency of your observed data. Glicko versus Elo is its own heavily-explored discussion in the gaming world, but for our purposes, I’m simply mentioning it to show that, just as weighted regression provides an alternative to how regression to the mean approaches weighting for recency, Glicko exists as an alternative to do the same for Elo ratings.

General Principle No. 5

Weighting for recency is important for reacting to changes in talent level. Elo does this inherently, while regression to the mean requires past observations to be manually weighted. Both regression and Elo models can be modified to account for weighting properly, but it generally is simpler to adjust the weighting of a regression-based model once you have already set up the weights.

Regression and Bias

So far, we’ve established that Elo behaves like a weighted regression over the long term and that the two models are similarly accurate even over the short term. However, there are some key differences. If we return to the graph for the 10,000-game sim, we can see that the weighted regression estimate is consistently lower than the Elo estimate by about .017 points.

This number is important. In our example, we have an average observed winning percentage of .586, which gets 139 games worth of weight, and we add 35 games worth of .500 to regress toward the mean. This gives us a final talent estimate of .569, which is .017 lower than the observed value. In other words, that amount represents the regression toward the mean in our example.

This means that Elo is giving us an un-regressed estimate. As a result, over the long term, Elo will tend to give too wide a range of projections and will tend to overrate teams that are above .500 and underrate those that are below .500.

For example, in our 10,000-team sim, if we take every team Elo projected as better than .500 at the end of the 1,000-game sample, we get an average projection of .553. For teams projected below .500, the average projection was .448. The actual talent level of those groups was .543 and .456, respectively, meaning Elo gives us estimates that are biased away from .500.

The regression model, on the other hand, projected .547/.454 for teams above and below .500, compared to actual talent levels of .547 and .454 (regression identifies a slightly wider range of talent because it is using more information, as shown above). Likewise, the weighted regression projects .543/.457, compared to actual talent of .542/.457.

This is a problem in the long term, but there is another, conflicting issue in the short term. Recall that Elo gives the same impact in the ratings to games at the start of the season as to games at the end of the season, meaning it can’t quickly home in on a team’s talent level and then slow down like regression. Because of this, Elo takes longer to move away from .500 at the start of the season and is effectively over-regressed. This problem persists even after 162 games, so Elo moves too slowly to properly identify the spread in talent among teams over the course of a single season.

The following graph shows the spread in team projections for the first 162 games from the 10,000-team sim, with the actual talent distribution represented by the grey bar at the left, and the 97.5th percentile marked for Elo, regression, and actual talent.
[Figure: distribution of team projections through 162 games for Elo, regression, and actual talent, with 97.5th percentiles marked]
You can see Elo’s estimates are clearly concentrated toward the center of the Master Spark figure, while the red fringe surrounding it shows what a properly regressed range of estimates should look like. And if we go back to check the sim after 162 games, the average Elo projections for teams projected above and below .500 are .535/.465, compared to actual talent levels of .540/.460, demonstrating a bias toward .500.

The weighted regression model actually has the opposite short-term problem—it is under-regressed and gives too wide a spread of estimates in the early run of information:

[Figure: spread of weighted regression estimates early in the sample, wider than the actual talent distribution]

However, this problem is fixable for weighted regression. In this example, we are using a simplified model that uses a static regression constant of 35 because that behaves more similarly to Elo, but a proper weighted regression would start off with a regression constant of 70, equal to the un-weighted regression constant, and would gradually approach 35 as the data accumulates. This is because when you only have one game, the weighting of past games is irrelevant, so the model should work identically to an un-weighted regression. Then, as the number of games accumulates and the weighting becomes more of a factor, the regression constant approaches its long-term value of 35.

So this bias in our weighted regression model is only because we are setting it up to emulate Elo, and in doing so we introduce a bias that mirrors that of Elo, albeit in the opposite direction. Done properly, a weighted regression will regress properly throughout and will start off behaving more like an un-weighted regression before gradually shifting toward behaving like Elo.

The slowness of Elo to adjust to the initial burst of data is a well-known shortcoming of the system and one of the bigger challenges of implementing the ratings. As a result, most people who use Elo make some kind of adjustment to how the ratings begin, such as the use of provisional ratings, a higher starting K factor, or starting values determined by projections and ratings from the previous season so that teams don’t start the year rated as .500 teams.

As a result, the actual ratings you get from Elo are dependent on the initial assumptions you make. For example, say a team wins 95 games. If you start every team with the same rating, at the end of the season that team will end up rated somewhere in the neighborhood of an 89- or 90-win team (the exact rating will depend on the order of those wins since Elo weights for recency). If we give the team an initial rating of a 90-win team, however, then its final rating jumps by about 3.5 wins.

You can also apply these assumptions to a regression model to give a better starting point for your estimates, but it’s not required to offset a bias in the ratings since regression is already designed to adjust its weighting to account for how much data you have.

General Principle No. 6

Elo does not regress its estimate toward the mean, and as a result gives an estimate that is biased toward the extremes in the long term. During the initial burst of data, however, Elo moves too slowly and is over-regressed, meaning it is biased toward .500. This bias is strong enough that Elo can require initial assumptions about each team’s talent at the start of the season.

Strength Of Schedule

Most of what we’ve gone through above leans toward regression being, if not better, then at least easier to use than Elo. However, Elo has a very important advantage to balance out these concerns: It has a built-in strength-of-schedule adjustment. In fact, the entire calculation of Elo ratings is based on how strong your opponent is, which makes sense given that it was developed as a universal rating for chess players, where a game against an amateur at your local club has to factor into the same rating as one against a Grandmaster.

In major league baseball, teams have similar enough schedules, and the spread in expected winning percentage between teams is small enough, that you mostly can get away with not using a strength-of-schedule adjustment, just like you can mostly get away with using aggregated season stats rather than weighting them on a day-by-day basis. Nonetheless, it is still better to use the adjustment than not, all else being equal.

This is especially important in that it makes it easier to incorporate playoff games into the ratings. A team that wins half its games in the postseason generally has performed better over those games than a league-average team, but without adjusting for the fact that it is exclusively playing strong opponents, you can’t really tell that. It just looks like the team went .500 in those games.

Because of this, it is awkward to incorporate playoff results into a regression-based talent estimate unless you put in the extra work to add that adjustment. Elo naturally incorporates these games just as easily as any other.

General Principle No. 7

Elo’s main advantage over regression to the mean is that Elo inherently accounts for strength of schedule. This advantage is most apparent in the playoffs, where teams play exclusively against other strong teams.

Conclusions

Elo and regression to the mean both aim to estimate a team’s talent level, and the two behave pretty similarly in the long term if you weight the observations in the regression model for recency. They take different mathematical approaches, however, which give each method its own strengths and weaknesses.

Regression is designed to properly weight observed data and give an unbiased estimate of talent based on that data. Because it adjusts itself to the amount of available data, it can be extremely simple to calculate and take advantage of limited summary data, or it can be adjusted to take advantage of more detailed data. These adjustments are more straightforward than for Elo because they are the basis of the math behind regression, whereas adjusting the weighting or accounting for biases can be much more complicated for Elo since those factors are incidental to its calculations.

Elo is designed to properly account for the strength of your opponent and to constantly react to new information. This gives it an inherent weighting for recency, though that weighting can be unintuitive and difficult to control if you want to fine-tune it, and it can be a challenge to balance the need to react to new information with the ability to take advantage of a growing volume of data. The basic Elo formula leads to biases that need to be accounted for, which can make it less user-friendly than regression, but the strength-of-schedule adjustment is a powerful advantage that automatically addresses one of the more difficult challenges of regression.

Adam Dorhauer grew up a third-generation Cardinals fan in Missouri, and now lives in Ohio. His writing on baseball focuses on the history of the game, as well as statistical concepts as they apply to baseball. Visit his website, 3-D Baseball.
Comments
Michael Bacon
7 years ago

I played in my first rated United States Chess Federation Chess tournament at 20 years of age in 1970 after my Baseball career, such as it was, ended. Arpad Elo’s system changed Chess. When two players meet the first question is invariably, “What’s your rating?”
I found this article fascinating. Thank you!

J. Cross
7 years ago

Wow. Great stuff! It’d be interesting to see simulations with changing true talent and to try to get a sense of what rate of change in true talent Elo assumes.

MGL
7 years ago

Fantastic article! One of the best I have ever read.

I second Jared above. Would like to see more work on changing talent. Is there any way to estimate change in talent from game to game from the empirical data?

Would also like to see how to incorporate SOS in a projection model. Do we simply change the w/l record of a team to calculate an “effective w/l record” (by somehow backing that out of a log5 method)? Do we do that on a game by game basis similar to what Elo does (I’m talking about in a “regression to the mean” model)? IOW, if a team wins game N, rather than assign it 1 win, we assign it .93 wins or 1.02 wins or whatever, depending on the estimated true w/l record of the opposing team? To do that, do we just use log5 using estimated true WP of both teams and then “team estimated WP minus expected w/l” for the “effective” w/l for each game?

So say expected WP based on log5 is .55. Our team has estimated WP of .52. So our opponent is worse than a .500 team. The residual w/l is .03. So if we win, we only get credit for .97 wins. If we lose, we get credit for -.03 wins.

One reason this article is so great is that it is written and explained so well. I love the “general principle” sections and I love the simulations to test the various precepts.

BigChief
7 years ago

As someone who really struggles with technical writing, I have a great appreciation for good technical writing and this article has me blown away. I honestly can’t even comprehend how you were able to explain such a math heavy subject in such detail while also making it concise and easy to read.