It turns out the correct amount of regression to use is roughly half the original regression constant (actually n/(2n-1), where n is the maximum effective weight of the sample based on the decay factor, which is close to 1/2 as long as n isn’t too small).
There wasn’t space in the article to lay out the math behind that statement, but I went through it in a comment on TangoTiger’s blog for those interested.
The n/(2n-1) formula works well for creating a regression model that emulates Elo, which was the purpose of the previous article, but for practical purposes, the impact of weighting past results on the mathematics of regression to the mean is a bit more complicated than that.
There are two important caveats that are critical to understanding why the n/(2n-1) formula isn’t necessarily what we want to use in practice.
First, n/(2n-1) is a long-term limit. Applying an exponential decay to past results places a cap on the total effective sample size. No matter how much data you have, the total weight will never go above a certain point (because this type of weighting creates a geometric series).
As the effective sample size approaches the limit n, the regression constant approaches the limit C * n/(2n-1), where C is the original regression constant. With the sample sizes we typically work with for projecting baseball teams or players, though, the regression constant generally doesn’t have enough time to reach the long-term limit.
Second, the math behind that formula relies on the simplified assumption that talent levels are constant for each team or player in the sample. In reality, the need to weight past results implies the underlying talent levels are changing, which can have additional implications on the regression constant that make this a lot messier. This wasn’t an issue with trying to emulate Elo, which gives an un-regressed estimate, but for creating an actual regression-based model, it will be.
This second point is especially important. In the math linked in the comment on Tango’s blog, the takeaway is that the regression constant is linked closely with the amount of variance in the sample. The reason the regression constant changes when you weight for recency is that doing so distorts the amount of variance.
This is important because the complications involved with changing talent levels also center around distorting the amount of variance in the sample. This means that, while we could adjust the n/(2n-1) formula to work for any sample length and not just as a long-term limit, it would still have problems due to the distortions introduced by changing talent levels.
General Principle No. 1
The regression constant is closely related to the amount of variance in your sample. Factors that can affect the variance, such as weighting past results or changes in talent levels over time, can produce corresponding changes in the regression constant.
Estimating the Regression Constant
The basic method to calculate the regression constant is to split your sample into two halves and measure the correlation between them. If you’ve ever seen an article on how long it takes particular stats to “stabilize” (such as Russell Carleton’s or Derek Carty’s) the author was likely doing something along these lines.
There are a few other ways to do this besides just splitting the sample in half and running a correlation (such as intraclass correlation, Kuder-Richardson-20/KR-21 and Cronbach’s alpha*), but the general idea is the same: You use either correlation or variance to calculate a reliability score for your sample, and the reliability tells you how much to regress. In general, the reliability goes up as your sample size increases, and the regression constant represents the point at which the reliability of the sample reaches 0.5.
*I believe Carleton’s second article underreports the plate appearance thresholds by a factor of 1/2. It appears from the article that he interprets the reliability of a 500 PA sample as the split-half correlation with 250 PAs in each half, but the formulas he is using—KR-21 and Cronbach’s alpha—both give reliability for the full sample length, not half. If that is the case, it would help explain a noted discrepancy between Carleton’s and Carty’s numbers.
If terms of correlation, the reliability score (for half the sample length) is the correlation between the two halves of your sample, and the reliability will be 0.5 when the number of observations in each half equals the regression constant. In terms of variance, it is the proportion of total variance in the sample that comes from variance in talent rather than random variance, and the reliability will be 0.5 when an equal amount of variance comes from both talent and random variation.
The fact that we can use correlation or variance interchangeably is helpful because one framework might be better suited to a particular discussion. In our case, focusing on the variance of talent in the population will help us understand how the distortions in variance caused by changing talent levels can impact the regression constant.
General Principle No. 2
The regression constant represents the number of observations you need for the reliability of your sample to reach 0.5. The reliability of the sample can be expressed (and calculated) in terms of either correlation or variance, and the two approaches can be interchanged depending on which is more convenient.
Variance in the Sample
The relationship between variance and regression is based on a simple concept: The wider the distribution of talent across the league, the more players you have whose actual talent level is a given distance from average. And the more players whose true talent is a given distance from average, the more likely an observed performance at that level is due to talent rather than random variation.
As a result, a wider spread in talent translates to a lower regression constant. Mathematically speaking, this increases the proportion of observed variance that comes from talent as opposed to random variation, thus increasing the reliability of the sample. And of course the reverse is also true; a narrower spread in talent will lead to more regression.
When talent is constant for each player throughout your sample, the variance in talent won’t ever change. What about when talent levels start changing from day to day, though?
To start, let’s look at a simple hypothetical league of 11 hitters whose true talent in on-base percentage is spread evenly from .300 to .350. This is obviously an unrealistic distribution for major league baseball, but keeping things simple will help us see what happens when we introduce random changes in talent.
Now, let’s say that each hitter’s talent level changes randomly each day, and that those changes follow a normal distribution with a standard deviation of .001. This doesn’t change much on a day-to-day level, but what happens as these small changes start to add up?
To show this, I’ve simulated 100 days for our hypothetical league:
|Player||Original||After 1 day||After 100 days|
The small changes become more substantial over time, but more importantly, something happens to the spread in talent across the sample. The variance of true talent started at 0.000275, but after 100 days of random changes in talent, that jumps to 0.000470.
Of course, with an 11-player sample, that jump could just be random chance, but even if we repeat the simulation with more players, we see the same thing. A jump to 0.000470 is a bit higher than expected (the expected variance in talent after 100 days under these conditions would be around 0.000375), but the fact that it increases remains.
This is because the original talent levels and the random changes in talent are uncorrelated. When you combine two uncorrelated variables, their variances also combine. In other words, we would expect the spread in talent after 100 days to approximately equal the variance in original talent (0.000275) plus the variance of 100 days worth of random talent changes (100 * 0.001^2 = 0.0001).
This is a problem. The implication is that, if changes in talent are uncorrelated with talent levels, the spread in talent across the population should be constantly increasing. Which, if you think about it, makes sense. If the players at the extreme ends of the talent spectrum are just as likely to get even better or even worse as they are to move toward the center, then those extreme ends will drift further and further apart as half the top players keep getting better and half the bottom players keep getting worse.
But this doesn’t match what we see. The spread in talent in the majors doesn’t appear to be constantly increasing like we’d expect from this model.
Our model, then, needs an update. We can correct for the constantly increasing spread in talent by constantly resetting the variance back to its original level. This is done by moving each player’s new talent level slightly toward the center of the distribution after each talent change.
This has the effect of creating a slight negative correlation between a player’s talent level and subsequent changes in talent—that is, players at the top end of the talent spectrum are more likely to fall back toward the center than to continue moving further away—which cancels out the effect of the additional variance.
|Player||Original||After 100 days||Variance Corrected|
This sounds similar to regression to the mean itself, and in fact, the calculations to offset the growth in talent variance are done the same way as the calculations used in regressing observed results. However, regression to the mean is a purely statistical phenomenon that occurs even without changes in talent. It tells us that a player’s true talent level tends to be closer to the mean than his observed performance, not that his talent level itself is changing in any way.
What this suggests is that, given that talent levels are changing from day to day, the underlying talent levels themselves will also tend to drift toward the center going forward.
General Principle No. 3
When there are changes in talent over time, one of two things should happen: Either the spread in talent across baseball will increase steadily over time, or individual player talent levels will tend to drift back toward the center. If there is no evidence of the former occurring, then we can assume the latter must be true.
How Talent Changes Affect Variance
When we look at observed performance, there is a wider spread in performance over the short term than the long term. For example, a month into this season, there were a handful of hitters with wOBAs over .500, and another handful of everyday starters with wOBAs under .200. By the end of the season, those extremes will have narrowed considerably as the players on the extreme ends of the spectrum drift toward the center.
If true talent also tends to drift toward the mean, then we would expect to see something similar happen with the underlying talent levels. And if we take the average talent level for each player in our example over the entire 100 days rather than just looking at talent one day at a time, we do indeed see the spread narrow:
|Player||Original Talent||Average Talent|
The variance in talent among players stays at .000275 for any particular day, but when we combine the 100 days and take each player’s average talent level over that span, the variance shrinks. Even though the variance in talent across the league remains the same from one day to the next, the players who are near the top or bottom of the league in true talent on any given day are more likely to be less extreme throughout the rest of the sample, on average.
As these players drift back toward the center and new players replace them at the extreme ends of the spectrum, nobody averages out as wide as the day-to-day extremes over the whole sample. The longer you observe the sample and the more time you give players’ true talent levels to change, the smaller the spread gets between them.
This reaches the crux of how changing talent levels impact regression. Decreased talent variance means lower reliability, and lower reliability means more regression.
In our example, the regression constant for our 11-player population would be approximately 800 PA. (The random binomial variance after 800 PA is ~.000275, which means that’s the point where the total observed variance is half random and half true talent variance.) But that is only if we are projecting talent for the immediate future. If we are instead projecting talent over the next 100 days, where we expect the variance in talent to tighten, suddenly we have to add ~850 PA of regression (the point where random binomial variance equals .000257).
The regression constant is generally treated as the same value whether you have one plate appearance or a thousand (hence the term “constant”), which works well when talent levels never change. When talent levels are changing from day to day, though, that no longer holds.
General Principle No. 4
The regression constant is generally treated as a constant value regardless of how much observed data you have, but changes in talent over time can cause the regression constant to increase as the variance in talent across the population shrinks.
Calculating Changes in Variance Over Time
We know that when talent levels change from day to day, the spread in talent across the league can shrink as your sample length grows. But how exactly is that relationship defined? How do we know how much the variance in talent shrinks?
It depends on two factors: the length of time covered by your sample, and how much talent changes from day to day (as measured by the correlation between talent levels from one day to the next). The longer your sample length, or the more talent changes from day to day, the more the spread in talent will narrow.
Once we know these two factors, we can calculate how much the spread in talent diminishes using the following formula:
Again, there isn’t space in this article to cover why this formula works, but I’ve written a supplement going through the math for those interested. If you prefer, you can also download the mathematical explanation in PDF form along with some R code to test the formula against simulated data.
In our 11-player example, the day-to-day correlation in talent is about 0.998. If we plug d=100 and r=0.998 into the above formula, we get that the variance in talent after 100 days should be about 94 percent of the original variance in talent. That comes out to about 0.000259, or right around what we got in our sim.
General Principle No. 5
The amount that variance in talent is reduced over time depends on two factors: the length of time covered by your sample and the correlation between talent levels from one day to the next.
Estimating the Day-to-Day Correlation
This formula allows us to calculate the expected variance in talent at any sample length, which in turn lets us adjust the regression constant to any sample length. If we have data covering a full season but want to project talent levels for the immediate future, we can convert the variance in talent over 180 days to the variance in talent over one day and then recalculate the regression constant based on the new variance.
In order to use the formula, though, we need to know the day-to-day correlation for talent. In practice, this value can be tricky to estimate since we don’t know how much talent is changing from day to day, but there are a couple ways we can do it.
One is to take samples of varying lengths and see how much the variance in talent (which we can estimate using the same reliability scores used to estimate the regression constant) diminishes over time.
As an example, I took the on-base percentage for all player-months from 2000-2016 where a hitter had at least 90 PA, and from that I get that the variance in OBP talent across the population is 0.000962. Then, I took all player-seasons over the same span where a hitter had at least 500 PA, and from that I get an OBP talent variance of 0.000910.
Going from d=30 to d=180 (the approximate number of days in a month and a season) lowers the variance in talent by a factor of 94.54 percent. That corresponds to a day-to-day correlation of 0.9989, where the variance in talent for one day is 0.000974, the talent variance over 30 days is 98.86 percent of the one-day value, and the talent variance over 180 days is 93.44 percent of the one-day value.
(This takes some playing around with the formula—it’s kind of messy to solve for two values of d, but you can set up the formula in Excel, R, etc. to quickly guess-and-check values and home in on the solution.)
Another way to estimate the day-to-day correlation is to measure the correlation between samples at varying intervals.
For example, if I take the monthly OBP data and find players with consecutive qualifying months, I get a correlation of 0.2506 for OBP one month to the next. If I then look at the same players two months apart, the correlation drops to 0.2432. Adding an extra 30-day gap between the samples lowered the correlation by a factor of 0.9704, which is equivalent to lowering it by a factor of 0.9704^(1/30) = 0.9990 each day. This value tells us how much the underlying talent levels correlate from one day to the next.
Using our two methods, we get r=0.9989 and r=0.9990. I should warn you, it is pretty lucky that our two methods yielded such similar values. You might occasionally get an estimate of 0.9950 using one method or set of data and r>1 using another (r>1 is clearly wrong, but sometimes you will run across something like the correlation actually going up when you go from one month apart to two months apart due to randomness in the data).
This is largely because we are dealing with ratios of very small numbers, and a small amount of randomness in the measurements can have an amplified effect. You can mitigate the randomness somewhat by looking at the data in as many ways as possible, such as comparing the correlation at several different intervals or looking at the variance at several different sample lengths, but you’ll inevitably have to deal with a degree of uncertainty in these estimates.
General Principle No. 6
The correlation of talent from one day to the next is key to calculating how much the spread in talent narrows over time, and thus to calculating a regression constant for varying sample lengths. This value can be tricky to estimate precisely, but we can get a general idea by evaluating how the variance or correlation of observed data changes over time. For many baseball stats, something on the order of 0.999 should be relatively close to the right value.
The regression constant is closely related to the amount of variance in your sample—the greater portion of that variance that comes from the spread in talent, the lower the regression constant, and vice versa.
We saw in the Elo vs. Regression article that weighting past results lowers the regression constant. This is because weighting past results lowers the random variation in the data, which in turn increases the proportion of overall variance that comes from talent.
However, weighting past data also implies that the underlying talent levels are changing over time (or, alternatively, that you are weighting for no reason). This also affects the variance, but in the opposite direction of weighting. That is, while weighting past data increases the proportion of variance attributable to talent, the changes in talent implied by weighting decrease that proportion.
This dampens the effect of weighting on the regression constant. The interaction of these two effects is complex and is not covered in this article, but the impact of variance discussed here is important to understanding that interaction.
That means this is still an intermediate step in understanding how changes in talent impact the math of regression to the mean. Before we can tackle the full effect, we need to understand why there is an effect and what factors contribute to this effect.
The most important factor in these calculations is the correlation of talent from one day to the next. This allows us to translate the day-to-day changes in talent into a quantifiable impact on variance, which in turn lets us measure the impact on the regression constant.
References & Resources
- Adam Dorhauer, 3-D Baseball, “Math Behind Regression Talent Changes”
- Github, “Math Behind Regression with Changing Talent Levels”
- Adam Dorhauer, The Hardball Times, “Elo vs Regression to the Mean: A Theoretical Comparison”
- Russell Carleton, FanGraphs, “525,600 Minutes: How Do You Measure a Player in a Year?”
- Tom Tango, Tangotiger Blog, “Regression v. Elo” comments, particularly those by Jared Cross
- Wikipedia, “Geometric Series”
- Derek Carty, Baseball Prospectus, “Resident Fantasy Genius: When Hitters’ Stats Stabilize”
- Harry Pavlidis, The Hardball Times, “It makes sense to me, I must regress”
- Russell Carleton, Baseball Prospectus, “Baseball Therapy: It’s a Small Sample Size After All”
- Tom Tango, The Book Blog, “Updated r=.50 numbers”