If our predictions are good, most of the time we really don’t need to worry about updating them in season. If we predict a guy to put up a .350 OBP, and his OBP at the midseason point is .358, there’s not a lot of reason to wonder what he’s going to do the rest of the year.
But if a batter’s performance is wildly divergent from his projected performance, then the question is raised: What is he going to do the rest of the year? So how should we update his projection?
The right way
The correct answer is to just do the projection again, basically the same way you did it before, but factoring in age a little differently—you don’t want to age current-year results, although you may still want to age past seasons in your projection.
This, for any number of reasons, can be inconvenient; maybe it’s too time-consuming to update a projection every day. Or perhaps you don’t have the means to update the projection—most of us don’t have our own projection system at all, much less a system of major league equivalencies for minor league production.
So what are we to do?
The binomial way
In those articles, I mention that you can use regression to the mean to estimate a player’s true talent level, by taking an average of the player’s stats and the league mean, weighted by their respective uncertainties (or standard deviations). The example used was a hypothetical Albert Pujols who put up an OBP of .320 in his first 100 PAs. From a comment I made on the first article:
So to estimate uncertainty in Pujols’ OBP … we take .47/SQRT(5900) = 0.006 for our estimate of uncertainty in Pujols’ OBP. (As for where these numbers are coming from, check the appendix to The Book.)
We also need to estimate the uncertainty in our league average number: for right now, we’ll use .025. (Again, that’s coming out of the appendix to The Book.)
So then we take:
(.424/.006^2 + .356/.025^2) / (1/.006^2 + 1/.025^2) = .420
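This precision-weighted average is easy to sketch in Python (the function name `regress_to_mean` is my own; the inputs are the numbers from the example above):

```python
def regress_to_mean(obs, obs_sd, prior, prior_sd):
    """Average two values, weighting each by the inverse of its variance."""
    w_obs = 1 / obs_sd ** 2
    w_prior = 1 / prior_sd ** 2
    return (obs * w_obs + prior * w_prior) / (w_obs + w_prior)

# Pujols example: career OBP .424 (uncertainty .006), league .356 (uncertainty .025)
estimate = regress_to_mean(0.424, 0.006, 0.356, 0.025)
print(f"{estimate:.3f}")  # 0.420
```

Note that the value with the smaller uncertainty (here, Pujols’ large career sample) dominates the average, which is exactly the behavior we want from regression to the mean.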
Or, we can do this slightly differently, regressing the year-to-date stats and the career stats separately, as well as regressing to the population mean.
The result: .420.
Basically, we don’t need to combine the current year stats and the past year stats when we regress to the mean; we can include them both separately.
We can use this to update a projection, since these weighted averages can be chained together. In other words, our projection should already be regressed to the mean, and therefore regressing current-year stats toward the projected stats should give us results reasonably close to what we would get if we redid our projection with current-year stats included.
All we need is an estimate of the uncertainty of a projection, and we should be set.
Estimating uncertainty of a projection
This analysis will seem familiar to anyone who’s read Nate Silver’s article on Chipper Jones, where he uses the binomial method discussed above to project the odds of Chipper batting .400. It’s a brilliant piece of analysis that happens to be almost completely wrong.
The problem is that we can mean one of several different things when we refer to the uncertainty of a forecast. Remember:
Observed Performance = True Talent + Random Error + Bias
And that random error decreases as number of PAs increases.
So when we refer to the uncertainty of a forecast, it can take on two aspects: uncertainty of our estimate of true talent, and uncertainty in the amount of random error. When we update a projection in this fashion, we want only the uncertainty of our estimate of true talent. Silver used observed uncertainty of similar players, which captures the random error component.
The uncertainty of our estimate of true talent should be affected principally by only two factors:
- The number of observations
- The average uncertainty of the sample you are regressing toward
The latter we have a pretty good handle on. The former is a bit tricky, since we don’t know what’s going on inside of most projection systems. The one projection system whose inner workings we do know is Marcel, so we’ll use that as a proxy for all projection systems. Most projection systems (yes, even PECOTA) use a weighted average of past performance as the starting point for their forecast.
Let’s use Joe Mauer as an example. His PAs in past seasons:
- 2008 – 633
- 2007 – 471
- 2006 – 608
So we can estimate the number of PAs used in the forecast if we use Marcel’s weighting of 5/4/3, which we can reduce down to 1/.8/.6:
633 + .8 * 471 + .6 * 608 = 1374.6
Now, we can use that set of weighted PAs to estimate the uncertainty of our forecast, like so:
.44/SQRT(1374.6) = 0.012
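The Marcel-style weighting and the resulting uncertainty estimate can be sketched as follows (the .44 constant is the one used in the formula above):

```python
import math

def weighted_pa(pas, weights=(1.0, 0.8, 0.6)):
    """Discount past seasons by Marcel's 5/4/3 weights, scaled to 1/.8/.6.
    `pas` is most recent season first."""
    return sum(w * pa for w, pa in zip(weights, pas))

def obp_uncertainty(pa, k=0.44):
    """Binomial-style uncertainty for an OBP observed over `pa` plate appearances."""
    return k / math.sqrt(pa)

mauer_pa = weighted_pa([633, 471, 608])  # 2008, 2007, 2006
print(round(mauer_pa, 1))                # 1374.6
print(f"{obp_uncertainty(mauer_pa):.3f}")  # 0.012
```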
Then we must regress to the mean. There are forecasting systems out there that don’t regress to the mean (some fantasy publications will include these sorts of forecasts), and they really aren’t worth paying attention to. What we do is combine the estimate of uncertainty in Mauer’s forecast and our average uncertainty of the population (which is .025) and we come up with 0.015.
So, for any projection of Mauer, we can use an uncertainty of .015 as our best guess. The more Marcels-like the forecast is, the more accurate that will be, but it’ll do in a pinch.
Putting it all together
Let’s say that Mauer’s preseason OBP forecast was .406. (That’s based on the Fantasy 411 averaging of several projection systems.) So far this season, he’s put up a .437 OBP in 382 PAs, with an uncertainty of .023.
So, to estimate Mauer’s current true-talent OBP, we get:
(.437/.023^2 + .406/.015^2)/(1/.023^2 + 1/.015^2) = .415
That’s a pretty healthy boost for Mauer. (Our new estimate of uncertainty is .013, if you were wondering.)
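The whole update can be put together in a short Python sketch: the same precision-weighted average as before, plus the new uncertainty that falls out of the combined weights:

```python
import math

def update_projection(season_obp, season_sd, proj_obp, proj_sd):
    """Blend in-season results with the preseason projection, weighting each
    by its inverse variance; also return the updated uncertainty."""
    w_season = 1 / season_sd ** 2
    w_proj = 1 / proj_sd ** 2
    estimate = (season_obp * w_season + proj_obp * w_proj) / (w_season + w_proj)
    new_sd = math.sqrt(1 / (w_season + w_proj))
    return estimate, new_sd

# Mauer: .437 OBP in-season (uncertainty .023), .406 projection (uncertainty .015)
est, sd = update_projection(0.437, 0.023, 0.406, 0.015)
print(f"{est:.3f} {sd:.3f}")  # 0.415 0.013
```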
What about minor leaguers?
When we have a projection based on major league equivalencies of minor league production, we have added a source of error that we need to account for. Looking at the Davenport Translations for minor league players in 2008, and comparing them to the actual OBP of those translated players in the majors (after regressing both OBPs to the major league mean), we find a Root Mean Square Error of .018. What’s cool here is that RMSE is calculated in the same fashion as standard deviation, and so for an unbiased estimator, the RMSE should equal the standard deviation.
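The RMSE comparison described above can be sketched like so (the sample OBPs here are made up for illustration, not actual Davenport Translation data):

```python
import math

def rmse(predicted, actual):
    """Root mean square error: computed the same way as a standard
    deviation, but around the predictions rather than the mean."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical translated vs. actual OBPs, after regressing both to the mean
translated = [0.340, 0.355, 0.320]
actual = [0.322, 0.373, 0.320]
print(round(rmse(translated, actual), 3))  # 0.015
```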
So how do we figure out the uncertainty for a minor league player? Remember that the standard deviation is the square root of the variance, and variances add.
For instance, at Triple-A Iowa last year, Matt Murton had a translated .342 OBP in 222 PAs. What’s our uncertainty? First we figure out the uncertainty for his observed OBP:
.44/SQRT(222) = .030
Then, to figure out our total uncertainty (including the uncertainty of our translation):
SQRT(.030^2+.018^2) = .035
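In code, adding the translation error in quadrature looks like this (.018 is the MLE RMSE from above, .44 the same constant as before):

```python
import math

def minor_league_uncertainty(pa, mle_rmse=0.018, k=0.44):
    """Sampling uncertainty from the PA count, plus translation
    uncertainty, combined by adding variances."""
    sampling_sd = k / math.sqrt(pa)
    return math.sqrt(sampling_sd ** 2 + mle_rmse ** 2)

# Matt Murton: .342 translated OBP in 222 PA at Triple-A Iowa
print(f"{minor_league_uncertainty(222):.3f}")  # 0.035
```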
So we should regress minor league stats further to the mean (leaving aside for the moment the question of which mean to regress to). This applies both to our estimate of forecast uncertainty and current season uncertainty. So you need to figure uncertainty of major and minor league PAs individually, add the MLE uncertainty to the minor league PAs, and then use both (along with the average uncertainty) to figure out the total uncertainty.
References & Resources
Compare the results to the updated ZiPS and PECOTA forecasts. In the case of Mauer, they seem to agree pretty well with the binomial method I talked about. I am not sure this holds as a general rule, and this requires some more study on my part.
Also, it should be noted that this model assumes the underlying projection is handling the regression properly. To the extent it isn’t, this won’t work.
The formula for averaging uncertainties is:
SQRT(1/(1/A^2 + 1/B^2))

In short, you’re summing the reciprocals of the variances, and taking the square root of the reciprocal of that sum. You can, as noted for minor leaguers, include more than two uncertainties in the formula.
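The formula generalizes to any number of uncertainties; a minimal sketch:

```python
import math

def combine_uncertainties(*sds):
    """Square root of the reciprocal of the summed inverse variances."""
    return math.sqrt(1 / sum(1 / sd ** 2 for sd in sds))

# The Mauer update: .023 (in-season) and .015 (projection) combine to about .013
print(f"{combine_uncertainties(0.023, 0.015):.3f}")  # 0.013
```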