The 2014 season is in the books. The San Francisco Giants once again reign as the World Series Champions. Most baseball people are looking toward the offseason and 2015. Projections are key to our understanding of both the offseason and upcoming 2015 season. There are a lot of systems to choose from, and if you’re like me you have used all at one point or another, often interchangeably. We should, however, be sure we have a good understanding of each system and how they actually work and perform. I will look back at 2014 and evaluate which projection system can be crowned the 2014 champion. First, a little background on each of the projection systems I examined.
Most projection systems work largely the same way, with only minor variations. For instance, most use three to four seasons of data to calculate their forecast. However, there are nuances to each that make them unique. FanGraphs does an excellent job explaining the differences among the projection systems here. I will briefly summarize.
Created by Tom Tango, Marcel is the simplest among the projection systems. Marcel uses only major league data, giving heavier weights to more recent seasons. It takes into account age and regression towards the mean. Marcel does not project players with no major league experience. Marcel gives explicit instructions to assign league average projections to unprojected players. Because of this, Marcel projects far fewer players than the other systems.
Created by Dan Szymborski, ZiPS uses weighted averages of the previous seasons. It takes into account batting average on balls in play when regressing player performance. It adjusts for age by finding historical player comparisons.
Created by Brian Cartwright at The Hardball Times, Oliver also uses weighted averages to project players. Oliver differs in that it calculates major league equivalencies by taking in the raw numbers and adjusting based on park and league.
Property of Baseball Prospectus and developed by Nate Silver, PECOTA uses a system of historical player comparisons, rather than weighted averages, to calculate its projections.
Created by Jared Cross, Dash Davidson and Peter Rosenbloom, Steamer also uses a system of weighted averages. Steamer differs in that it weights different components differently and regresses some more heavily than others. Steamer does not explicitly take aging into account.
It is worth mentioning where these data came from. I downloaded the 2014 actual data via FanGraphs and removed all players who pitched that season (apologies to Adam Dunn). ZiPS and Oliver both came to me via Will Larson and the Baseball Projection Project. Due to rounding issues with third-party data sources, Jared Cross himself provided me with the Steamer projections. Marcel forecasts, no longer produced by Tom Tango, are unofficially maintained by David Pinto, who makes them publicly available. PECOTA was simply downloaded from Baseball Prospectus.
The first step was to find a common metric to look at. Given the common statistics projected by all systems considered, wOBA seemed like the obvious choice. Some projections were kind enough to include sac flies, but most did not, leaving us with just walks, hit by pitch, singles, doubles, triples, home runs, and plate appearances. Using the 2014 wOBA coefficients, I arrived at this simplified formula:
(BB(.69) + HBP(.72) + S(.89) + D(1.28) + T(1.64) + HR(2.14))/(PA)
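As a minimal sketch, the simplified formula above can be written as a small Python function. The dictionary keys and the sample stat line are made up for illustration; only the coefficients come from the formula.

```python
# Rounded 2014 linear weights, exactly as they appear in the formula above.
WOBA_WEIGHTS = {"BB": 0.69, "HBP": 0.72, "1B": 0.89, "2B": 1.28, "3B": 1.64, "HR": 2.14}


def simple_woba(line):
    """Simplified wOBA: weighted events divided by plate appearances.

    `line` is a dict of counting stats; the key names are an assumption
    of this sketch, not any projection system's file format.
    """
    pa = line["PA"]
    if pa <= 0:
        raise ValueError("PA must be positive")
    weighted_events = sum(w * line.get(stat, 0) for stat, w in WOBA_WEIGHTS.items())
    return weighted_events / pa


# Hypothetical stat line, for illustration only.
example = {"PA": 600, "BB": 60, "HBP": 5, "1B": 100, "2B": 30, "3B": 3, "HR": 20}
print(round(simple_woba(example), 3))
```

Because sac flies and intentional walks are left out, this will differ slightly from a published wOBA for the same player.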
Merges and Missing Players
Next I had to assign unique identification to all players. This is always an arduous task, ahem Chris Young, but I was able to match most players. There was a distinction to be made between players who simply weren’t projected and players I failed to correctly match to the actual 2014 data. The percentage of total plate appearances that were not matched was pretty small. Players who were not projected or not matched were given a wOBA projection 20 points below league average. This is close to the actual mean for that subgroup of players. The exception is Marcel, whose unprojected or unmatched players were given a projection of league average performance, per Marcel’s own instructions. This table summarizes the results of the merges.
|Projection Systems, Merges and Misses, 2014|
|System||Players Unprojected||PA per player||PA Unprojected||PA total||Share||Given|
The fact that more players were missing from PECOTA could be my fault for not matching well enough; it could also be that these players just weren’t projected by PECOTA. Take it for what you will, but this is a potential source of bias. The overall portion of plate appearances not matched was so small that whatever we projected these players at hardly affected the results.
Correctly merging my data sets was the bulk of the work, yet there was still more to be done before we could have fun with it. To compare systems we need to adjust them to a common mean. We only care how a player performed in the context of the projection’s league average. If Mike Trout was projected for a .425 wOBA in a league with a projected .340 mean, we want to count this the same as if he were projected for .400 in a league with a projected .315 mean. In 2014, disregarding pitchers, the properly weighted league wOBA was .315. To make the adjustment, for each system I first calculated the projected mean of players who actually played in 2014, weighting by their 2014 plate appearances; then I filled in the missing players with a projection 20 points below that weighted mean. Then the mean was recalculated and scaled up to .315.
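The adjustment described above might look roughly like the sketch below. The function and argument names are hypothetical. The sketch also assumes the final step is an additive shift, since the Mike Trout example treats an 85-point gap the same in both leagues; a multiplicative rescale is the other reading of "scaled" and would give slightly different numbers.

```python
def weighted_mean(values, weights):
    """Mean of `values` weighted by `weights` (here, 2014 plate appearances)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def adjust_to_league(projected, actual_pa, target=0.315, penalty=0.020):
    """Fill missing projections, then shift so the weighted mean equals `target`.

    projected: player -> projected wOBA, or None if unprojected/unmatched.
    actual_pa: player -> 2014 plate appearances (defines the population).
    penalty:   how far below the projected mean to place missing players
               (0.020 for most systems, 0.0 for Marcel).
    """
    players = list(actual_pa)
    known = [p for p in players if projected.get(p) is not None]
    base = weighted_mean([projected[p] for p in known], [actual_pa[p] for p in known])
    filled = {p: projected[p] if projected.get(p) is not None else base - penalty
              for p in players}
    mean = weighted_mean([filled[p] for p in players], [actual_pa[p] for p in players])
    shift = target - mean  # additive shift: an assumption of this sketch
    return {p: v + shift for p, v in filled.items()}


# Hypothetical three-player league, for illustration only.
proj = {"a": 0.400, "b": 0.300, "c": None}
pa = {"a": 500, "b": 500, "c": 100}
adjusted = adjust_to_league(proj, pa)
```

After the shift, the plate-appearance-weighted mean of `adjusted` is .315 and the gaps between players are unchanged.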
And now the part you’ve been waiting for, the results. This part was simple enough; I calculated the mean absolute error, weighted by plate appearances, for each of the five systems. This is how they fared projecting the league population.
The projection systems all did pretty well and, as usual, are relatively close together. ZiPS takes home the crown with the lowest mean absolute error. While Marcel comes in last, the result demonstrates Marcel’s original intended purpose; it shows us that a simple projection system will get us most of the way there in the aggregate.
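For reference, a mean absolute error weighted by plate appearances can be sketched as follows. The dictionaries and the two-player example are hypothetical; the real calculation ran over every matched hitter.

```python
def weighted_mae(projected, actual, pa):
    """Mean absolute error of projected vs. actual wOBA, weighted by PA.

    All three arguments are dicts keyed by player; `actual` defines the
    set of players scored.
    """
    total_pa = sum(pa[p] for p in actual)
    if total_pa <= 0:
        raise ValueError("total plate appearances must be positive")
    weighted_errors = sum(abs(projected[p] - actual[p]) * pa[p] for p in actual)
    return weighted_errors / total_pa


# Hypothetical players, for illustration only.
projected = {"a": 0.350, "b": 0.300}
actual = {"a": 0.330, "b": 0.340}
pa = {"a": 600, "b": 200}
mae = weighted_mae(projected, actual, pa)
```

Weighting by plate appearances means a miss on an everyday player counts for more than the same miss on a part-timer, which is why the full-time hitter dominates the result here.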
What is more interesting, however, is how the models performed on different subsets of players. I have split the players into groups based on experience and age, as well as by two binary identifiers: breakout and breakdown.
At the heart of it, what to do with past performance is the question all projection systems are trying to solve. Thus it is natural to group based on career playing time. I bunched players into three categories: rookies (0-300 plate appearances), middlers (300-1,800 PA), and veterans (1,800+ PA).
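The three experience buckets can be expressed as a small classifier. This is only a sketch: the text gives overlapping endpoints (300 and 1,800 appear in two ranges each), so the handling of the exact boundaries below is an assumption, as is the idea that career plate appearances are counted before the 2014 season.

```python
def experience_group(career_pa):
    """Assign a player to an experience bucket by career plate appearances.

    Boundary handling is an assumption: exactly 300 PA counts as a
    middler and exactly 1,800 PA counts as a veteran.
    """
    if career_pa < 0:
        raise ValueError("career PA cannot be negative")
    if career_pa < 300:
        return "rookie"
    if career_pa < 1800:
        return "middler"
    return "veteran"


# Hypothetical career totals, for illustration only.
groups = [experience_group(n) for n in (0, 299, 300, 1799, 1800, 5000)]
```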
The rookies are interesting because for the most part these projections were going off minor league data, so we want to see who best translated minor league performance to major league performance. Steamer was able to break away from the pack here and projected these unknown quantities quite impressively. On the whole, Steamer was not far off its overall performance. Meanwhile, because it does not take minor league data into account at all, Marcel unsurprisingly over-projected this group of players and performed the worst.
The middling players are those with some major league experience, but not the full range that most of the projections like to use. PECOTA did well here, perhaps because history is a better indicator than the other systems’ algorithms on such a small sample of major league data. Again, without the full three seasons of data available, Marcel lagged behind the pack.
The veterans have over 1,800 career plate appearances and for the most part have played the full three seasons to be used by the projections. This is where Marcel got to shine. Marcel slips right into the middle of the pack here, tied with Steamer. Notice that the best, ZiPS, and the worst, PECOTA, are now only .009 away from each other, compared to .064 with the rookies and .023 with the middlers. When players rack up larger sample sizes, we can project them much more accurately.
No one system stands out above ZiPS, our overall winner, in any bracket. When it comes to experienced players you can’t go wrong, use whichever system is available to you, but when it comes to the rookies steer clear of Marcel and opt for ZiPS if you can.
Age is another interesting subgroup to look at because different projections handle player aging differently. PECOTA and ZiPS rely on historical comparisons while Marcel uses an age factor. Steamer does not explicitly take aging into account at all. I examined how the projections did for really young players, really old players, and each age in between.
|Mean Absolute Error|
|24 and Below||0||0.0297||0.0311||0.0315||0.0288||0.0294|
|30 and Above||0||0.0252||0.0252||0.026||0.0249||0.0243|
|24 and Below||0.3108||0.3066||0.3135||0.3133||0.3100||0.3135|
|30 and Above||0.3168||0.3215||0.3194||0.3191||0.3205||0.3201|
There are a few things to take away from examining the breakdown by age. The first is that projections do better on older players. Given our results above, this should come as no surprise. It appears that for all systems the challenges of dealing with player aging are more than outweighed by the advantage gained by the added data to draw on.
Among the oldest players, ZiPS takes the cake, with Steamer coming in second and PECOTA and Marcel tied for third. There seems to be no direct connection between performance on these players and the specific method used to account for aging. ZiPS did the best by using historical equivalences, but Steamer came in second without explicitly looking at age at all. PECOTA, which also uses historical equivalences, tied exactly with Marcel, which uses a simple age factor.
Interestingly Oliver, which does better than Marcel overall, struggled with the extremes. Oliver came in dead last among the youngest and among the oldest players. This indicates that perhaps Oliver needs to revise the way it takes into account age.
Overall ZiPS does the best on older players and Steamer does the best on youngsters. However, the difference isn’t large enough for me to want to go out of my way to use two different projection systems for young and old players. I’ll stick to our overall winner and use ZiPS.
Breakout and Breakdown Players
Another thing we might be interested in is how each system did at predicting the extremes. I examined how each system fared in projecting breakout players. I defined a breakout player as a player whose wOBA increased by 30 points or more from 2013 to 2014. These could be young guys coming into their own or veterans coming off of an injury-plagued season.
All projection systems are cautious about projecting (or will not project at all) a performance drastically different from what a player has done in the recent past. Thus, none did a very good job predicting a breakout. This is one area where it may be better to use subjective measures.
We do, however, see the two systems that use historical equivalences, ZiPS and PECOTA, do well, although Steamer did the best without using them. With both of the systems that incorporate historical equivalences doing well, we might start to think there is something to equivalency systems doing better at predicting large swings in performance.
Similarly, I looked at how each system did on the opposite type of player: those who experienced breakdown seasons. A player was flagged as having a breakdown season if his wOBA decreased by 30 points or more from 2013 to 2014.
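The breakout and breakdown flags defined above can be sketched as a single function. The function name and the "neither" label are hypothetical. The wOBA change is rounded to whole points before comparing so that a swing of exactly 30 points is not lost to floating-point error.

```python
def swing_label(woba_2013, woba_2014, threshold_points=30):
    """Flag a season as a breakout, a breakdown, or neither.

    A wOBA "point" is .001, so the default threshold of 30 points
    corresponds to a change of .030.
    """
    change_points = round((woba_2014 - woba_2013) * 1000)
    if change_points >= threshold_points:
        return "breakout"
    if change_points <= -threshold_points:
        return "breakdown"
    return "neither"


# Hypothetical season pairs, for illustration only.
labels = [
    swing_label(0.320, 0.350),  # up 30 points
    swing_label(0.350, 0.320),  # down 30 points
    swing_label(0.320, 0.335),  # up 15 points
]
```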
Again we see the same three systems as the top performers, this time with ZiPS taking the number one spot, and PECOTA as the runner up. This is more evidence that when looking for big advances or declines in players, using equivalences might be the way to go.
As a whole, the systems do a slightly better job at predicting breakdowns than breakouts, but not by as much as I would have expected. Intuitively it makes sense that a breakdown is easier to predict than a breakout, but in reality both are challenges for algorithms that tend towards the mean. It appears that systems like ZiPS and PECOTA will do slightly better for this type of player. If you are looking at players you think are about to do something unusual, I would tend toward ZiPS or PECOTA.
One note in defense of Oliver is that Oliver’s projections have the largest variation, which may be preferable to some when choosing a projection system. Suppose one system projected all players to be league average and another varied. If they both had the same absolute error, you would prefer the one that includes more variation; the one with no variation essentially tells you nothing (and if you adjust to league average like I did, it does tell you nothing). This final table summarizes the variation of the adjusted projection systems. The mean for all the systems was .3152 after being adjusted to league average; what is important here is the spread in the projections.
|Adjusted Projections Summary|
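One way to quantify spread is a standard deviation weighted by plate appearances. This is a sketch of one reasonable measure, not necessarily the statistic reported in the table, and the sample values are made up. It also illustrates the point above: a system that projects everyone at league average has zero spread.

```python
import math


def weighted_std(values, weights):
    """Population standard deviation of `values`, weighted by `weights`."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(v * w for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(variance)


# Hypothetical projections, for illustration only.
flat_system = weighted_std([0.315, 0.315, 0.315], [100, 200, 300])
varied_system = weighted_std([0.300, 0.330], [1, 1])
```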
Overall, projection systems may be improving as their creators revise their algorithms. In Tango’s study on 2007-2010 projections, Marcel was right in the mix with the others, but four years later we see some more separation. No matter how you slice it, all these projections do a fine job and the differences between them are subtle, but not negligible. We saw that among inexperienced players Marcel struggles and should be avoided. For players who have racked up lots of at-bats and years, it’s hard to go wrong, but ZiPS performed the best. Again, ZiPS proved to be the best when looking at breakouts and breakdowns, with PECOTA also doing well. This gives credence to the idea that historical equivalences are especially useful for predicting players who are about to break from a normal trajectory.
Given all of the above we can decisively say that in 2014 ZiPS did the best job. It performed the best overall and in most of our subsets while never stumbling into the bottom half. You can’t go wrong with any of these systems, but as we look ahead to 2015, I’ll be using ZiPS.
References and Resources
- ZiPS and Oliver provided by the Baseball Projection Project.
- PECOTA provided by Baseball Prospectus.
- Marcel provided by David Pinto.
- Steamer provided by Jared Cross.
- The general structure of my evaluation was based on a 2010 study on projections by Tom Tango.