Projection Roundtable (Part 4)

Note: You can read Part 1 here, Part 2 here, and Part 3 here.

The cool thing about looking toward the future is that we don’t really know what it holds. We can fantasize about peace in the Middle East or the Pirates making the playoffs precisely because we can never be sure what will happen. And yet, in every area, whether it be political relations or baseball, there are people trying to do just that.

Since The Hardball Times is not a political website, we’ll leave politics alone and focus on baseball. Baseball projections are a multi-million dollar business in the U.S., with huge demand for fantasy information. Everyone from fantasy baseball players to major league teams relies on getting the best and most accurate projections to win. And yet there’s still so much to be discovered and understood about how to best predict the future.

To help facilitate some discussion of this issue, I gathered most of the best forecasters in the business for a roundtable that will run Monday through Friday in five parts. Besides me (David Gassko), the participants were:

Chris Constancio, who writes a weekly column for The Hardball Times and publishes his projections for minor leaguers at FirstInning.com.

Mitchel Lichtman, who is a co-author of The Book – Playing the Percentages in Baseball. He has been doing sabermetric research for almost 20 years, and for two years was the senior advisor for the St. Louis Cardinals.

Voros McCracken, who is most famous for his invention of Defensive Independent Pitching Statistics (DIPS), and is a former consultant for the Boston Red Sox.

Dan Szymborski, who is Editor-in-Chief of Baseball Think Factory, and publishes the ZiPS projection system.

Tom Tango, who co-authored a baseball strategies book called The Book – Playing The Percentages In Baseball, which is available at Inside The Book.

Ken Warren, who started doing player projections and studying baseball statistics in 1992, when he was totally unsatisfied with the projections that STATS was publishing in its annual handbook and with Bill James’ claim that he couldn’t or wouldn’t project pitchers until they had 500 major league innings. Since then, he has been continually adding little bits of sophistication and complexity to his system, and has noticed small improvements in accuracy over the years.

David Gassko: OK, let’s try to steer this conversation in a slightly different direction. This is a roundtable after all, and here’s the question surrounding projection systems that has interested me most, so I’m very interested in hearing all your opinions on this. What do you think of using similarity scores in projections? I have my own thoughts here, but I am very open to hearing various opinions on the subject. Chris, of course, does similarity score-based projections, though I’m sure all of you have given the issue plenty of thought.

Mitchel Lichtman: It has always intrigued me as well. I assume that that is the primary basis for the PECOTA system, which I think is a good one. Ideally, the perfect projection system does that. After all, everything that happens in the future in baseball can be perfectly projected based on history, as long as we match up the “relevant and proper variables” in the model. The problem is the sample size issue. I am sure that PECOTA accounts for that to some extent, by incorporating regression to the mean or expanding the sample of players if need be to include those who are not all that similar to the subject player.

To some extent, all regression-based projection systems (like Marcel) are based on what happens to players with similar profiles, so that a system that primarily or exclusively uses similar players in history to project future performance is really doing the same thing, except it is based on the assumption that players of a similar ilk to the player in question will have a different career path than players in general. That may or may not be true. And even if it is true, you are still sacrificing sample size for that truth.
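For readers who want to see what a regression-based baseline like Marcel actually does, here is a minimal Python sketch: weight recent seasons by recency and playing time, then regress toward the league mean. The 5/4/3 weights and 1,200-PA regression constant are in the spirit of Marcel’s published method, but treat every number here as an illustrative assumption rather than any participant’s actual system.

```python
# A minimal sketch of a Marcel-style regression baseline.
# All constants are illustrative assumptions, not anyone's real system.

def marcel_like_rate(season_rates, season_pa, league_rate,
                     weights=(5, 4, 3), regression_pa=1200):
    """Project a rate stat (e.g., HR per PA) from the last three seasons,
    most recent first, regressing toward the league mean."""
    # Weight each season's observed rate by recency and playing time.
    num = sum(w * r * pa for w, r, pa in zip(weights, season_rates, season_pa))
    den = sum(w * pa for w, pa in zip(weights, season_pa))
    # Add phantom plate appearances of league-average performance:
    # the less real data, the harder the pull toward the mean.
    num += league_rate * regression_pa
    den += regression_pa
    return num / den

# Example: home run rates of .050/.045/.040 over 600/550/500 PA,
# in a league averaging .030 HR per PA, projects to about .044.
print(round(marcel_like_rate([.050, .045, .040], [600, 550, 500], .030), 4))
```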

In other words, what if it is true that 34-year-old players who hit around 30 home runs a year and strike out around 100 times with 50 walks and are around 6’2” tall and 210 pounds have a unique career path? So you look at players who had similar profiles in component performance, age, height, weight, etc., and you see what they did in their 35th year. That is the essence of this kind of system, right? Again, if you had 100 such players with 50,000 plate appearances in that 35th season, you would essentially have a near-perfect system. But you don’t. You have maybe 20 players. What if, by sheer luck alone, eight of them retired or got injured after that 34th year, so you “predict” that your subject player has a 40% chance of falling off the face of the earth? You may have come up with a bad projection.

Maybe it would have been better to use all 34-year-old players who were over 6 feet tall and over 200 pounds. Or maybe it would be better to use all players, period, and then adjust for age and height and weight (which is essentially what I and other forecasters do). Who knows? These are some of the potential weaknesses of this kind of system. First you must establish that the variables you are using to create your “similar players” are significantly correlated with career path, in order to justify your decreased sample size. Then you have to somehow adjust for the fact that you often have small samples of similar players.
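Mitchel’s description translates almost directly into code. The sketch below, with made-up traits and scalings, shows both the mechanics and where the sample size trap lives: everything hinges on that small pool at the end.

```python
# A minimal sketch of a "similar players" projection, per Mitchel's
# description. Traits, scales, and pool size are made-up assumptions.

import math

def similarity(a, b, scales):
    """Negative scaled Euclidean distance: higher = more similar."""
    return -math.sqrt(sum(((a[k] - b[k]) / s) ** 2 for k, s in scales.items()))

def comp_projection(subject, history, scales, n_comps=20):
    """history: list of (profile_dict, next_season_outcome) pairs."""
    ranked = sorted(history, key=lambda rec: similarity(subject, rec[0], scales),
                    reverse=True)
    pool = ranked[:n_comps]
    # This is the trap: with only ~20 comps, a couple of fluke
    # retirements or injuries can swing the "prediction" wildly.
    return sum(outcome for _, outcome in pool) / len(pool)

# The 34-year-old from Mitchel's example:
subject = {"age": 34, "hr": 30, "so": 100, "bb": 50, "height": 74, "weight": 210}
scales = {"age": 2, "hr": 8, "so": 25, "bb": 15, "height": 2, "weight": 15}
```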

So the bottom line to me is that it is an interesting concept, and as I said, in theory it is actually the best way to do projections (the best way to project anything, in baseball or in other areas of life and science such as the weather, is to use similar historical data to “see what actually happened”). But there are definitely some methodological problems that need to be overcome. I have never seen a “similar players” type system detailed, analyzed, and tested, so I just don’t know. PECOTA is a black box, and I don’t know of anyone who has used a similar methodology and described it in public. For now, I’ll stick to my regression-based system, which, as I also said, is to some extent just a “similar players” system in different clothes (or the “similar players” system is a regression-based one in different clothes).

Tom Tango: There’s no question that the single most interesting thing about using comparables is that it is fun.

Forecasting engines by themselves are deathly boring, because they are nothing but numbers on top of numbers. A similarity-based system lets you put a face to the numbers. Humanizing a system is usually a good thing. And fans just love this stuff. So, even if it doesn’t add any accuracy, it’s a good thing to do.

Now, how far to take it? As Mitchel pointed out, how worried are you if your sample size is low? After all, if you are lucky enough to get Carlton Fisk in your sample, that does wonders for your career path.

But, you don’t have to go all the way with it. Let’s assume that you have a similarity-based system that takes the 10 most similar players, and then adds the 10 most similar players of each of those 10 (so that you have a maximum of 100, but because of duplicates, you might get 75).

Let’s also assume that the similarity-based system has a Marcel-like monkey basis as well. Then, it’s simply a matter of weighting the humans and the monkey. For a Roger Clemens, the similarity-based system may say, “hey, this guy is really unique.” So, PECOTA may say that it’ll weight Roger’s humans at 5%, and the monkey at 95%. But, if you have someone more normal, the similarity-based system may weight the humans at 50% and the monkey at 50%.

So, if you get a Carlton Fisk in your set of comps for, say, Joe Mauer, then the similarity-based system may say, “you know, Fisk really had a weird path, so much so that even though Fisk is comparable to Mauer, he’s an outlier.” So, the similarity-based system may decide to underweight the humans.

So you can make a whole bunch of adjustments that may mitigate the sample size issue.
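Tom’s human/monkey blend can be read as a two-line formula. The sketch below assumes a made-up linear mapping from comp-pool quality to weight, with a 50% cap; how PECOTA actually sets these weights is unknown.

```python
# A sketch of Tom's human/monkey blend. The 50% cap and the linear
# mapping from pool quality to weight are illustrative assumptions.

def blend(comp_forecast, baseline_forecast, comp_quality):
    """comp_quality in [0, 1]: 1 = deep pool of close comps,
    0 = a unique player (a Roger Clemens) with no real comparables."""
    w_humans = 0.5 * comp_quality   # the comps never get more than 50%
    return w_humans * comp_forecast + (1 - w_humans) * baseline_forecast

# A unique player leans almost entirely on the monkey (the 5%/95% case):
print(blend(comp_forecast=0.055, baseline_forecast=0.045, comp_quality=0.1))
# A typical player splits the difference 50/50:
print(blend(comp_forecast=0.055, baseline_forecast=0.045, comp_quality=1.0))
```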

The only issue with PECOTA as a similarity-based system is the percentile ranges. I don’t see how they can be trusted at all when you have a rookie and an experienced pitcher each showing around the same ERA range. The ERA range should be determined almost solely by the uncertainty level of your mean estimate. And that uncertainty level is based almost entirely on the number of plate appearances or batters faced for your player. Guys like Randy Johnson and Brad Radke would be a little different because of their balls in play per plate appearance.

My guess is that PECOTA uses the comp players as the range in the estimate, without applying any uncertainty level.
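Tom’s point about ranges can be put in one formula: the sampling error of an observed rate shrinks with the square root of opportunities. A minimal illustration follows (ERA is messier than a simple binomial rate, but the principle carries over):

```python
# Sampling error of a rate shrinks as 1/sqrt(n): the reason a rookie's
# projection range should be much wider than a veteran's.

import math

def rate_standard_error(rate, n):
    """Standard error of an observed binomial rate over n trials."""
    return math.sqrt(rate * (1 - rate) / n)

# A rookie with 200 batters faced vs. a veteran with 4,000:
for bf in (200, 4000):
    print(bf, round(rate_standard_error(0.10, bf), 4))
# The rookie's uncertainty is sqrt(20), roughly 4.5 times the veteran's,
# so their ERA ranges should not look anything like each other.
```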

Chris Constancio: I use similarity scores in my projections. I think this is a reasonable way to consider latent variables that might be important but that I might not think to include in a regression or growth curve model. For example, maybe there’s something about stocky middle infielders’ development in their early 30s that is noteworthy and could improve my forecast for Ronnie Belliard. It’s good if that shows up in the similar players’ development and gets reflected in the forecast, because I would never think to put a “stocky middle infielder” dummy variable in a regression equation; it’s not really going to help me understand most players’ development.

I personally make league, park, and age adjustments for a “baseline” forecast for minor leaguers. Then, depending on the quality of the pool of similar players, I adjust that baseline to reflect the similar players’ results in subsequent seasons. I can say that I built the system on past data (predicting 2004 and then 2005 performances) and I’m happy with this year’s set of predictions so far, but a more formal analysis will have to wait until this winter.
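Chris’s two-stage approach can be sketched as follows; all of the factor values and the cap on the comps’ influence are invented for illustration, not his actual constants.

```python
# A sketch of the two-stage method Chris describes: a baseline from
# league/park/age adjustments, nudged toward the comps' subsequent
# results. All constants below are illustrative assumptions.

def baseline(raw_rate, league_factor, park_factor, age_factor):
    """Translate a raw minor league rate into a neutral context."""
    return raw_rate * league_factor * park_factor * age_factor

def adjusted_forecast(base, comp_mean, pool_quality, max_shift=0.3):
    """pool_quality in [0, 1]: a weak pool barely moves the baseline."""
    w = max_shift * pool_quality
    return (1 - w) * base + w * comp_mean

# Example: a Double-A hitter's .045 HR/PA, translated and then pulled
# slightly toward a decent comp pool that averaged .035 the next season.
b = baseline(0.045, league_factor=0.80, park_factor=0.98, age_factor=1.02)
print(round(adjusted_forecast(b, comp_mean=0.035, pool_quality=0.7), 4))
```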

I agree with most of the ideas already shared by Tom and Mitchel. Using similarity scores is ideal in many ways but in practice it is filled with methodological traps. You need relevant players to form a good forecast, but you need to balance that with a need for a large enough sample. It’s not easy. I cannot emphasize enough how important the “relevant” condition is. The entire validity of the system hinges on the technique for identifying similar players.

A 21-year-old second baseman who has not played much ball beyond Double-A, for example, might receive a PECOTA forecast informed by guys who have been in the major leagues since age 19 (e.g., Gary Sheffield) and also guys who were playing in 1949 (e.g., Cass Michaels). Is that really a “relevant” sample? First, league translations are nice, but the types of players who get called up as teenagers are systematically different from the players who are still plugging away in the mid- to upper minor leagues several years later. Second, there are significant differences in how the game is played and how players develop (medical advancements, weight training, etc.) today in comparison to 50+ years ago.

I now only use player data from the past 10 years and compare players who are in comparable leagues at the same age. I have MLB data going back 100+ years in my database but I won’t touch it in my projections. Is it the only or best way to do this? Probably not, but these decisions have improved my forecasts and, just as importantly, I’m finally getting comfortable with the validity of the whole method.

Uncertainty estimates are a tricky issue. Similar player data can be used to generate probabilities of significant improvement (“breakout”) or significant decline (“collapse”). This should not, however, be the only information used in estimating confidence intervals. As Tom already said, the reliability of your initial estimate (playing time, mostly) should play a large role in determining the level of uncertainty of any prediction. I don’t think it should be the only consideration, however. I am struggling with integrating the two sources of information to form a meaningful confidence interval around any single prediction.
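One plausible way to integrate the two sources Chris mentions is to treat the comp-derived spread and the playing-time sampling error as independent variances and add them in quadrature. This is only a sketch of the integration problem, not how any published system does it:

```python
# Combining comp-derived spread with playing-time sampling error by
# adding variances in quadrature. A sketch of the problem only.

import math

def combined_interval(point, comp_sd, rate, pa, z=1.28):
    """Roughly an 80% interval around a projected rate."""
    sampling_sd = math.sqrt(rate * (1 - rate) / pa)   # reliability term
    total_sd = math.sqrt(comp_sd ** 2 + sampling_sd ** 2)
    return point - z * total_sd, point + z * total_sd

# A .340 OBP projection: comps suggest a .020 spread, and 450 PA of
# data adds about .022 of sampling noise on top of that.
print(combined_interval(0.340, comp_sd=0.020, rate=0.340, pa=450))
```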

Mitchel Lichtman: Those are some nice comments and insights from someone who actually works with similar-player models as well as pure regression models.

Voros McCracken: I don’t like using similarity scores for projections, mainly due to the sample size issues. I think they can be quite valuable in teasing out the sorts of relationships that can help improve projections, but they are not as well suited to doing the projections directly. I think the chances of a similarity system finding a real, unseen, unknown relationship between two players are as low as, if not lower than, the chances of it picking up random statistical flotsam. With the necessity of using relatively few samples (James used to use 10), that flotsam can come into play.

If there’s some unseen, unknown variable that you think you need to include, then I think the solution is to go about seeing and knowing it, rather than using similarity scores and hoping they take care of it. You can use similarity scores to go about learning about it (in fact, I recommend it), but when it comes to the actual projecting, I think it’s better to know all the mathematical adjustments that go into the system.

I have similar objections to the use of neural nets for this as well: if the system is making an adjustment to a projection, the projector should know what that adjustment is, if for no other reason than when he’s asked why Player A’s projection is so high/low, he’ll know the answer.

Ken Warren: This issue is too technical/statistically oriented for me.

It seems to me that if you know the skill level of the player you are projecting, and you know the aging pattern of each particular skill, and you eliminate luck or randomness from that player’s past performance, you will be able to derive an excellent projection. It seems as if there is little to be gained by comparing this player with other, hopefully similar, players from the past. In fact, you may be introducing more error into the process by using players who are somewhat similar but not totally similar, or by having too small a sample of players who are similar enough to be useful.

It seems that the premise behind this system is that “we don’t know how particular skills age,” so we will try to find similar players from the past and use those trends. I don’t buy the concept that we don’t know how each component skill ages. Once we know that, there is no advantage at all to using other players’ aging trends.
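Ken’s premise, that each component skill has a known aging pattern, reduces to a table of multipliers applied to luck-adjusted skills. The values below are invented for illustration; they are not Ken’s actual curves.

```python
# A sketch of Ken's skill/age approach: luck-adjusted component skills
# times per-skill aging multipliers. All multipliers are invented.

AGING_29_TO_30 = {       # assumed year-over-year change per skill
    "contact": 0.995,    # contact erodes slowly
    "power":   0.990,
    "walks":   1.005,    # plate discipline can still improve
    "speed":   0.970,    # speed declines fastest
}

def age_skills(skills, aging=AGING_29_TO_30):
    """skills: luck-adjusted per-PA rates keyed like the aging table."""
    return {k: rate * aging[k] for k, rate in skills.items()}

print(age_skills({"contact": 0.82, "power": 0.045,
                  "walks": 0.09, "speed": 0.012}))
```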

And how do you measure “similarity”? Do you use actual AVG, OBP, and SLG, or expected batting average, expected OBP, and expected SLG? There still seem to be a lot of people who think a pitcher’s ERA actually means something, even if it is achieved with unusually high or low hit rates and strand rates.

Do you compare catchers with catchers or all players with similar hitting skills? Do you take defense into account? A below average defensive shortstop with a .690 career OPS will have a completely different career path than a great defensive shortstop with the same .690 OPS.

How about a first baseman with a .750 OPS (Travis Lee) and a shortstop with a .750 OPS (David Concepcion, for example)? Are these players in each other’s “comparisons”? When looking for similar types of players, what are the criteria: ISO or SLG? OBP or OBP-AVG? How about contact rate ((AB-K)/AB), plate discipline (BB/PA), batting eye (BB/K), or durability (PA per season)?
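Ken’s list of candidate criteria maps directly onto a Bill James-style similarity score: start at 1,000 and subtract penalties per unit of difference. Which stats to use and how big the penalties should be are exactly the open questions he raises; the choices below are assumptions for illustration.

```python
# A James-style similarity score over the criteria Ken lists.
# Stat choices and penalty sizes are illustrative assumptions.

PENALTIES = {               # points lost per unit of absolute difference
    "iso": 2000,            # isolated power: SLG - AVG
    "obp_minus_avg": 3000,  # walk-driven on-base skill
    "contact_rate": 1500,   # (AB - K) / AB
    "bb_per_pa": 3000,      # plate discipline
    "pa_per_season": 0.05,  # durability
}

def similarity_score(a, b, penalties=PENALTIES):
    """1000 = identical profiles; positional and defensive penalties
    could be added the same way, to keep Ken's good-glove shortstop
    out of the bad-glove shortstop's comp pool."""
    score = 1000
    for stat, penalty in penalties.items():
        score -= penalty * abs(a[stat] - b[stat])
    return score
```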

I would be very interested in comparing my “skill/age” based projections with one based on similarity scores. It would be interesting to see if there is a pattern in the players where the projections are markedly different and if there is any difference in the accuracy of either system in those players’ projections.

David Gassko: Okay. So the consensus on similarity-based models seems to be that they are interesting, and potentially useful, but so fraught with small-sample and comparability issues that they are currently of limited (or maybe even no) use. Interesting. I assume that PECOTA and FiPRO, the only two similarity-score based systems that I know of, both first make a baseline projection, then a similarity-score projection, and then determine via regression how much each should be weighted in the final projection; if the similarity-score projection is significant, well then, it has to be adding something. I do agree, however, that we still have a ways to go in figuring out just how to do it right.

Ken Warren: Wouldn’t the “similarity-score projection” look better the worse the baseline projection is? If you have a great baseline projection, then the “similarity-score projection” is going to seem less useful. I think there is a lot more to be gained by improving the baseline projection than by attempting to use similarity scores. Of course, it is probably possible to do both.

Mitchel Lichtman: I don’t think that the consensus is that they are of limited value. I don’t know enough about them to say one way or another. I said that ideally they are an excellent way of doing projections, especially with the aid of regression models. There are potential problems associated with them, but I don’t know whether the problems are outweighed by the advantages or not. I would have to see the exact methodology as well as some kind of “testing” or evaluation.

As far as I can tell, PECOTA is an excellent system overall, and from what I understand it is primarily a similar-player type system. I agree that there is only so much you can do with a similarity system (it is limited by the number of players available), whereas you can probably improve a regression-based system (for lack of a better word) almost ad infinitum. Theoretically you can “nail” a regression-based system whereas you are always limited by sample sizes with a similarity system. As I said earlier, the latter would be perfect if there were an unlimited number of players, but alas there is not.

Voros McCracken: Well, basically I agree with Mitchel and Tom: the problem with using similarity systems to project is the sample size issue. Whatever information the similarity matches give you in excess of what a mathematically calculated projection would give you, it’s difficult to know whether that information is real or whether it’s just random statistical flotsam.

I think similarity studies are extraordinarily useful, but their value lies in using them to aggregate general information on trends. Those trends can then be incorporated into whatever mathematical system you use to project, or used for whatever other purpose you might need a matched-pair study for. I think if there’s a theory that stocky players age better, then it’s perfectly reasonable to use a similarity study to determine the extent to which that might be true. And once that’s determined, it shouldn’t be that hard to insert the math back into the existing system.

Finally, another, more technical problem with similarity systems is that baseball statistics don’t represent the actual abilities we’re trying to assess. If a guy has a 6.5% chance of homering in every at-bat in a 550 at-bat season, he might hit 25 homers or he might hit 41. I think it can be very misleading to assume that such differences in a player’s stats are developmental in nature rather than simply being probabilistic. With similarity scores, you’re reducing the sample to the point where that can be an issue. So I think they are best used when you know you must tightly control all other statistics except one, and to address issues such as collinearity and the like. But I don’t think they are as good at spitting out the ultimate projections for players.
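Voros’s example is worth working out, since the size of the spread surprises people. A quick sketch of the arithmetic:

```python
# Voros's example, worked out: a true 6.5% HR chance per at-bat, 550 ABs.
from math import sqrt

p, n = 0.065, 550
mean = p * n                  # 35.75 expected home runs
sd = sqrt(n * p * (1 - p))    # ~5.8 home runs of pure binomial noise
print(round(mean, 1), round(sd, 1))
# About 24 to 47 homers falls within two standard deviations: a 25-HR
# season and a 41-HR season can come from the same unchanged ability.
```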

References & Resources
xBA, xERA, xOBA, and xSLG, speed rating, and “contact rate” are all proprietary statistics developed at BaseballHQ.

