The cool thing about looking toward the future is that we don’t really know what it holds. We can fantasize about peace in the Middle East or the Pirates making the playoffs because we can never predict what will happen. And yet, in any area, whether it be political relations or baseball, there are people trying to do just that.
Since The Hardball Times is not a political website, we’ll leave politics alone and focus on baseball. Baseball projections are a multi-million dollar business in the U.S., with huge demand for fantasy information. Everyone from fantasy baseball players to major league teams relies on getting the best and most accurate projections to win. And yet there’s still so much to be discovered and understood about how to best predict the future.
To help facilitate some discussion of this issue, I gathered most of the best forecasters in the business for a round table that will run Monday through Friday in five parts. Besides me (David Gassko), the participants were:
Chris Constancio, who writes a weekly column for The Hardball Times and publishes his projections for minor leaguers at FirstInning.com.
Mitchel Lichtman, who is a co-author of The Book – Playing the Percentages in Baseball. He has been doing sabermetric research for almost 20 years, and for two years was the senior advisor for the St. Louis Cardinals.
Voros McCracken, who is most famous for his invention of Defensive Independent Pitching Statistics (DIPS), and is a former consultant for the Boston Red Sox.
Dan Szymborski, who is Editor-in-Chief of Baseball Think Factory, and publishes the ZiPS projection system.
Tom Tango, who co-authored a baseball strategies book called The Book – Playing the Percentages in Baseball, which is available at Inside The Book.
Ken Warren, who started doing player projections and studying baseball statistics in 1992, when he was totally unsatisfied with the projections that STATS was publishing in their annual handbook, and with Bill James’ claim that he couldn’t or wouldn’t project pitchers until they had 500 major league innings. Since then he has been continually adding little bits of sophistication and complexity to his system, and has noticed small improvements in accuracy over the years.
David Gassko: Hi guys,
Thank you for agreeing to participate. Let’s dive in right away with the first question I have:
Projection systems seem to be evolving into more and more complicated creatures. I think the general public opinion is, the more adjustments, the better. But on the other hand, Tom has been advocating for years that there really isn’t much you can improve on beyond a basic projection system like the Marcels, which simply uses three years of data, regresses, and age-adjusts. So here’s the question: How much value is there in continually fine-tuning projection systems? Are we better off using simple projections and understanding their limitations rather than trying to come up with the perfect system?
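For readers who have not seen a basic system of this kind, the three steps named in the question (weight three years of data, regress, age-adjust) can be sketched roughly as follows. This is only a minimal illustration: the function name, the 5/4/3 season weights, the 1,200 plate appearances of league-average regression, and the age constants are all assumptions chosen for the example, not figures taken from this discussion.

```python
def marcel_style_rate(seasons, league_rate, age,
                      weights=(5, 4, 3), regress_pa=1200,
                      peak_age=29, young_adj=0.006, old_adj=0.003):
    """Project a rate stat (e.g. OBP) from up to three past seasons.

    seasons: list of (rate, plate_appearances), most recent season first.
    Every default constant here is an illustrative assumption.
    """
    # Step 1: weight recent seasons more heavily than older ones.
    weighted_num = sum(w * rate * pa for w, (rate, pa) in zip(weights, seasons))
    weighted_pa = sum(w * pa for w, (rate, pa) in zip(weights, seasons))

    # Step 2: regress toward the league rate by mixing in a fixed
    # amount of league-average playing time.
    regressed = (weighted_num + league_rate * regress_pa) / (weighted_pa + regress_pa)

    # Step 3: nudge the result up for players younger than the assumed
    # peak age and down for players older than it.
    years_from_peak = peak_age - age
    per_year = young_adj if years_from_peak > 0 else old_adj
    return regressed * (1 + years_from_peak * per_year)


# Hypothetical hitter: three seasons of OBP with plate appearances.
projection = marcel_style_rate(
    [(0.380, 600), (0.350, 550), (0.330, 500)], league_rate=0.335, age=27
)
print(round(projection, 3))
```

The point of the sketch is how little machinery is involved: a player with less playing time is pulled harder toward the league rate, and that is essentially the whole model.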
Ken Warren: I would agree with Tom’s basic philosophy; however, I think we do better by basing projections on component skills rather than traditional baseball statistics. Metrics such as batting average, on-base percentage, slugging percentage, runs scored, and RBI are heavily driven by luck (primarily batting average on balls in play, or BABIP) and also by the run-scoring environment a player plays in.
By using component skills you can age-adjust much more effectively because each skill tends to peak at different ages. Based on studying actual baseball data from 1992 on I have discovered the following trends:
Speed measured by Baseball HQ speed ratings – peaks at age 24
Health – measured by PA per season – peaks at age 27
Contact Rate – measured by K/AB – peaks at age 28
Power – measured by SLG-AVG (ISOP) – peaks at age 30 or 31
Plate discipline – OBP-BA (ISOD) – peaks at age 34
Batting Eye – BB/K – peaks at age 34
So in summary you can use the past three years of data, but you need to:
a) Normalize the past data by eliminating the luck effect of BABIP (by using xBA, developed by Baseball HQ, in place of average)
b) Age adjust each component skill separately because of the different age peaks of different skills.
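The component measures listed above can be computed from an ordinary stat line. The sketch below is a hedged illustration only: the function and dictionary names are hypothetical, the proprietary Baseball HQ metrics (speed rating, xBA) are left out because their formulas are not given here, and the power peak is stored as 30 even though the discussion says "30 or 31".

```python
# Peak ages as stated in the discussion (power given as "30 or 31"; 30 used here).
PEAK_AGE = {"contact": 28, "power": 30, "discipline": 34, "eye": 34}


def component_skills(ab, h, tb, bb, k, hbp=0, sf=0):
    """Return the component measures named in the list above."""
    avg = h / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = tb / ab
    return {
        "contact": k / ab,        # K/AB, as listed (lower means more contact)
        "power": slg - avg,       # ISOP = SLG - AVG
        "discipline": obp - avg,  # ISOD = OBP - BA
        "eye": bb / k,            # BB/K
    }


def years_past_peak(age):
    """Positive values mean the skill's stated peak age has already passed."""
    return {skill: age - peak for skill, peak in PEAK_AGE.items()}


# Hypothetical 30-year-old hitter.
print(component_skills(ab=500, h=150, tb=250, bb=50, k=100))
print(years_past_peak(30))
```

Keeping each skill separate is what makes step (b) possible: the same 30-year-old is two years past his contact peak but still four years short of his plate-discipline peak.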
As for pitchers, I look at the three previous seasons and estimate IP, BB, K, and HR for the upcoming season. I then use the HQ formula for xERA (which totally ignores luck fluctuations in the hits/IP ratio) to predict a pitcher’s expected ERA. One of the biggest mistakes that a lot of knowledgeable baseball people still tend to make is the assumption that there is some particular pitching skill in the percentage of balls put into play that become hits.
I have not always used this methodology and feel that it is worth noting that the biggest improvement in my projections came when I started projecting each component skill separately and stopped trying to estimate the number of hits a pitcher would allow based on past history.
I don’t consider this methodology to be particularly complicated, but it has been effective, as has been verified in studies of projection system accuracy done by Baseball Prospectus and Theron Skyles.
Chris Constancio: My answer is “it depends.” I’m all for parsimony and Marcels is really enough for many purposes.
But if we recognize flaws, why not fix them? It’s not difficult to address common flaws in projection systems, including Marcels. Here are two shortcomings that I think are relatively easy to solve:
First, most projections look at past results rather than the components that make up those results. This is an especially important problem when we have limited information about a player. Looking at batted ball information rather than the results of those batted balls would be a good start. Most projection systems fail to consider something as simple as batting average on balls in play (BABIP). For example, Marcels’ Ryan Zimmerman projection implies a .355 BABIP this year, and that alone is good reason to doubt his projected .308 batting average. PECOTA also has unrealistic BABIP implied in some player projections. There’s no reason for this.
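Checking the BABIP implied by a projected stat line takes only a few lines. The sketch below uses the standard definition, (H - HR) / (AB - K - HR + SF). The projected line is hypothetical (it is not Zimmerman's actual projection), and the plausibility band of .260 to .340 is an assumption chosen for illustration.

```python
def implied_babip(ab, h, hr, k, sf=0):
    """BABIP implied by a stat line: hits on balls in play / balls in play."""
    balls_in_play = ab - k - hr + sf
    return (h - hr) / balls_in_play


def looks_unrealistic(babip, low=0.260, high=0.340):
    """Flag a BABIP outside an assumed plausible range (illustrative bounds)."""
    return babip < low or babip > high


# Hypothetical projected line for a hitter.
babip = implied_babip(ab=600, h=185, hr=20, k=110, sf=5)
print(round(babip, 3), looks_unrealistic(babip))
```

A sanity check like this does not say what the right batting average is; it only flags projections whose hit totals depend on an unusually favorable rate of balls in play falling in.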
Second, we need to allow for heterogeneous paths of development. Baseball players develop in ways that are more alike than different, and that’s why a system like Marcels seems to work very well when you look at group averages of projections. But that doesn’t mean the differences between players are not significant. I include “significant improvement scores” for several components of performance in my projections. These scores estimate the probability that a player makes a one standard deviation improvement or better in some area of their performance. For example, let’s consider 24-year-old hitters because we know this group improves on extra-base hit percentage and home run percentage relative to past performances.
Here are the lowest 24-year-old breakout scores for ISOP this year:
1. Colt Morton
2. Jeff Frazier
3. Chris Carter
4. Mitch Maier
5. Alejandro Machado
6. Javon Moran
7. Seth Smith
8. Kevin Melillo
9. B.J. Szymanski
10. Chris Stewart
None of those players have significantly improved upon their power production from last year or their baseline ISOP when we adjust for league and park contexts. Only two players (Machado and Smith) have made incremental improvements in ISOP even though all the above players are at an age where we expect improved power in general. I’ll share some more rigorous analyses in the future, but I think this is a simple illustration that demonstrates there is some value to allowing many paths of development.
Nobody will ever come up with a perfect projection system. But that doesn’t mean we can’t improve on what we already have.
Dan Szymborski: While I think that projection systems are unlikely to improve to a significant degree (with one notable exception), the tweaks that are applied to systems are generally based on research that improves our knowledge of the game and how players age. It’s always good to have more information even if the utility of that information isn’t large or obvious. What’s the notable exception? Being able to more accurately model how pitchers interact with the defenses behind them will still likely cause us to be able to project pitchers with an additional degree of accuracy. Or, perhaps more accurately put, a lesser degree of inaccuracy!
Ken Warren: I think that trying to project a pitcher’s ERA is basically a futile exercise. Or another way of looking at it … measuring the accuracy of our projections by comparing a pitcher’s actual ERA with our projection is rather meaningless.
A pitcher’s ERA is just as heavily influenced by BABIP, left-on-base percentage (LOB%), and usage pattern as by his skill. Take these pitching lines from 2006 as an indicator of how silly a measure ERA actually is, and how futile it is to try to project it.
There are well over 100 pitchers whose ERAs do not reflect their pitching skill at all. If we used xERA as a measure of pitching skill and as a method of measuring the accuracy of our projections we would be onto something useful and meaningful.
Mitchel Lichtman: I agree with Tango that collectively, there is not a whole lot of room for improvement in our projections. However, to quote, or at least paraphrase, Tango again, “Why do something that is clearly incorrect?” In other words, if we can improve our models just a little bit, why not? As well, and perhaps most importantly, in terms of advising major league teams, and playing in fantasy leagues I guess, if everyone can project player performance 90% of the way, the team/person that can do so 95% of the way is going to get all the money, so to speak. As well, we (as forecasters and sabermetricians) still have a lot of inroads to make as far as trying to project “player development.” In other words, are there some identifiable characteristics of players which might allow us to predict their path of development? That is for both batters and pitchers.
For pitchers, and I think Tango will agree with this, one of the next frontiers in the projection arena (and other important arenas) is using pitch TLV (type, location, and velocity) data. For example, one reason why we often see drastic changes in pitcher performance is that the nature of their pitches and pitching style actually change. Velocity might decrease after an injury for example. If we can identify and understand those changes, we can make (much) better sense of the changes we see in historical performance, and better forecast future pitching performance (skill/talent). This is an exciting and significant frontier, by the way.
As I have said many times in the past, defense independent pitching stats (DIPS), which is what Ken is essentially talking about when he speaks of BABIP and xERA, is a shortcut for regressing a pitcher’s non-HR BIP components, or a “poor man’s regression”. The more data you have, the worse using DIPS is. In other words, when doing projections for pitchers (and batters), the proper regressions should be used, and not DIPS. DIPS regresses a pitcher’s BABIP 100% regardless of the sample size of the data, which is not correct. If you have a lot of historical data on a pitcher, substituting some constant for a pitcher’s BABIP or non-HR component rates is a big mistake. For a little data, it is not much of a mistake at all, and is an OK shortcut.
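The contrast Mitchel draws can be made concrete with a small sketch. One common way to regress by sample size is to weight the observed rate by n / (n + k). The league BABIP of .300 and the constant k = 3000 balls in play below are hypothetical values picked for the example, and the function names are likewise assumptions.

```python
def dips_style_babip(observed, balls_in_play, league_babip=0.300):
    """DIPS shortcut: regress 100% to the league rate, whatever the sample."""
    return league_babip


def regressed_babip(observed, balls_in_play, league_babip=0.300, stabilizer=3000):
    """Regress toward the league rate by an amount that shrinks with sample size.

    stabilizer is an assumed constant: the number of balls in play at which
    the pitcher's own record and the league rate get equal weight.
    """
    weight = balls_in_play / (balls_in_play + stabilizer)
    return weight * observed + (1 - weight) * league_babip


# Hypothetical pitcher with a .280 BABIP allowed over growing samples.
for bip in (300, 3000, 10000):
    print(bip, dips_style_babip(0.280, bip), round(regressed_babip(0.280, bip), 4))
```

With few balls in play the two estimates are nearly identical, which is why the shortcut is fine for small samples. As the sample grows, the sample-size-aware estimate moves toward the pitcher's own rate while the DIPS-style estimate stays fixed at the league rate.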
For the same reason as above, the more data you have on a pitcher, the more ERA and xERA (or DIPS ERA or Fielding Independent Pitching) will converge. Eventually, they will be exactly the same. So it is not folly to try and predict a pitcher’s ERA. It is exactly the same as predicting a pitcher’s xERA. It is just that there will be more random fluctuation around a pitcher’s ERA than his xERA (or DIPS or FIP). That is a fairly common misconception, even among competent forecasters. ERA is a perfect measure of a pitcher’s skill. The only qualification of course is that ERA necessarily includes the defense behind the pitcher. In fact, if you can estimate the skill of that defense (by using something like UZR projections; or even PZR for the pitchers themselves on their historical data), you can even better project a pitcher’s ERA, which will be sort of a combination of his xERA and his defense, adjusted for home park and league of course.
Voros McCracken: Everyone else has covered my first basic point well: if we have additional information that we’re fairly certain is rock solid, what are the costs of using it? Maybe the benefits are minor, but we’re not exactly using abacuses (abaci?) here so the costs are even more trifling. We know that the more a player strikes out, the more (all other things equal) he homers. We know the more he homers the more other kinds of extra base hits per hit he racks up.
My other point is that regarding the projections themselves, the individual systems are only half the equation. The data that goes into a projection system is as important as what the system does with it. Tom tested against players with three full years of major league data, but as anybody who does these knows, that represents a mere subset of the total pool of players people would like to have the info for. The mathematics of the system often help with dealing with situations where a guy has 500 PAs, 70 PAs and 300 PAs in consecutive seasons.
Furthermore, when it comes to translating players from lower level leagues, even if we have perfect minor league equivalencies (MLEs), because of the nature of regression, it may not be valid to use those MLEs. A system set up using data from previous major league performances of regular players carries with it a sample mean that is much higher than the average A-ball player. When the system drags extreme performances toward that mean, such a calculation could be perfectly valid for a major leaguer, but could significantly overrate a low minor-leaguer. In practice I’ve found (when it comes to minor leaguers) this is a bigger problem for pitchers than hitters.
To fix that, again additional complexity has to be built into the system. So I think while additional complexity may not help much in a large number of cases, it can be of benefit in the less common cases that tend to be untidy.
One last thing, while modern computers seem to make something like multiple linear regression easy to do, the percentage of people (in or out of the baseball world) who can actually do the linear algebra necessary to do those calculations by themselves is really very small. Linear regression functionally is a complex formula so I’m not sure Tom is right by treating it as such a simplicity.
Tom Tango: This debate is really a love-in. Who’s got the pillows, and are we going to be playing Lennon tunes?
My pre-initial statement was that the Marcels gives you a regression coefficient (r) of .65, and the maximum you can reach is .75. The sophisticated engines get you to .70. I don’t think anyone refutes that. (All for hitters.)
Now, the question to ask is: what is the cost/benefit of trying to improve on Marcel? Certainly a team paying someone $100 per hour has great benefit for little cost, so of course you want to get off that .65 and move towards that .75.
I’ve got a basic “component-based” projection for hitters here.
And one for pitchers here.
What Voros discussed about the uneven playing time I’ve got here.
I’m sure I can improve upon that by looking for guys with high speed and low power, or various (all) combinations of components, since as Voros said, they are clearly not independent.
I’m not saying not to do all that, but rather to treat it as you would anything in life: I’ve got limited time, so what’s the cost/benefit here?
I offer the masses Marcels as the minimum starting point that any forecasting engine should beat. Then, it’s up to the forecasters to figure out how best to spend their time.
Obviously, for the youngsters with limited MLB time, the Marcels are completely wrong. The interesting cases are undoubtedly these players (and amateurs).
I agree with MGL (Mitchel Lichtman) about the changing pitching approach, guys like Bartolo Colon, I suppose, or any pitcher who had surgery.
And the pitch-by-pitch data would certainly unearth other interesting data. Curt Schilling is Curt Schilling because he has great command of the strike zone. Greg Maddux and Jamie Moyer too. I’m sure there are some pitchers with great tools that simply have no idea how to pitch. While not about forecasting per se, this is the area of interest that is most exciting.
You can gain a million dollars in value with a better forecasting engine. You can gain ten million dollars in value with an understanding of the pitch-by-pitch data. (Or ten and a hundred. Whatever.) Forecasting is the coal to the diamond that is pitch-by-pitch.
David Gassko: But the fact is, most of us do not have access to pitch-by-pitch data. Is the goal then for most forecasters simply to tease out relationships between different variables, and figure out how different statistical categories interact (i.e., a pitcher who allows a lot of doubles one year might allow a lot of home runs the next)?
Also, Ken’s statement that “trying to project a pitcher’s ERA is basically a futile exercise” got me thinking: How do we validate a system? Tom says that the maximum achievable correlation for hitters is .75, and you’ll get .65 with a basic projection system. But how do we measure accuracy? Ron Shandler wrote a great article a few years back (located here) called “The Great Myth of Projection Accuracy.” And also, I’m not much with math, so maybe someone can help me out here: How much of a difference does going from .65 to .70 to .75 make? How much does that decrease our standard error? I have a hunch that the correlation numbers understate the amount of accuracy we can gain by improving our projections. Also, those numbers are for hitters. What about pitchers? Is there more accuracy to be gained in pitcher projections than there is in hitter projections?
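One way to put numbers on that question is the textbook relationship between a correlation and the standard error of a linear prediction: the error, expressed as a fraction of the spread in actual outcomes, is sqrt(1 - r^2). The sketch below assumes the forecasts are linearly calibrated against outcomes, which is an assumption of the example and not something established in this discussion.

```python
import math


def relative_standard_error(r):
    """Standard error of a linear prediction as a fraction of the outcome's
    standard deviation, given correlation r between forecast and outcome."""
    return math.sqrt(1 - r ** 2)


for r in (0.65, 0.70, 0.75):
    print(r, round(relative_standard_error(r), 3))
```

Under that assumption the three correlations correspond to standard errors of roughly 0.76, 0.71, and 0.66 of the outcome's standard deviation, so moving from .65 to .75 shrinks the typical miss by about 13 percent.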
Mitchel Lichtman: While pitch by pitch (or play-by-play) data is nice, it is not necessary for a good forecasting model. Again, the shorter the run, the more help we get from granular data (to tease out the luck). I’m not sure that Ron Shandler uses PBP data for his forecasts and he is very good. And yes, one of the important, and often overlooked, aspects of a top-notch forecast is understanding how components relate to one another—such as a player’s fly ball doubles and home runs.
As far as measuring “accuracy,” that is a can of worms. There is not one universal method of measuring accuracy. A correlation coefficient tells us one thing, a least squares method, though similar, tells us something else, etc. Let’s say that one forecaster has a .7 correlation coefficient among all “qualifying” (whatever that is defined as) players and another one has a .72, but the .7 guy does a lot better job with the tougher forecasts (say minor league players). Who did the better job overall? What about if we just concede that all forecasters can do a credible job of projecting established players in the middle of their careers and we just look at the tough players? What if we look at all players who had an anomalous year, good or bad, as compared to their history and as compared to the average forecast, and we see who did the better job with those players? In other words, was anyone able to “predict” such an anomalous year? So depending upon how we want to define and measure “accuracy,” one forecasting method or another can come out “on top.” Certainly if you have a credible overall methodology, you should be in the same ballpark with other credible forecasts, in the “r” or least squares department. If you are not, you are probably doing something wrong. On the other hand, that does not necessarily mean that you are not top-notch in handling some subset of forecasts, like minor league players or players in general with limited major league histories, or perhaps part-time players who become full-time players.
By the way, accounting for injuries plays a pretty large role in forecasting. Some methods do it better than others (I assume). I have to admit that I don’t do it very much at all since I try to “automate” my projections (let the computer and databases do all the work) as much as possible.
Tom Tango: Basic pitch-by-pitch data is published by Retrosheet every year. It can also be spidered from MLB.com or ESPN.com.
As for testing the accuracy, for hitters, using OPS as the benchmark (or better yet, 1.8*OBP + SLG, or even better Linear Weights per PA) is good since that encompasses each of the components. And in the end, linear weights is what we care about anyway. We don’t really care if you got the home runs or walks right, as long as you got his production.
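A test along these lines might look like the sketch below, which scores hitters on the 1.8*OBP + SLG benchmark and then compares projected against actual values using both a correlation and a root-mean-square error. The player values are made up for illustration and the function names are hypothetical.

```python
import math


def weighted_ops(obp, slg):
    """The benchmark mentioned above: 1.8 * OBP + SLG."""
    return 1.8 * obp + slg


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(xs, ys):
    """Root-mean-square error between forecasts and outcomes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical (OBP, SLG) pairs: projected vs. actual for four hitters.
projected = [weighted_ops(o, s) for o, s in [(0.350, 0.450), (0.320, 0.400), (0.380, 0.520), (0.300, 0.380)]]
actual = [weighted_ops(o, s) for o, s in [(0.340, 0.470), (0.330, 0.390), (0.360, 0.500), (0.310, 0.360)]]
print(round(pearson_r(projected, actual), 3), round(rmse(projected, actual), 3))
```

The two measures answer different questions: a set of forecasts that is uniformly too high can still correlate almost perfectly with outcomes while carrying a large RMSE, which is one reason a single accuracy number can mislead.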
For pitchers, using ERA is problematic, because, as Mitchel noted, it is heavily influenced by a team’s fielders, and a pitcher’s timing. (That’s why you can’t get as high an “r” for ERA.) We’re much more interested in his components. So, we’d want to create a component ERA. But, that is also problematic, since that has some fielder influence as well.
David Gassko: But why linear weights, necessarily, and not, say, fantasy points? Most interest in projections comes from fantasy players anyways. Also, Tom said, “For pitchers, using ERA is problematic, because … it is heavily influenced by a team’s fielders, and a pitcher’s timing. We’re much more interested in his components. So, we’d want to create a component ERA. But, that is also problematic, since that has some fielder influence as well.”
This got me thinking: Most projections don’t really incorporate playing time. Players are projected for roughly the same number of plate appearances (400-600), and their stats based on that. But shouldn’t we be more interested in projecting playing time, if the goal of a projection is to say, “This is what we expect to happen next year”? The same goes for defense and timing. If we know that a pitcher continually does better than his component ERA, or that he has a good defense behind him, why not adjust for that?
Mitchel talked about projecting injuries: That falls into the same category. Why not try to project exactly what will actually happen (probably by running a million simulations)? I think only Diamond Mind actually does this.
Tom Tango: Right, it’s whatever “production” you are actually wanting to measure. For a general manager that’d be linear weights. For a fantasy guy, it would be fantasy points.
As for a pitcher, I’ll take it back then. It would have to be ERA for the general manager, if the general manager believes that a pitcher can time his performance. If not, then it’d be component ERA. (Same for hitters. Perhaps to a GM, he wants to measure runs and RBIs.)
As for playing time, Marcel does forecast this.
And, the playing time issue is interesting. The more a player plays, the higher his actual production. So, here’s a secret to helping in beating a forecasting system: if all the forecasting systems agree to a 400 PA cutoff, then you have to adjust your forecasts up for the guys who were semi-regulars in the past. If they don’t reach 400 PA, then you know they did their usual crap. If they get past 400 PA, then you know they did good!
So, I would say that what you should do in comparing is some sort of runs above replacement. After all, the GM is really trying to figure out how much to pay a guy. And runs above replacement would be the best approximation.
Mitchel Lichtman: I agree that what you want to measure and test depends on the context. I do think it is a good idea to test the accuracy of all the components, as that eliminates the possibility that you nailed a hitter’s OPS or linear weights (or pitcher’s ERC or ERA) accidentally (if your forecast is four times too high for walks but two times too low in home runs, but the linear weights or OPS is right on, did you really do a good job or did you just get lucky?).
As for pitchers, I don’t think there is much, if any, evidence that pitchers can control their ERA above and beyond their component ERA. In other words, I think it is a safe assumption that a pitcher’s real ERA will always converge to his component ERA (as long as the component ERA is normalized properly of course).
What I do is project a pitcher’s components and then turn that into a normalized component ERA (NERC). That way one pitcher can be compared to another, as long as you are comparing starters with starters, relievers with relievers, etc. If someone wants to turn an NERC into a regular ERA, they can easily do that by incorporating league average ERA, home park, estimated defense, etc., although it is not necessary.
If I were to test the accuracy of a number of pitcher forecasts, I would either have the forecasters translate their projections into a regular ERA or have them forecast the components and then compare projected ERC (component ERA) to actual ERC.
David Gassko: Mitchel talked about the importance of comparing starters to starters and relievers to relievers, while Tom mentioned how players do better than expected once they reach 400 plate appearances. Essentially, what that tells us is that we need to make all kinds of adjustments to our projections based on projected role, as well as other adjustments for things like the “Coors Hangover” effect, old players skills, whatever. How important are these adjustments in your opinion, and how far should we take them?
Mitchel Lichtman: They are very important, some more than others, and often what separates the Rolls Royces of projections from the Yugos. I think I am dating myself with the Yugo reference. It’s not even in the Word dictionary!
Chris Constancio: First, regarding assessment and validation; this is a very important issue. A lot of baseball researchers are very happy with systems that result in high overall correlations without addressing basic diagnostics such as looking at the spread of the errors. In any other industry but baseball research, those researchers would be laughed out of the room.
There is actually a very close parallel with the development of run estimators here. Some people are happy with simple run estimation methods that result in a low RMSE when predicting MLB team run scoring rates. That’s fine for some purposes, but those systems can also be atheoretical and fall apart in unusual contexts. Similarly, basic projection methods are fine for many purposes and get you 90% there if you are looking at a certain subset of ballplayers. But they also fall apart for players outside the range they were designed for and don’t contribute much to our theoretical understanding of how players develop and how contexts influence performance.
A system like Marcels only works well for about 1/10 of all professional baseball players, so I think there’s plenty more to accomplish in this area. I understand that some people are only really interested in 1/10 of all pro baseball players and so Marcels or something only slightly more complicated is probably good enough for their purposes.
Second, regarding role/context effects; I agree they are very important. On the other hand, it’s very frustrating (from a forecaster’s perspective) to try and guess what a team will do with certain players before April. I suppose a perfect projection system would include several sets of projections for one player to account for different scenarios. On Opening Day, the reader (the front office person, the fantasy player, or whoever) can make use of whatever tweaked projection makes most sense.
Ken Warren: I think that the problems with ERA are primarily BABIP and strand rates which are out of the pitchers’ control. It seems to me that random fluctuation is more of a factor than team defense as BABIP will fluctuate dramatically even for pitchers on the same team.
Component ERA is definitely something that can reasonably be projected and is also the method by which our projections should be evaluated.
I’d like to add that I strongly agree with what Mitchel said regarding component ERA.
Mitchel Lichtman: Forecasting playing time is not something I do. I think that is more in the purview of fantasy baseball. It is also more of an art than a science, obviously, as one must be able to anticipate how a player is going to be used by a certain manager and whether he is going to be used at all. That is for marginal players of course. Many players are always going to be starters (batters) and the only thing to really forecast as far as playing time is concerned, is chance of injury. And I’m not sure there is a whole lot that can be done in that area (predicting injury).
If you are forecasting for a team, the only thing you are really interested in is providing them with a “rate” projection. Of course, they might want to know how “reliable” or durable a certain player is, but I don’t think one can really provide a credible estimate as to playing time/chance of injury. In fact, trying to project playing time, or durability if you will, for a position player at least, is an overrated area I think. I think that a player’s injury/health history has marginal predictive value at best.
For pitchers, it is a little trickier. Some pitchers appear to be more durable than others, as far as innings per year and innings per game. And of course, we find that pitchers have vastly different performance results depending on their role. For example, we find that the difference between starting and relieving, for any given pitcher, is more than a run in ERA (it depends on their roles—for example, a pitcher who relieves but rarely starts and is not used to starting appears to do terribly worse when he starts, whereas pitchers who are swingmen appear to pitch less than a run better in ERA when relieving). That is not to say that if you have an ineffective starter and you turn him into a reliever, you can automatically expect him to pitch a run better when relieving (or vice versa). It probably takes a trained eye—a coach, manager, GM, scout, etc.—to be able to ascertain what role might be best for a pitcher. I don’t get involved in trying to project a pitcher’s performance depending upon his role. If a pitcher is a reliever, I project his reliever performance. If he is a starter, I project his performance as a starter, and if he is a swingman, I assume he is going to be a swingman in the future.
Ken Warren: There are four basic reasons why pitchers have better ERA as relievers than starters.
a) By pitching a minimal amount in each appearance, fatigue is not a factor,
b) The opposition doesn’t usually get to bat against them a second time in the same game
c) They frequently come into an inning with one or two outs already achieved, thus radically reducing the chance of being charged with an earned run in the first inning they pitch in
d) Walk-off innings. If a game is over they don’t have to deal with the mess they created. For example, suppose a pitcher is pitching in the bottom of the 9th in a tie game and allows single, single, single, walk; he is charged with one earned run. If this happened to a starter he would be facing three or four earned runs allowed.
Mitchel Lichtman: Actually, the main reason is that they can throw as hard as they can for an inning or so, which is the same thing as Ken’s “a”. The others are minimal reasons. The way we know that (at least that c and d are minimal) is that if we use ERC (which eliminates the problem of “c”) and remove “walk-off innings”, we still come up with a large difference between relieving and starting.
Also, when we say the “difference between starters and relievers” we mean given the same pitcher. Starters are much better pitchers than relievers, as a whole. It is just that when a given pitcher relieves, he is able to pitch much better than he does as a starter.
Tom Tango: Right, in The Book, I looked at component ERA, and concluded a 0.80 to 1.00 ERA difference between the role of starter and reliever, for the same pitcher.
The first-time-through the order effect is real, but not enough to explain the difference. Throwing “all out” is the prime reason.
References & Resources
xBA, xERA, xOBA and xSLG, speed rating and “contact rate” are all proprietary statistics developed at BaseballHQ.