Man vs. computer

A few days before the start of the 2009 season, I wrote a column here titled, “29 players I think the THT projections got wrong.” The title is pretty self-explanatory, but let me quote the introduction to that column so that you know where I was going with it:

Each of the past three years, we’ve released projections for thousands of players, and each year, I have received tons of e-mails relating to specific players readers think we have over- or underrated. Frankly, I’m with the readers—our system is very good, but it is not perfect. Sometimes, I think I know more than it does, and today, I’ve decided to test that thought.

What follows, then, is a list of 15 hitters and 14 pitchers who I think will either over- or underperform their projections, with my reasoning explained. I formed this list without looking at other projection systems, since the idea here is to figure out if human intuition can beat a computer-based system, rather than trying to find areas where some other projection system outperforms THT. At the end of the season, I will check in to see if my hunches were correct, or if the computer knows best.

To be clear, I selected only hitters projected to have at least 500 major league plate appearances and pitchers projected to have at least 100 major league innings pitched; I wanted to avoid, as best as possible, players who won’t play much in the major leagues in 2009.

The comments I got on that column were mostly skeptical; Mitchel Lichtman, a former senior advisor to the Cardinals, for example, put it bluntly: “I am always skeptical of these, ‘I can beat a good forecast system just by looking at the forecasts,’” he wrote. Fair enough.

A commenter on Baseball Think Factory was of the same opinion: “I expect the outcome will be that DSG (those are my initials) can’t beat the computer.” Frankly, I felt the same way. Still, intuitively, the 29 projections I challenged looked wrong to me, and I figured it was worth it to find out if my gut actually could see something that a computer cannot.

Now that the season is over, we can answer that question, so let’s get to the results.

First up are the hitters. Let’s start with those I thought would do better than their projections. Those were Justin Upton, Alex Gordon, Delmon Young, Robinson Cano, Ichiro Suzuki, Evan Longoria and B.J. Upton. Right away, it’s easy to see that some of these hitters indeed beat expectations while others actually went the opposite way.

Overall, though, we projected this group to have a .779 OPS (weighted by the number of plate appearances each player had this season). In actuality, they posted an .819 OPS, which amounts to a 40-point difference! (Actually, 41 after rounding.) So far, so good—I thought these hitters would beat their projections, and in sum, they sure did.

So what about the hitters I thought would do worse than we projected? That list consisted of Chipper Jones, David Ortiz, Miguel Cabrera, Mike Napoli, Carlos Delgado, Jack Cust, Ryan Howard and Chris Davis. Again, we have a fun mix of guys, and overall the THT projections had them posting a combined OPS of .934 this season. Instead, they posted an .843 OPS, or 91 points below expectation. That’s another big win for me—I’m two-for-two!
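For anyone who wants to check the arithmetic, here is a minimal sketch in Python of the weighting used above, with made-up player rows standing in for the real 2009 lines: each player’s projected and actual OPS is weighted by the plate appearances he actually got, and the two weighted averages are compared. (Strictly speaking, a group’s combined OPS should be rebuilt from the underlying on-base and slugging components; the PA-weighted average of individual OPS is the shortcut described above.)

    # PA-weighted comparison of projected vs. actual OPS for a group of hitters.
    # The rows below are placeholders, not the real 2009 stat lines.
    players = [
        # (name, actual PA, projected OPS, actual OPS)
        ("Hitter A", 650, 0.780, 0.830),
        ("Hitter B", 520, 0.760, 0.815),
        ("Hitter C", 600, 0.795, 0.810),
    ]

    def weighted_ops(rows, col):
        # Average the chosen OPS column, weighting each player by his actual PA.
        total_pa = sum(r[1] for r in rows)
        return sum(r[1] * r[col] for r in rows) / total_pa

    projected = weighted_ops(players, 2)   # column 2 = projected OPS
    actual = weighted_ops(players, 3)      # column 3 = actual OPS
    print(f"Projected {projected:.3f}, actual {actual:.3f}, "
          f"a gap of {round((actual - projected) * 1000)} points")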

Let’s move on to the pitchers. I thought that Dan Haren, Clay Buchholz, Rich Harden, Mark Buehrle, Edinson Volquez, Zack Greinke and Francisco Liriano all would beat their projections. Perhaps Greinke’s name tips you off as to how I did with this group—overall, our projections had them posting a 4.33 ERA, but they blew that out of the water, combining for a 3.63 ERA instead. That’s a huge difference, and I have to say, my predictions are looking good thus far.

We still have one more group to look at, though, and that’s the pitchers I thought our projections overrated. They were Derek Lowe, Fausto Carmona, Jeremy Bonderman, CC Sabathia, Justin Duchscherer, Dana Eveland and Joe Blanton. Our projections thought these guys would be good for a 3.69 ERA this season; instead, they came in at 4.58, a whole 89 points worse than expected! Clearly, I’m a genius.

Or am I? After I wrote my column, some suggested that my predictions were indeed going to be right, but that rather than being a sign of my extraordinary brilliance, it was merely an indication that the THT projection system was not very good. That’s a double whammy for me—not only does it call into question my intelligence, but I also designed the guts of the THT projection system. I think it’s fair to ask whether there is some bug in the design of our projections that allowed me to beat them.

To answer that question, I looked at what another projection system said about the four groups of players we just examined. Essentially, since I did not consult any other projection systems when making my predictions, the other system can be used as an independent control: If my predictions turned out to be right simply because I was taking advantage of some defect in the THT system, another system would have projected these players correctly. If, on the other hand, my gut was able to see something a computer could not, any computer-based system would have been off for these players.

I turned to CHONE, which has been shown to be one of the best projection systems over the past few years. CHONE is also a completely computer-based system, making it ideal for this test. One caveat: because I singled out the THT projections I thought were most wrong, simple selection effects mean an independent system should, on average, land closer to the truth for these players no matter what; if I had chosen the 29 CHONE projections I hated most at the beginning of the season, the THT projections too would have been closer to the truth.
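To see why that has to be true, here is a toy simulation in Python (all numbers invented). It draws an outcome for a large pool of players, gives two projection systems independent errors around that outcome, singles out the players where System A missed worst, and compares the two systems’ average miss on that hand-picked group. Selecting on A’s realized misses is a stand-in for “the projections I hated most,” under the assumption that my gut was at least correlated with where THT was off.

    import random

    random.seed(1)

    N_PLAYERS = 1000   # simulated player-seasons
    N_PICKS = 29       # how many of System A's "worst" projections we single out

    players = []
    for _ in range(N_PLAYERS):
        outcome = random.gauss(0.760, 0.060)        # the season the player actually has
        proj_a = outcome + random.gauss(0, 0.050)   # System A's projection (independent error)
        proj_b = outcome + random.gauss(0, 0.050)   # System B's projection (independent error)
        players.append((outcome, proj_a, proj_b))

    # Take the players where System A missed by the most.
    worst_for_a = sorted(players, key=lambda p: abs(p[1] - p[0]), reverse=True)[:N_PICKS]

    err_a = sum(abs(p[1] - p[0]) for p in worst_for_a) / N_PICKS
    err_b = sum(abs(p[2] - p[0]) for p in worst_for_a) / N_PICKS
    print(f"Average miss on the selected group: System A {err_a:.3f}, System B {err_b:.3f}")

Because System B’s errors are independent of how the group was selected, its miss on System A’s worst projections is about the same as on any random group of players, which is exactly why CHONE figured to land closer than THT on these 29 players regardless of whose engine is smarter.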

Let’s start with the hitters I thought would beat their projections. CHONE predicted they would post a .796 OPS, a number they bested by 23 points, OPS’ing .819. As for the hitters I pegged to underperform their projections, CHONE thought they would combine for an .898 OPS as a group; they actually finished with an .843 OPS, which is 55 points worse.


The pitchers I thought would outperform expectations got a 3.97 ERA projection from CHONE; they beat that by 0.34, posting a 3.63 ERA. CHONE gave an overall projection of 3.75 to the pitchers I thought would underperform; instead, they had a collective 4.58 ERA, a whole 83 points worse.

To recap:

Hitters (OPS)    THT     CHONE   Actual
Better           .779    .796    .819
Worse            .934    .898    .843

Pitchers (ERA)   THT     CHONE   Actual
Better           4.33    3.97    3.63
Worse            3.69    3.75    4.58

Overall, the CHONE results confirm that my predictions were spot-on! Though the CHONE projections were closer to the ultimate truth than THT’s, they still were too pessimistic about the players I thought would beat their projections and too optimistic about those I saw faltering. In other words, I do appear to be some sort of genius.

Well, not really. For one, I have no idea why I was able to beat two very good projection systems at their own game. My expectation was that a computer would be much better at assimilating a lot of statistical information into one final prediction than the human brain, and while I still do believe that to be the case, it does appear that we humans can see something computers do not.

Looking at the hitters I thought would beat their projections, I saw a lot of special skills: most of those players are young, but all are very talented. Not all have capitalized on their abilities (*cough* Delmon Young), but overall, I think this is the kind of situation in which a scouting eye can tell you something that cold, hard numbers cannot (not that I have a scouting eye, but even I can see insane talent like the Upton brothers).

The hitters I thought our projections overrated were mostly some combination of old, fat and strikeout-prone. That’s never been a good combination, and though the statistics should bear that out, perhaps the computers aren’t as quick as a human to catch on when a player is going to falter because of those factors.

The pitchers are a little more difficult to classify. The only thing that really jumps out at me is that I liked a lot of high-strikeout guys, while a lot of the pitchers I didn’t like are below-average at whiffing hitters. I think it’s very possible that pitchers with big arms often can break out in a way their past statistics would not predict, while those with low strikeout rates walk a very fine line between successful major leaguer and batting practice tosser. Perhaps we humans are a little better at seeing that line than computers.

But maybe not. Honestly, though I am fairly convinced that computer projections are not perfect, and that a baseball-crazed human being can pick out some numbers that just aren’t right, regardless of their statistical validity, I can’t say at this point that I know why. The human mind is a complex machine, more complex than any supercomputer yet built, and so it is not simple to decipher exactly what processes allow us to better a computer projection with our gut.

The important lesson here, however, is that human analysis does indeed have something to add in understanding a player’s abilities and talents beyond what a computer projection will tell you. Computer projections are very good, and 99 percent of the time, they’re as good as or better than what we can do, but that other 1 percent—well, that’s where we analysts come in handy.


16 Comments
Jeff
14 years ago

Not to quibble, David, but while you clearly beat the forecasting system using groups of players, from an individual standpoint it doesn’t seem as if you hit on a significantly higher percentage. Just to look at the pitchers, for example: Greinke was better, Buchholz was worse, Haren was better, Volquez got hurt and was exceptionally lucky (given his walk rate) until then. Buehrle put up the second-highest FIP of his career, same for Harden, and, well, Liriano was just plain shock-the-lead-out-of-your-pencil bad. As for Dan Haren, he had a great season, but it seems in line with expectations. Similar ERA to ’07 and ’08, and a FIP higher than ’08 but lower than ’07. I’d try this with the others, but, well, I am ostensibly working today.

Marty Winn
14 years ago

A computer model is only as good as the information it is given. You obviously have some gut feelings about how players are likely to perform. That needs to make its way into a test (like you did) to see if these characteristics (fat, age, pitch velocity) have an effect on the player beyond what the base stats show. If you get a correlation, put it in the projection model. Of course, some of this is unquantifiable and might make the equation unworkably complex.

David Gassko
14 years ago

Hey Jeff,

I would argue that all we care about is how I did with each group overall, but to be clear, a poster on another site did the math and says that 21 out of 28 picks were correct (not counting Duchscherer, since he didn’t pitch this season), which is a very high percentage.

Jeff
14 years ago

David, that is indeed a great rate. It’s nice to know not everyone is lazy.

ecp
14 years ago

@Jeff – That was my first thought as well…Why reassess in groups when the original assessments were done on the individual level?  Also seems to me that those who did not meet the minimum innings pitched (100) or plate appearances (500) requirements stated in the beginning should have been tossed as not counting. 

David, I haven’t looked at the results person by person, but first glance tells me that you were correct about 50% of the time.  Plus a guy like Greinke really skews the whole group of those pitchers you expected to outperform.  He was SO much better than the 4.44(!) projected that it makes everybody in the group look good, when some underperformed (see Liriano).

I have to admit that when I read this in the spring, I was going to accuse you of cherry-picking players; then it occurred to me: Of course you were cherry-picking! That’s the point. You went after those you thought were clearly wrong. And that, to me, may explain why you were able to “beat” two systems – because you could inject the “educated guess” element into the projection, which clearly helped you get some of these correct. To wit: Of David Ortiz, you said “I have a hard time believing he will rebound to his previous heights,” and, despite his coming on after June 1, you were dead right, because expecting a .951 OPS from Ortiz was just insane. Or Greinke: “the Zack Greinke we saw last season (2008) is much more likely to be the player we see going forward,” a statement which anybody who saw him knew intuitively was likely true.

So all that means what, exactly?  Just that no projection system is perfect.  There will be hits and there will be misses.  It’s a guide, not a gospel.

ecp
14 years ago

Just saw your comments on 21 of 28 being right, so my 50% off-the-top-of-my-head assessment is wrong. But I still think guys like Edinson Volquez, Jeremy Bonderman, Carlos Delgado, and Alex Gordon should be tossed because their lengthy injuries severely curtailed their playing time. As should guys who spent a large amount of time in the minors for effectiveness reasons, such as Dana Eveland and Chris Davis.

Bruce
14 years ago

Claiming victory over CHONE is validating a prediction you didn’t make. While it’s interesting that the actual performances were “more” than predicted by CHONE, in the aggregate groups, CHONE’s over/under was as correct as yours.

David Gassko
14 years ago

I don’t think I really get your comments, Bruce. CHONE was used as an independent control, to see if the problem was not computer generated projections but THT projections specifically. It wasn’t. And like I said in the article, it is a mathematical fact that the CHONE projections would be closer to the truth than THT’s (provided THT’s were off) because I picked the worst THT projections I could find (worst of course being my own subjective, though now confirmed, opinion). If I had gone through the CHONE projections and done the same, using the THT projections as an independent control, the THT projections would have been closer to the truth, though, this study suggests, still off.

David Gassko
14 years ago

“As should guys who spent a large amount of time in the minors for effectiveness reasons, such as Dana Eveland and Chris Davis.”

How does that make any sense? Why shouldn’t I be rewarded for pinpointing guys that would underperform their projections, just because they underperformed so badly they were sent down to the minors?

ecp
14 years ago

“Why shouldn’t I be rewarded for pinpointing guys that would underperform their projections, just because they underperformed so badly they were sent down to the minors?” 

Point taken. I was thinking that, however, because at the beginning you said you wanted to pick out only those guys who would have at least 100 IP or 500 PA; so, in my mind, anybody who doesn’t meet that threshold doesn’t get included. Especially when one of them is a guy like Bonderman, who spent most of the year in the minors for reasons other than ineffectiveness and pitched a grand total of ten innings.

Bruce
14 years ago

Using CHONE as a control is an effective approach, and I agree with the broader idea that human input (intuitive or otherwise) can help us learn about and improve projection systems. I have a strong reaction to at least a handful of projections each year, and I enjoyed reading about yours (which are surely more likely to prove correct than mine).

My earlier comment is a reaction to…

“For one, I have no idea why I was able to beat two very good projection systems at their own game.”

…since you didn’t. That’s a separate conclusion from the fact that the projection systems missed in the same direction. Had you looked at CHONE’s numbers at the start, you might have chosen to add or omit particular players, forming different groups.

Jonathan
14 years ago

What happens if instead of downweighting low PA players like Alex Gordon, we add in a league average or replacement level player for the rest of the PAs?

David
14 years ago

DG:

Whilst most of the analysis is a bit of puffery, you hit the nail on the head in the final paragraph: i.e., that there are instances where statistical projection just misses the boat, and that is where the human eye is necessary.

You did not actually “beat” the system at its own game; you looked through projections which are based on imperfect information, identified the handful that did not have face validity, and then predicted that x would be better than their projections, and y would be worse.  If you wanted to assess your ability to beat the system, you would of course randomly select 30 players, make some sort of guess at how they would do, and then compare your projections to what THT or CHONE or whatever other system projects.

In fact, you got (apparently) 21 of the 28 directionalities “right” (though I do not see the margin of error – is it possible, for example, CHONE predicted that David Ortiz would post an OPS of .800, you predicted he would do worse, and he posted an OPS of .795?) 

As others say, a better metric would be to assign some margin of error, and then look, individual by individual, to see how much better you had done, rather than to create an amalgamation of data and then claim victory.

All in all, however, your point is true – that there are cases when projections based solely on statistics will be off, and that there needs to be a human hand on the tiller.

Bill P
14 years ago

Great post.  I’m not really sure what goes into THT or CHONE, but I assume the predictions are based almost entirely on past performance.  We humans read all kinds of baseball stuff, though, not just numbers-based stuff, and absorb it all as “common wisdom”. 

A lot of the common wisdom is basically scouting.  Writers quote scouts saying “this guy’s skills are way better than what we’ve seen so far”, while for others we hear that they’ve been overperforming their skills.  Although I trust numbers more than scouting, I still think scouts can add information that can’t be quantified. 

When you say that some of the predictions just didn’t look right, I’ll bet these distilled scouting observations play a big part in that.

Swami
14 years ago

Thanks for a very interesting study.

The computer is an excellent processor of objective data, while you are an imprecise processor of both objective and subjective data.  In most cases, the computer probably comes out better, because of the precision of its processing (you can test this by projecting random players).  However, in a few cases, the influence of subjective factors (and errors in estimating impact of objective factors) will be larger – in those cases the computer projection will be significantly off.

In some of those cases, your imprecise mental processor that takes subjective factors into account shows a large deviation that you have good confidence in.  You correctly identify these as outliers, and the results vindicate that your mental equipment was good at identifying outliers.

The very interesting result is that in those specific cases, you beat another projection system as well, though by a lesser amount.  This lets us separate the two effects: effect of missing subjective data, and effect of inexact model.  Since a different model need not be inexact in the same direction, the fact that it covered only half the discrepancy shows that the rest is attributable to subjective data (i.e. data we haven’t learned to properly quantify yet).

I guess this outcome applies to all modelling (even scientific modelling). The best of models will still have errors due to gaps between model and reality. Human intuition has a chance to avoid these gaps. However, the imprecision of its computing engine means that in most cases the computer model will outperform the human. Of course, the likelihood of humans outperforming the computer model is significantly higher in complex problems such as baseball performance (and economics!) than in mature, (mostly) deterministic worlds such as physics.

Unsurprising result, I think, though this may be a lot harder to do a decade or two from now when computer models mature even more.  But excellently performed study, thank you.

Alex Zelvin
14 years ago

What’ll be really interesting is if you can figure out what’s missing from the projection models that could make them more accurate.

There’s something that I’ve never heard anyone discuss that could explain some of these. Assume you’ve got a 25-year-old pitcher and a 39-year-old pitcher. Both have put up very similar stats for the past five years… in fact, so similar that once adjusted for the expected improvement (due to age) of the 25-year-old and the expected decline of the 39-year-old, we have the exact same projection for them this year. Let’s say they both get off to incredibly good starts for the first month of the season. Our system is going to project similar stats for the rest of the season for each pitcher. But isn’t it more likely that the hot start indicates a true change in ability for the 25-year-old than for the 39-year-old? And wouldn’t a slow start be more likely to indicate a true decline in the 39-year-old? I don’t think any current projection systems weight recent performance more heavily if it’s in the ‘direction’ that would more likely indicate a true ‘breakout’ or ‘collapse’ for the player, but this may be what was missing for guys in your example like Greinke and Ortiz. Maybe their 2008 performances should have been weighted more heavily, because at their ages those performances were more likely to indicate a true change in ability.