Reaffirming our faith in DIPS

Are we really so sure that Matt Cain has some sort of special home run-preventing skill? (Icon/SMI)

Last week, there evolved a long discussion about DIPS, regression to the mean, xFIP, HR/FB, BABIP, and forecasting in general over at RotoWire. I linked to it here at THTF, and a few of you posted comments to it. There was one comment in particular that I wanted to respond to at length:

Dan Haren is singlehandedly destroying my faith in FIP, xFIP, and SIERA.

I’ve kept him on my team all year long and he just continues to kick me in the nuts, like today with his 7 earned runs allowed. His K/BB is great, his strikeout rate is good, yet his BABIP continues to be off-the-charts high.

Meanwhile, I dropped Tim Hudson in early May because I thought his .240 BABIP was not sustainable and his K/BB was barely at 1.0. Hudson just keeps rolling along. His BABIP is now .235.

I’m beginning to think too much knowledge is a bad thing. I make moves based on underlying peripherals and with the thought of “regression to the mean” in mind, and I’m behind owners who pick up Carlos Silva and Livan Hernandez.

Another commenter followed with:

You need to look at the actual player too. Some players are good at bettering the stats while others don’t live up to them. Matt Cain seems to better them while someone like David Bush does not. Bush had 2 seasons with a 1.14 WHIP and his ERA was like 4.4 and 4.2.

The coin flip analogy

While Matt Cain has posted better-than-average HR/FBs for a few years now (probably the best and longest we’ve seen since batted ball data has become available), that doesn’t necessarily mean he’s any better at preventing home runs on fly balls than Dave Bush. Think about it this way: If we have 8,000 fair coins and we flip them, probably 4,000 will land on heads and 4,000 on tails. If we take the “heads” coins and flip them again, about 2,000 will land on heads again. Flip those, and you get 1,000 of them landing on heads. Do this another nine times, and you’ll probably end up with two or three coins landing on heads each time.

But are these coins any different than the others we’ve been flipping? Is there something special about them that makes them more likely to land on heads than one of the original 4,000 to land on tails? Of course not. I told you in the beginning that they were fair coins. So if we flipped those last two or three another 8,000 times each, I’ll bet you they land on heads close to 4,000 times each.
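The halving in the example can be checked with a few lines of arithmetic. This is a minimal sketch, assuming fair coins and independent flips; the function name `expected_survivors` is made up for illustration.

```python
def expected_survivors(coins: int, flips: int, p_heads: float = 0.5) -> float:
    """Expected number of coins that land heads on every one of `flips` flips."""
    return coins * p_heads ** flips


# 8,000 fair coins after 1, 2, 3 and 12 straight rounds of "keep the heads".
for flips in (1, 2, 3, 12):
    print(flips, expected_survivors(8000, flips))
```

After 12 rounds (the first three plus another nine) the expected count is 8,000 / 4,096, or just under two coins, which is where the "two or three" comes from.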

While it’s hard to view humans in this way, we do know that humans don’t have ultimate control over everything in a baseball game and that random chance is involved. If it weren’t, we’d have a much easier time projecting performance.

But which coins will they be?

Most players are clustered toward the middle, but when a dataset is distributed normally, there will always be a few outliers in the 0.2% area.

We know that stats like HR/FB follow a (relatively) normal distribution (the same as our coin flips would). They form a bell curve (of sorts), with most players clustered toward the middle, but there are always outliers who are far removed from the middle. We also know that these outliers are rarely the same from year to year—the same as if we performed our coin flip exercise several times and marked each coin, we wouldn’t end up with the same two or three coins at the end of each trial. They’d always be different coins, even though we could be certain that we’d always end up with two or three of them. But predicting precisely which two or three would be impossible to do beforehand.
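The "mark each coin and repeat the exercise" idea is easy to simulate. The sketch below is only an illustration: it assumes fair coins, numbers them 0 to 7,999, and uses an arbitrary fixed seed so the run is repeatable. The name `run_trial` is hypothetical.

```python
import random


def run_trial(rng: random.Random, coins: int = 8000, flips: int = 12) -> set:
    """Return the IDs of the coins that land heads on every flip in one trial."""
    return {
        coin
        for coin in range(coins)
        if all(rng.random() < 0.5 for _ in range(flips))
    }


rng = random.Random(2010)  # arbitrary seed, chosen only for repeatability
trials = [run_trial(rng) for _ in range(5)]
for number, survivors in enumerate(trials, start=1):
    print(f"trial {number}: {len(survivors)} survivors -> {sorted(survivors)}")
```

Each trial tends to leave a handful of survivors, but the IDs printed for one trial should rarely show up again in the next, which is the point about outliers changing from year to year.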

And the same holds true for things like BABIP and HR/FB. Sure, Livan Hernandez and Tim Hudson are having years where their ERAs don’t match their peripherals. But ask yourself this: How long do you expect them to continue doing that? If you don’t answer “indefinitely, because they truly deserve low BABIPs and HR/FBs,” then don’t beat yourself up. There’s nothing you can do, because the fact of the matter is, they are getting lucky. For the 2010 season, they are those final two coins remaining from the 8,000 flips. And it’s as simple as that.

And I put my money where my mouth is. I happen to own both Livan Hernandez and Carlos Silva in LABR NL this year (part of a strategy that involved owning a few crappy pitchers), but despite their successes, I’ve only used Livan for 87 innings and Silva for 70. I have begun to start Silva regularly over the past couple of months, though, because he’s combined legitimately good peripherals with a change in approach. Our coin flip example would still hold for him to an extent, because no one expected him to outperform his projections to this degree unless they scouted him in Spring Training and noticed his improved change-up, improved breaking ball, renewed control, etc.

Second-half splits

To go along with this, I wanted to bring up one last comment from a post I made at the CardRunners site:

Dan Haren is an example of a first half ace. He’s a bum every second half… not only does his ERA jump about a run (3.29 to 4.22), but his WHIP goes from 1.10 to 1.31.

First half ERAs from 2006 to 2009: 3.52, 2.30, 2.72, 2.01. Second half from 2006 to 2009: 4.91, 4.15, 4.18, 4.62

Like with BABIP and HR/FB, “second-half ERA” is a stat with lots of variation. It takes many years to stabilize, and because it’s normally distributed, there will always be outliers, especially when dealing with smaller samples. In Haren’s case, we are dealing with a small sample of four poor second halves (plus two years where his second half was better than his first), so claiming that he’s merely a “first-half ace” may be a bit hasty.

So does that mean we know nothing?

No, it doesn’t. Just because it’s possible that Matt Cain is a true 11% HR/FB pitcher doesn’t mean that he absolutely is. Along with knowing that we’re looking at a mere sample and that what we’ve seen could be simple random variation, we have seen something. And what we’ve seen for Cain is a career 7.8% HR/FB. So what we do is weight his career and regress to the mean to remove the effects of luck as well as possible. Once we do that for Cain, we probably arrive at an expectation for his HR/FB of around 9% or so.
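To make the Cain numbers concrete, here is a rough sketch of regression to the mean as a simple blend of the observed rate and the league rate. It assumes the 11% figure is the league-average HR/FB and skips the season weighting mentioned above; `regress` and `implied_weight` are hypothetical helper names, not part of any projection system.

```python
def regress(observed: float, league_mean: float, weight_on_observed: float) -> float:
    """Blend an observed rate with the league mean.

    A weight of 0 returns the league mean; a weight of 1 returns the observed rate.
    """
    return league_mean + weight_on_observed * (observed - league_mean)


def implied_weight(observed: float, league_mean: float, estimate: float) -> float:
    """Weight on the observed rate that a given regressed estimate implies."""
    return (league_mean - estimate) / (league_mean - observed)


# Cain: career 7.8% HR/FB, assumed league mean of 11%, expectation of about 9%.
weight = implied_weight(observed=7.8, league_mean=11.0, estimate=9.0)
print(f"implied weight on Cain's own record: {weight:.3f}")
print(f"regressed HR/FB: {regress(7.8, 11.0, weight):.1f}%")
```

Under those assumptions, an expectation of 9% puts roughly five-eighths of the weight on what Cain has actually done and the rest on the league average.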

And as I said in my previous article in this long-running discussion, that expectation would change if we have other data (such as scouting or a PITCHf/x study). But unless we have that data, that’s the best we can do.

Concluding thoughts

I think that covers everything I wanted to cover, so if you have any questions or comments, feel free to let me know. I’m sure there will be some of you who will still be skeptical, so feel free to voice your concerns if you are.


18 Comments
Mike Podhorzer
13 years ago

The coin-flipping example is a perfect illustration of this point. This also reminds me of another good example that I learned about a while ago.

You know those gambling ads that promise “guaranteed winners” and give you a phone number to call? Well the way it actually works is that half the callers are told one team and the other half the opposing team. If 1,000 called, this means 500 people will win. The following week, 250 of the previous week’s winners will win again. The next week, 125, and so on and so forth. Several people who win every single week are going to end up thinking that this phone line has a crystal ball, but we know the secret. You could call these multi-week winners Tim Hudson or Matt Cain. Someone is going to win all those weeks, but you just have no idea who it is, and it doesn’t reflect any sort of skill.

Projection systems also provide a good example of this phenomenon. It is rare that any projection system will project a pitcher to win 20 games. However, nearly every year at least one pitcher does win 20. Does that make the projection system poor because it was wrong projecting no one to win 20 games? Of course not. We know some pitcher is likely to win 20, we just do not know which one it is, so projecting none of them to do so yields better projective (is that a word?) accuracy.

keith
13 years ago

Nice work.  Still, it sure is frustrating when the process is seemingly correct and the results are for the birds!

12 Team Roto
11 pts in K/BB
10 pts in WHIP
2 pts in ERA

><

obsessivegiantscompulsive
13 years ago

According to what I read on The Book blog, TangoTiger says that it takes around 7 seasons worth of results for a starting pitcher to have enough “coin flips” to show that his BABIP is legitimately lower than the .300 mean most regress to.  Zito has passed that threshold and Cain will soon, as well, at least for BABIP.

Given that the number of BIP is significantly larger than the number of fly balls, I guess that means most pitchers, except for those who pitch maybe 15+ seasons as a starter, can never “prove” to be significantly below the HR/FB norm, right?

microwave donut
13 years ago

My theory is that Matt Cain has typical R/L splits for a RHP and SF’s home park kills HR’s to right field. However, looking at the HR/FB home/away splits you don’t really see anything but noise.

As for the better in the first/second half point I think it’s easy to forget that these guys aren’t robo-players.  So much of the game is mental and focus can absolutely fade or intensify throughout a season. What about players who are slow starters?  The most common examples (to me) are Mark Teixeira, Adam LaRoche and Alexei Ramirez, and they all held the trend this year.

Will Dan Haren be equally good in the first and second halves next year? I don’t know, maybe. Do I want to bet on it? God no.

Derek Carty
13 years ago

Well put, again, Mike.

I hear you, Keith, and that was a point I meant to include.  It sucks when you’re making the right decisions and getting poor results, but it’s a part of the game we play.

Derek Carty
13 years ago

obsessivegiantscompulsive,
I believe it’s actually 6 years, but to clarify a bit, it’s not that at 6 years something magically happens.  It’s not some magic number that, once we arrive at it, we’re all good.  It’s more of a continuum.  Basically, once we have 6 years’ worth of BIP, we’d regress a pitcher’s BABIP half of the way to the mean.  So let’s say we’re regressing a career .280 BABIP to a mean of .300 (ignoring weighting, aging, etc. for now): at exactly 6 years, our regressed BABIP would be .290.  But that doesn’t mean we can’t regress with less than 6 years of data.  We’d just end up regressing more.  At 3 years of .280 BABIP, we might regress to .295 or so.  For a guy like Zito with a career .275 BABIP over 10 years’ worth of data or so, we can be relatively certain he’s not a true league-average BABIP pitcher.  Of course, he’s getting older, so that will play into it too and perhaps mitigate some of that.

For HR/FB, I’ve run some preliminary tests and have found that it takes roughly 800 Outfield Flies (4 years-ish) to reach the same point that BABIP reaches at 6 years.  I’ll be publishing this kind of stuff for a number of stats sometime soon.
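As a rough illustration of that continuum, the sketch below uses the common shrinkage form n / (n + k), where k is the sample size at which we regress halfway. That functional form is an assumption made for the example, not a description of the exact method behind these numbers, and `regress_to_mean` is a made-up name. Weighting and aging are ignored.

```python
def regress_to_mean(observed: float, league_mean: float,
                    sample: float, half_point: float) -> float:
    """Regress an observed rate toward the league mean.

    `half_point` is the sample size at which the estimate sits exactly halfway
    between the observed rate and the league mean (6 years for BABIP above).
    """
    weight_on_observed = sample / (sample + half_point)
    return league_mean + weight_on_observed * (observed - league_mean)


# BABIP examples from the comment: league mean .300, halfway point at 6 years.
for label, observed, years in (("6 years at .280", 0.280, 6),
                               ("3 years at .280", 0.280, 3),
                               ("10 years at .275", 0.275, 10)):
    estimate = regress_to_mean(observed, 0.300, sample=years, half_point=6)
    print(f"{label}: regressed BABIP {estimate:.3f}")
```

With this form, 6 years lands on .290 as described, and 3 years comes out near .293, in the neighborhood of the ".295 or so" above. For HR/FB the same function would take outfield flies as the sample and 800 as the halfway point.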

Pat
13 years ago

I think you hit the nail on the head. Randomness does factor a lot in it, but it is not completely random because it involves humans. Some love pressure, while others fear it. The reason I used Bush is that he is a different pitcher when he has runners on. It is possible that it is a ton of bad luck, but because of his long track record I would bet against it. Maybe the reason is that he is pitching from the stretch, or the pressure. From a statistics standpoint, someone like Bush should be looked at as a different pitcher when the bases are empty and when there are runners on.

In Dan Haren’s case it is completely possible that it is bad luck, but because of the stark differences in ERA and WHIP in each of the last few years, I think you have to take it as a trend.

Odds are greatly in your favor that this is the case. It could be wrong, but the best way to win is to take the best odds.

Pat
13 years ago

I like the stats. The BABIP in 07 and 08 looks very out of line. I think the question here is when, if ever, do you ignore the underlying stats and go with the results?

I actually would take Hudson for the rest of this year and Haren next year.

Where do you get BABIP, K/9, etc. split stats from?

PS: If anyone here listens to the ESPN Fantasy Focus Podcast, Matthew Berry and Nate Ravitz have the Dan Haren argument all the time.

patrick dicaprio
13 years ago

I have said it before and will say it again: when it comes to this topic you are mostly right, and this is a perfect example of the Texas Sharpshooter Fallacy. Pat, above, talks about pressure, and others talk about stuff that tries to explain and give reasons for things that happen for no reason. There is not necessarily an explanation or a reason why everything happens; randomness is easily the biggest factor in most of life, and fantasy baseball projections are no different.

you get points as an analyst for process, not for individual players. it is a fact that players will regress to their means, but predicting when and who will be the outliers is a fool’s errand.

i am NOT saying, by the way, that things like pressure are not important. they are. but to pretend that we know how a player will produce or how the synapses in his brain fire differently than in non-“pressure” situations, whatever that means, is ridiculous. it is armchair analysis that no self-respecting professional fantasy baseball analyst should undertake.

BobbyRoberto
13 years ago

@Pat,

For the BABIP, K/9 splits, I used Baseball-Reference.com.  I looked up the player, then used the Game Logs tab to look at a specific year.  Then I highlighted the first game and the final game before the All-Star break and got the totals for that stretch of games (this included BABIP).  For K/9, I just did the math based on the total Ks and IPs for the highlighted stretch of games.

obsessivegiantscompulsive
13 years ago

Thank you for clarifying Derek and for the preview on your future article.

BobbyRoberto
13 years ago

Even Haren’s first half/second half splits are wonky.

If you look at ERA and WHIP, there’s a compelling case that he’s better in the 1st half:

2009/1st:  2.01, 0.81
2009/2nd:  4.62, 1.26

2008/1st:  2.72, 0.95
2008/2nd:  4.18, 1.37

2007/1st:  2.30, 1.00
2007/2nd:  4.15, 1.50

If you look further, he looks mostly like the same pitcher each half, except for BABIP:

2009/1st:  .233 BABIP, 8.9 K/9, 1.1 BB/9
2009/2nd:  .315 BABIP, 8.5 K/9, 2.0 BB/9

2008/1st:  .256 BABIP, 8.0 K/9, 1.6 BB/9
2008/2nd:  .375 BABIP, 9.4 K/9, 1.8 BB/9

2007/1st:  .234 BABIP, 7.0 K/9, 2.2 BB/9
2007/2nd:  .357 BABIP, 8.8 K/9, 2.2 BB/9

I think it would be an interesting poll question:

If you had a “Last 5 weeks” fantasy draft today, who would you pick first, Dan Haren or Tim Hudson?

Or

Next year, who will you pick first, Dan Haren or Tim Hudson?

Derek Carty
13 years ago

Count me in for Haren 100%.  When you consider that what’s really wrong with his second-half is BABIP and that he’s basically been bad for about 175 second-half innings, that’s NOTHING in terms of BABIP.  You need 10 times that to account for even half of the inherent variation in BABIP!
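To put a rough number on how noisy BABIP is over that many innings, here is a sketch that treats each ball in play as an independent coin flip with a .300 chance of becoming a hit. Both the .300 true BABIP and the three balls in play per inning are round-number assumptions chosen for illustration, and `babip_standard_error` is a hypothetical helper.

```python
import math


def babip_standard_error(innings: float, true_babip: float = 0.300,
                         bip_per_inning: float = 3.0) -> float:
    """Standard error of observed BABIP from chance alone (binomial model).

    `bip_per_inning` is an assumed round figure, not a measured rate.
    """
    balls_in_play = innings * bip_per_inning
    return math.sqrt(true_babip * (1 - true_babip) / balls_in_play)


for innings in (175, 375, 1750):
    error = babip_standard_error(innings)
    print(f"{innings} innings: about +/- {error:.3f} of BABIP from luck alone")
```

Under those assumptions, 175 innings carries a standard error of about 20 points of BABIP, so a stretch 40 or 50 points away from a pitcher's true level is not especially strange.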

Pat
13 years ago

Thanks Bobby.

I understand that fantasy baseball is a ton of luck and there are plenty of things that happen that cannot be explained.

I don’t think you can remove the human element from the equation. Players in general will regress toward the mean, but it does not mean it is true for everyone.

Why is it impossible to draw reasonable conclusions that a pitcher pitches differently in certain situations? It is nowhere near fool-proof and may involve luck, but I think it is helpful in evaluating a player. What about some closers who pitch better when they are in a save situation as opposed to non-save situations? It is human nature; people react differently in certain situations.

Evaluating players is not as simple as analyzing a coin flip. It works well as a blanket analysis, but to better understand each player, you need to look at other factors.

Derek Carty
13 years ago

Pat,
You ignore the underlying stats when you have legitimate reason to.  You see something in his mechanics or in his approach or in his PITCHf/x data that indicates “Hey, this isn’t the same pitcher he was when he posted these past numbers.”  And even then, you wouldn’t ignore those numbers completely, just alter the way you’re projecting his future performance.

FanGraphs has the splits by month, but not by half.

And like Pat DiCaprio said, of course there is a human element to all of this, but pretending like we know what that element is for each individual player will only get us into trouble.  It’s comforting to be able to reason away troubles (or supposed troubles) by saying “oh, there must be something more going on here,” but that’s all we’re doing – comforting ourselves when we’d be better off holding our ground.

It’s not impossible to draw reasonable conclusions that a pitcher pitches differently in certain situations – it’s just that we have to be careful when doing it.  We have to have a legitimate reason OR enough data to do so.  In all of the cases we’ve used as examples so far, I’ve seen neither.  In the case of Haren’s second half, we have 375 innings to look at (and no other reason besides the raw data).  That’s less than two years of data.  If another pitcher posts an abnormal BABIP over a year and a half, are we all of a sudden saying “There must be something wrong with him!”?  Of course not.  Here’s an example.  From the second half of 2008 through the entire 2009 season, Justin Verlander posted a .330 BABIP.  Were we worried about Verlander coming into this season because of that?  If anyone was, I didn’t hear about it.  And that’s the way it should have been.  But because Haren’s bad innings happened to occur during the second half, we try to rationalize why this pattern is legitimate.  Maybe it is, but unless we have some insight into Haren’s head or some other bit of information, the best we can do is treat them like we would any other set of innings.

And I know we’re just talking semantics here, but everyone regresses.  Everyone.  It’s just a matter of what mean the player is going to regress to.  I imagine you’re arguing that for some players, like Haren in the second half, they’re going to regress to a different mean than they otherwise would.

Pat
13 years ago

I agree with what you are saying. There is not always an explanation (other than luck) as to why a pitcher’s ERA does not match the underlying stats. I know you can get yourself in trouble by trying to find a reason in something that does not have a reason. I think it is a worthwhile cause if used properly.

Like you stated in Haren’s case, I understand that his 1st/2nd half splits are statistically reasonable. The theory you are explaining is that if you could put 2 Dan Harens in MLB, in the exact same situation, it is not unreasonable that 1 Dan Haren could put up the stats he is putting up now while the second one does not have any difference in 1st/2nd half splits.

I agree that the underlying stats should be the basis of looking at a player. It is 100% possible the only reason behind Haren’s 1st/2nd half splits is luck.

I just think Haren’s case is an exception. You are right with the regression to mean. I am arguing that some players regress to a different mean.

With the Verlander example, I remember he dropped in the 2009 rankings because of a bad 08. Also, in 09 even with the high BABIP he posted a 3.45 ERA and 1.18 WHIP, and struck out 269 guys. He had a high BABIP, but hitters had fewer chances to get hits because they struck out so much.

Derek Carty
13 years ago

Fair enough, Pat.  I would be interested to hear exactly why you think Haren is an exception, though.  What leads you to believe that?

Verlander might not have been a good example because his ERA was so good despite a worse-than-average BABIP, but I don’t think it would be hard to dig out a bunch more examples of guys who had an unlucky BABIP for a year and a half.  My point was mostly that over just a year and a half, nearly everyone is willing to write off BABIP as random variation.  Shoot, we’ve got nearly a year of Livan Hernandez BABIP luck and no one would come near him when I tried to trade him in LABR.  But when that kind of luck occurs during “second-half innings” – even if it’s the same number of total innings – suddenly it fits a pattern that we want to believe is legitimate.

Pat
13 years ago

The thing with Dan Haren is that there is a drastic difference in 1st and 2nd half ERA and it has continued for the last several years: his whole career minus 2006. Although the 375 innings are technically random, statistics-wise, they are not completely random because they are taken after Haren has pitched 125 innings or so. Does something happen to Haren after 100 or so innings? Dan Haren is not a coin that has the same odds on the 1st flip as the millionth flip.

As you mentioned, you would need thousands more innings for BABIP to stabilize. Therefore, we will never get to the point where we can say whether the 1st/2nd half splits are a trend or not. I just think when it gets to a certain point, you need to consider that maybe this player is different, but I can’t say that anyone is wrong for saying it is completely luck.

Assume this is Sept. 2009 and you are in the H2H playoffs.
You have to start either CC Sabathia or Dan Haren.
Assume that you like both pitchers the same and that they are in identical situations (vs. the same team, same run support, etc.), with the only difference being who is pitching.
Is there any reason to start Dan Haren? Why take the chance that his bad 2nd halves are bad luck?

The Livan example I don’t think works because he has several years of being bad. He has not pitched well for several years and I can’t see someone paying for this year.
Plus he is 35.