WAR vs. wins

Over at FanGraphs, Dave Cameron took a look at how well WAR and actual wins match up this year:

For 2009, the correlation between a team’s projected record based on their WAR total and their actual record was .83. This is a robust number, especially considering that WAR is almost completely context independent and currently includes some notable omissions – base running (besides SB/CS, which are included in wOBA) and catcher defense are both ignored in the calculations. We also don’t have an adjustment for differences in leagues, so we’re not accounting for the fact that the AL is better than the NL.

Despite these imperfections, WAR still performs extremely well. One standard deviation of the difference between WAR and actual record is 6.4 wins, and every single team is within two standard deviations. Only four teams were more than 10 wins away from their projected total by WAR, with Tampa Bay ending up the furthest away from our expectation (96.6 projected wins, 84 actual wins), and 18 of the 30 teams were within six wins of their projected WAR total.

If I’m reading this correctly, Dave is essentially saying that because a team’s win total and its projected record via WAR are generally very close, it shows that WAR “works”. The commenters at FanGraphs seem to agree with him.

Maybe it is just me, but I don’t understand why a high correlation between WAR and wins would verify the accuracy of the stat. The whole point of WAR is to attempt to separate luck from controllable skills. With a team’s win total being so highly influenced by things such as bad luck on balls in play and timing, you wouldn’t expect WAR to have a high correlation with wins. In fact, as Dave later shows in the article, Pythag record has a higher correlation to win totals than WAR. Does that mean that Pythag works better than WAR? Of course not; it just takes out less of the variance in actual wins than WAR, so it will naturally correlate better.
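For reference, here is a minimal sketch of the Pythagorean expectation mentioned above, using the classic exponent of 2 (refinements like Pythagenpat vary the exponent with the run environment; the run totals below are made up for illustration):

```python
# Minimal sketch of the Pythagorean expected record. The exponent of 2
# is the classic version; the run totals below are invented.
def pythag_win_pct(runs_scored, runs_allowed, exponent=2.0):
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# A team that scores 800 runs and allows 700 over a 162-game season
expected_wins = 162 * pythag_win_pct(800, 700)
```

Because Pythag is built directly from runs scored and allowed, it keeps more of the season's context than WAR does, which is exactly why it correlates better with actual wins.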

Let me be clear: I’m not saying WAR doesn’t work. It absolutely does. FanGraphs’ implementation of WAR, if I’m not mistaken, uses the average linear weights run values of each event for the year. That separates it from other stats like OPS or RC, which don’t necessarily have any empirical meaning, because it will match up perfectly with runs and wins in the aggregate. If you want a metric that shows how well your team would have played that year if timing were taken out of the equation, WAR is your guy.

I’m sure Dave knows all this, which is why it’s confusing to me to see WAR compared to win totals to show its accuracy. Indeed, you would almost rather have WAR correlate poorly with win totals, as it strives to strip away all of the luck associated with them. WAR and wins measure two different things, and saying that the former works because it correlates with the latter makes no sense. A better test of WAR, in my opinion, would be to show how it projects future wins, because you have no reason to expect a team’s good or bad luck to continue.
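As a footnote, the two quantities at issue, the correlation between projected and actual wins and the standard deviation of the gap, are simple to compute. Here is a sketch over made-up team totals (not the actual 2009 data):

```python
import math

# Hypothetical (WAR-projected wins, actual wins) pairs for five teams.
# These numbers are invented for illustration; they are not FanGraphs' data.
teams = [(96.6, 84), (88.0, 91), (75.0, 71), (102.0, 97), (80.0, 86)]
proj = [p for p, a in teams]
act = [a for p, a in teams]

def mean(xs):
    return sum(xs) / len(xs)

def corr(xs, ys):
    # Pearson correlation coefficient
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Sample standard deviation of the projection error (the "6.4 wins" figure)
diffs = [p - a for p, a in teams]
md = mean(diffs)
sd = math.sqrt(sum((d - md) ** 2 for d in diffs) / (len(diffs) - 1))
```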


Comments

  1. Patrick said...

    I, um, hmm.  How best to explain this.

    While we shouldn’t be worrying over one year’s worth of correlation data, isn’t, a priori, a context-independent stat that correlates better with actual wins – over the largest data set available – a better stat than one that doesn’t?

    Doesn’t that mean it’s doing a better job of expressing everything that goes into winning that isn’t luck?

    The only way that a higher correlation to winning could be a bad thing would be a case of overfitting to the data…  Which is avoided both by using a larger data set and by having a stat that is derived from components and not specifically modified to fit to post-hoc Wins data.

    No, correlating to win data is a very good thing…

  2. kds said...

    All good points.  But we have some pretty good estimates of context dependencies and other types of luck. If the correlation between wins and WAR were much worse, say a standard deviation of 20 games, we should be very worried, because that would not match our estimates of luck.  An S.D. of 6.4 looks about right to me.  I would be concerned if it were much more or less.

  3. Nick Steiner said...

    Good point, kds.  I guess I was peeved with Dave using the high correlation as a proxy for how well WAR “performed”.

    BTW, 6.4 games is also the SD of true talent level using the Binomial Distribution.  Interesting coincidence.
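To spell out the coincidence: model a true-.500 team’s 162 games as independent coin flips. The binomial standard deviation of the win total is sqrt(n·p·(1−p)), which works out to about 6.36 wins:

```python
import math

# SD of season win totals for a true-.500 team, treating each of the
# 162 games as an independent coin flip (binomial with n=162, p=0.5).
n, p = 162, 0.5
sd_wins = math.sqrt(n * p * (1 - p))  # about 6.36 wins
```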

  4. Colin Wyers said...

    Patrick, you’re basically right. A strong relationship with observed team wins is a necessary part of the accuracy of individual player WAR.

    It is, however, not sufficient on its own – you can create individual player metrics that are horrible and yet sum to team wins perfectly. (This is exactly how RBIs work – they by definition sum to team runs, because almost every run scored is assigned to a player as an RBI. That doesn’t mean that RBIs are accurately assigned to individual players.)

    So to claim accuracy you have to show that both your process and results are “correct.”

  5. Colin Wyers said...

    You do not care about stripping out luck. What you want to do is show how a player contributed to team wins, assuming he was surrounded by average teammates.

    And, data. I used Rally’s WAR data instead of FanGraphs’ because I already had the data in a database. Using Pythagenpat, RMSE with same-year wins, 1996-2008:

    WAR: 5.61
    Pythag: 3.91

    RMSE with next year’s wins:

    WAR: 11.96
    Pythag: 10.85

    In other words, Pythag typically does a better job of predicting wins (either same-season or y+1) than WAR. Which is fine, because that’s not what it’s there for. But Cameron claimed otherwise on a recent USSM post and it’s stuck in my craw for a few days.
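For anyone who wants to replicate the comparison, here is a sketch of the Pythagenpat side. The exponent form x = ((R + RA) / G) ** 0.287 is one commonly published version; Colin’s exact implementation (and Rally’s WAR data) may differ, and the season lines below are invented:

```python
import math

def pythagenpat_wins(r, ra, g=162):
    # Pythagenpat: the exponent floats with the run environment.
    x = ((r + ra) / g) ** 0.287
    return g * r ** x / (r ** x + ra ** x)

def rmse(predicted, actual):
    # Root mean squared error between projected and actual win totals
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

# Invented (runs scored, runs allowed, actual wins) season lines
seasons = [(800, 700, 90), (750, 780, 77), (650, 720, 70)]
pred = [pythagenpat_wins(r, ra) for r, ra, w in seasons]
act = [w for r, ra, w in seasons]
error = rmse(pred, act)
```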

  6. Ken said...

    This correlation is essentially meaningless in terms of identifying the quality of WAR as a measure. The correlation with net batting average (BA – BAA) is 0.78, yet nobody thinks that batting average is a great statistic. And the correlation with net OBP (OBP-OBPA) is 0.825.

    One can say that WAR works, except that it really doesn’t improve on On-Base Percentage. Everything else that goes into WAR offers essentially nothing in terms of our knowledge of team aggregate talent.

    And as far as individual players go, the relationship between aggregate error and individual error depends on how that error correlates across teams. If individual errors are uncorrelated within a team, then this would be the result for almost any level of error in the WAR statistic.

  7. HH said...

    I think Dave was trying to prove to himself that the Mariners really are better than their Pythag W-L indicates. He wrote an article about the Mariners’ WAR value and tried to dismiss the Pythag based on that concept. Then he went a step further with the FanGraphs article.

  8. Nick Steiner said...

    Patrick, I agree that WAR should have some correlation to actual wins, so that we know it isn’t too far off base.  However, that correlation in itself is meaningless unless we can compare it to other metrics.

    The problem I have with Dave’s article is that he essentially says WAR is good because it has a high correlation with wins.  As Ken points out, that is definitely not correct, as you could construct a metric that correlates better with wins by using OBP. 

    Maybe it’s just semantics, but when Dave says things like “WAR still performs extremely well”, based solely on the comparison to wins, that really eats at me.  Using wins as a judge for how “good” WAR is will lead people to wrong conclusions.  Just look at some of the comments at FanGraphs:

    http://www.fangraphs.com/blogs/index.php/war-it-works/#comment-101227

    Is WARP better than WAR because it correlates better with actual wins?  Probably not; there is a lot of evidence that WAR is better than WARP. 

    I think Shawn Hoffman has it right here:

    http://www.fangraphs.com/blogs/index.php/war-it-works/#comment-101252

  9. Adam Guttridge said...

    A couple points worth making here, and I guess I’m responding to the Fangraphs comments as much as those here… but similar issues have been raised.

    A) I think, as a rule, we still have a bit of a mental blind spot for the use of correlations. Colin is right that WAR is not TRYING to predict team wins, but it should be a natural byproduct given a large enough timescale. Still… whether the correlation is .92, .94, or .87 is not any answer in and of itself.

    But with WAR, it would be all thrown out of whack by something as ‘simple’ as using the wrong replacement level, overvaluing defense, or not including baserunning (all things FanGraphs is guilty of, IMO). That’s why I don’t like people talking about WAR as if it is one stat. My version of WAR is not FanGraphs’ version, and their version is not someone else’s version.

    B) One major, obvious reason their current construction will stink at predicting win totals in an upcoming year: you don’t use WAR to predict WAR. You project the components of WAR (offense, defense, baserunning) individually and then sum them to predict a 2010 WAR. The difference? Regression. You need to regress the hell out of defense, and you should probably do some sort of BABIP normalization. Check my recent JJ Hardy article: http://www.hardballtimes.com/main/blog_article/projecting-jj-hardy/

    If one were to predict his 2010 WAR simply by weighting his last 3 seasons of WAR, you’d be missing a lot.
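(For the curious, the regression step Adam describes can be sketched as shrinking an observed rate toward the league mean, weighting the sample against a stabilization constant. The constant and the numbers below are purely illustrative, not published values.)

```python
# Shrink an observed (noisy) rate toward the league mean. n is the number
# of observed seasons (or chances) and k is how many "phantom"
# league-average seasons to mix in; both values below are made up.
def regress_to_mean(observed, league_mean, n, k):
    return (observed * n + league_mean * k) / (n + k)

# A +15 runs defensive season, regressed hard (2 phantom average seasons)
estimate = regress_to_mean(observed=15.0, league_mean=0.0, n=1, k=2)
```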

  10. Colin Wyers said...

    Adam – using the wrong rep-level should have no effect on team-level correlations (which is why BPro’s WARP does so well in them). I’m curious – where do you think the rep-level should be set for WAR? (Fangraphs uses about .290, I think Baseball-Projection is closer to about .330.)

    And why do you think defense is overvalued? The methodology is sound, although certainly the uncertainty is higher in our estimates of defense.

  11. Nick Steiner said...

    Yes Adam, we know that it’s necessary to include things like roster turnover, regression, BABIP luck, etc.  However, if you only had one season to go by in terms of projecting a player’s true ability, you would use the metric that best captures true skill in that year, and that metric will correlate better with future performance than other, lesser metrics in that category. 

    So Colin showing that WAR is less predictive than Pythag pretty much means that WAR isn’t a great team stat.

  12. Dave Studeman said...

    If correlation with team wins is important, then Win Shares is better than WAR.  Seriously, I agree that it’s easy to build a metric that correlates with team wins.  I haven’t read Dave’s post, so I’m not commenting specifically on his point.

  13. Patrick said...

    Dave,

    I’m not sure who you’re responding to, but by suggesting that we use win shares, you’re knocking down a straw man or missing the point.

    In using win shares, you aren’t correlating to anything – You’re just USING wins.  That’s not the same thing.
    We’re talking about constructing a metric that then in some way models real wins.

    Totally different and I think you know it.
    —-

    Excuse me!

    “Is WARP better than WAR because it correlates better with actual wins?  Probably not, there is a lot of evidence that WAR is better than WARP.”

    Nick, what sort of evidence?

    And to those who made the point about WAR being for individual contributions:

    Absolutely.  The question is the purpose of WAR, which is of course, primarily to determine the relative contributions of different team members.  But if by adding defense and everything else in to the system, we can’t correlate any better to actual wins than team OBP (which I believe I read above)…  Then haven’t we failed to add anything?

    Doesn’t that essentially invalidate all of those fancy additions?

    If we’ve reduced the precision of our estimate by adding them, then shouldn’t we not have added them in the first place?

    Perhaps I’m wrong in thinking that team OBP is essentially implicit in team WAR calculations.  I know it’s not directly in there since WAR is calculated on a per player basis – but from an information theory perspective, all of the same information (the OBP of every member of the team) goes in, as well as quite a bit more.

    If we can’t do better at predicting team wins with all that information…  Haven’t we done something wrong?

    But again, perhaps the fact that WAR is more about giving individual credit for the team wins covers this sin.

    But how, then, do we judge that WAR as a stat is working?  If the numbers are correct.

    What evidence is there that WAR is better than WARP? etc.

    I think you guys have done a lot of beautiful thinking and I bet all of my questions have been answered – no doubt very thoroughly by some very good brains – but that doesn’t mean I’m not curious to ask the questions!

    Thanks.

  14. Patrick said...

    Err, this isn’t very clear:

    “But how, then, do we judge that WAR as a stat is working?  If the numbers are correct.”
    How do we judge if the numbers are correct for individual players?  Sorry.

  15. Nick Steiner said...

    I meant for individual players, Patrick.  WARP uses a replacement level that is too low, it undervalues the walk, and it uses a defensive metric inferior to UZR (although WAR doesn’t necessarily have to use UZR; FanGraphs’ does, though).

    I think Colin could explain to you with some more proof why FanGraphs WAR is almost certainly a more robust statistic than WARP.

  16. Patrick said...

    Nick,

    OK, that makes sense.  For individual players.

    So…  How do we define UZR as superior?  I’m not saying it isn’t, but incorporating more components isn’t the same thing as being better!

    I wonder – is there a reasonably detailed explanation of UZR hanging around, and explanations of why it’s good and what work was done to show that?

    Too many very smart, thorough people have a lot of faith in it for me to think they haven’t done the work.  But I think I might go and see if I could find some of it.

    I’m very curious to learn how they did it.  It seems like a sticky sort of problem.

    And all right.  Thanks!

  17. Patrick said...

    “I’m very curious to learn how they did it.”

    “It” being optimized UZR and found benchmarks against which to optimize and refine it.

  18. Nick Steiner said...

    Here are some links to the methodology:

    http://www.baseballthinkfactory.org/files/primate_studies/discussion/lichtman_2003-03-14_0/

    http://www.baseballthinkfactory.org/files/primate_studies/discussion/lichtman_2003-03-21_0/

    These are written by MGL and detail the methodology.  Here is a nice article by David Gassko in which he takes a look at some of the major defensive stats:

    http://www.hardballtimes.com/main/article/evaluating-the-evaluators/

    As far as I know, there haven’t been any conclusive studies that show UZR to be the best one (I don’t even really know how you would do that).  Much of the validity of UZR relies on its methodology, which is very solid.

  19. Colin Wyers said...

    Patrick,

    We already have many ways to measure team wins, most notably team wins. The fact of the matter is that we can directly observe team wins – they’re pretty easy to study.

    WAR is an attempt to INDIRECTLY observe how individual players contribute to team wins. Correlation with team wins is necessary to that, but it is insufficient to establish the accuracy of an individual player metric.

    In other words – an “uberstat” that correlates poorly is obviously a bad stat, but one that correlates well is not necessarily a good stat.

  20. Dave Studeman said...

    I’m not sure who you’re responding to, but by suggesting that we use win shares, you’re knocking down a straw man or missing the point.

    My point, Patrick, is that correlation to wins means nothing in and of itself. I am not knocking down a straw man nor am I missing the point.

  21. Patrick said...

    Dave,

    It means a lot – It’s a necessary but not sufficient condition for a good individual contribution stat.

    If a stat like WAR doesn’t correlate reasonably well with wins, then it can’t be getting individual players’ contributions right.

    Correlating reasonably with wins doesn’t mean you have a good stat for measuring individual contributions.  Not correlating reasonably well with wins means you do NOT have a good stat for measuring individual contributions.

    How can we be correctly measuring relative individual contributions if the total they come to doesn’t correlate with wins?  (It could be that they come to a number on a different scale, but that’s why we say correlates…  And we can scale it to wins, then.)

    Simple: We can’t.  That’s why it matters that these stats correlate at least OK with wins.  All other things being equal, correlating better with wins is better.  They aren’t equal, of course!  And it’s possible that in doing a better job of teasing out individual contributions, we lose some correlation with overall wins.  OK.  That happens…  but…  It isn’t a GOOD thing.  It’s a – very slightly and probably outweighed by other things – bad thing.

    WAR is a better stat than WARP.  But that it correlates worse with real wins is a point against it.  Other points in its favor are stronger, I hear.  But that doesn’t mean that it correlating worse with wins is a GOOD or even a neutral thing.  It’s just a small bad one.

  22. Patrick said...

    Colin,

    I feel really stupid!  I just repeated your post, less eloquently and with a lot more words.  Oops.

    Very succinct and I agree completely, as I noted in my reply to Dave Studeman.

    Thanks!

    And Nick…

    Thanks!  I’m going to eat those up, I really appreciate the links.
