Tool: Basically Every Hitting Stat Correlation

With this tool, you have the ability to compare any hitting stat in the field of play (via Joel Kramer).

By popular demand (OK, one guy asked for it), my first offering in the re-envisioned THT is this batting version of my pitching statistic correlation tool (newer version here). This tool will allow you to see, both graphically and in terms of a correlation figure, how any two of FanGraphs’ batting statistics collectively relate to each other. With it, you’ll even be able to compare a group of players’ stats in one year to their stats in a different year. Comparing a stat in Year 0 to a stat in Year 1, for example, is a good way to gauge how predictive the first stat could possibly be of the other (just remember, correlation does not necessarily imply causation).

Without further ado, here’s the tool:

The white cells you see in the tool are the ones that you should be playing with. The Statistic and Year cells can be changed either by drop-down lists or by typing the name of the statistic directly (in the web app, it should help you narrow your choices when you start typing). Data should be entered directly into the other white cells.

As for the filters, the default setting considers a batter’s season only if they have 300 or more plate appearances in that season; you can set that as low as 100 PA, or higher if you’d like. The default year range is 2007-2013, and this can also be changed. Keep in mind, though, that these years set the range of Year 0s, so you should have Stat 1 set to Year 0, or else you’ll be excluding some data you probably didn’t mean to. “Year 0” implies the present season, “Year 1” the next season, and “Year -1” the previous season. The three filter categories at the bottom each have drop-down lists, allowing you to simultaneously filter by three extra statistics of your choosing.
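
To make the Year 0 / Year 1 pairing concrete, here is a minimal Python sketch of how such a filter might work. The rows, the function name pair_seasons, and the choice to apply the PA minimum to both seasons of a pair are assumptions for illustration; the tool's actual internals are not described here.

```python
# Hypothetical rows: (player, season, plate appearances, stat value).
rows = [
    ("A", 2011, 620, 0.310),
    ("A", 2012, 580, 0.295),
    ("A", 2013, 150, 0.280),  # falls below the default PA minimum
    ("B", 2012, 450, 0.330),
    ("B", 2013, 510, 0.301),
]


def pair_seasons(rows, offset=1, min_pa=300, first_year=2007, last_year=2013):
    """Pair each qualifying Year 0 season with the same player's Year `offset` season.

    Assumption: the PA minimum is applied to both seasons of a pair, while the
    year range restricts Year 0 only.
    """
    by_key = {(player, year): (pa, value) for player, year, pa, value in rows}
    pairs = []
    for (player, year), (pa, value) in sorted(by_key.items()):
        if not (first_year <= year <= last_year) or pa < min_pa:
            continue  # Year 0 season does not qualify
        other = by_key.get((player, year + offset))
        if other is None or other[0] < min_pa:
            continue  # no qualifying season at the requested offset
        pairs.append((player, year, value, other[1]))
    return pairs


pairs = pair_seasons(rows)  # Year 0 stat vs. Year 1 stat
```

With offset=-1 the same function pairs each season with the previous one, which is why the year range is best thought of as the range of Year 0s.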

A quick refresher on correlations: they range between -1 and 1. A correlation of 1 means that when one stat goes up, so does the other, in a straight line on a graph like the type you see above. Correlate a stat to itself in the same year and you’ll see a correlation of 1; for something more useful, try to correlate same-year OPS and wOBA – it should be 0.993, and pretty dang close to a straight line.

A correlation of -1 should also appear as a straight line, except ninety degrees off from a correlation of 1; the two stats move in opposite directions. You’d get this if you correlated a stat to the negative of itself, for some strange reason. For something more practical, try K% vs. Contact% in the same year, which should come in at a very strong -0.888.

A correlation of 0 suggests that there’s probably no linear relationship between the two stats, although it is possible for there to be an interesting relationship that escapes the correlation calculation. The graph will be harder to fool, however, so you may want to keep an eye out for strange patterns you see on it.
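
The three cases above can be checked with a small Python sketch of the standard Pearson correlation. The toy numbers are made up for illustration and are not FanGraphs data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [-2, -1, 0, 1, 2]
same = pearson(xs, xs)                    # a stat against itself: 1
opposite = pearson(xs, [-x for x in xs])  # a stat against its negative: -1
u_shape = pearson(xs, [x * x for x in xs])  # a clear U-shaped pattern: 0
```

The last case is the kind of relationship that escapes the correlation figure but is obvious on a graph.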

The Confidence Level box can also be changed. By default, it’s set to provide the estimated boundaries between which the true correlation is 95% likely to lie. You’ll see this below it.
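
How the tool computes these boundaries isn't stated here. One common approach is the Fisher z-transformation, sketched below in Python as an assumption rather than a description of the tool. The sample size of 1,000 is hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation via Fisher's z.

    Assumes roughly bivariate-normal data and n > 3 paired observations.
    """
    z = atanh(r)                # transform r to the z scale
    std_err = 1 / sqrt(n - 3)   # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return tanh(z - crit * std_err), tanh(z + crit * std_err)


# -0.386 is the PU% vs. next-year BABIP figure; n=1000 is a made-up sample size.
low, high = correlation_interval(-0.386, 1000)
```

Raising the confidence level widens the interval, and a larger sample narrows it.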

An Exercise in Batted Ball and BABIP Correlational Analysis
By default, you’ll see a comparison on the tool between batters’ PU% in one year and their BABIP in the next. PU%, if you’re confused, is Pop-Up percentage, my unofficial name for infield fly balls per batted ball (batted ball being defined as FB+LD+GB), as opposed to the official stat IFFB%, which is infield fly balls per fly ball. What you’ll notice is that PU% does indeed appear to be fairly predictive of BABIP, in that batters who pop the ball up a lot in one year will tend to have a low BABIP in the next (the correlation is -0.386 in the default sample). Makes sense, right? Of course, it helps a lot that PU% is a fairly predictable stat, with a year-to-year correlation around 0.638, as you can see. For comparison, LD%—line drives per batted ball—has only a 0.366 YTY correlation, while BABIP’s is 0.370. To summarize:

Correlation with BABIP in Year:
Statistic | 0 (Same Year) | 1 (Next Year) | YTY Correlation (with itself)
PU%       | -0.468        | -0.386        | 0.638
OFFB%     | -0.262        | -0.213        | 0.754
LD%       | 0.418         | 0.187         | 0.366
GB%       | 0.192         | 0.226         | 0.788
FB%       | -0.356        | -0.288        | 0.789
IFFB%     | -0.416        | -0.350        | 0.555
BABIP     | 1             | 0.370         | 0.370
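
For reference, the difference between PU% and IFFB% described above comes down to the denominator. A minimal Python sketch follows, with hypothetical batted-ball counts and function names of my own choosing; it assumes FB already includes infield flies, as on FanGraphs.

```python
def pu_pct(iffb, fb, ld, gb):
    """Pop-up rate: infield fly balls per batted ball (FB + LD + GB)."""
    return iffb / (fb + ld + gb)


def iffb_pct(iffb, fb):
    """Official IFFB%: infield fly balls per fly ball."""
    return iffb / fb


# Hypothetical season: 150 FB (18 of them infield flies), 90 LD, 210 GB.
iffb, fb, ld, gb = 18, 150, 90, 210
pu = pu_pct(iffb, fb, ld, gb)   # 18 / 450
official = iffb_pct(iffb, fb)   # 18 / 150
fb_rate = fb / (fb + ld + gb)   # FB% on the same batted-ball basis
```

Under these definitions PU% is simply IFFB% multiplied by FB%, so a batter with few fly balls can carry a high IFFB% and still a low PU%.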

So, although LD% is a significant factor in same-season BABIP, its relative unpredictability makes it a much less reliable indicator of true-talent BABIP skills than PU%. This is also the case with pitchers, whose BABIPs are of course even less predictable.

If you’re curious, here are 2013’s relevant facts for each basic type of batted ball, straight from the league splits on FanGraphs:

Batted Ball Statistics, 2013
Type         | BABIP | AVG   | SLG   | ISO   | wOBA
Line Drives  | 0.683 | 0.685 | 0.878 | 0.193 | 0.681
Ground Balls | 0.232 | 0.232 | 0.250 | 0.018 | 0.213
Fly Balls    | 0.124 | 0.213 | 0.616 | 0.403 | 0.346

The low BABIP of fly balls in general might lead you to believe they are less desirable for a hitter than ground balls. Don’t forget, though, that home runs are excluded from consideration in BABIP, meaning the batting average of a power-hitting fly ball hitter probably isn’t going to suffer as much as you might think. Clearly line drives get the best results, being low-risk with very high rewards. Meanwhile, ground balls are medium-risk, low-reward, and fly balls are high-risk, high-reward; on average, though, FBs are preferable to GBs, as wOBA demonstrates. That’s not even taking into account the increased risk of double plays that comes with ground balls.

As a little bonus, here’s something I queried off of FanGraphs’ top-secret database: a more in-depth breakdown that uses more distinct batted ball types:

Batted Ball Statistics, 2013
Type    | BABIP | AVG   | SLG   | ISO   | wOBA  | 1B%   | 2B%   | 3B%  | HR%
IFFB    | 0.004 | 0.004 | 0.005 | 0.001 | 0.004 | 0.3%  | 0.1%  | 0.0% | 0.0%
OFFB    | 0.049 | 0.155 | 0.531 | 0.376 | 0.288 | 0.8%  | 2.8%  | 0.7% | 11.1%
FlinerF | 0.280 | 0.362 | 0.889 | 0.528 | 0.530 | 7.7%  | 15.5% | 1.5% | 11.4%
FlinerL | 0.627 | 0.631 | 0.870 | 0.240 | 0.652 | 42.9% | 17.5% | 1.6% | 1.1%
LD      | 0.746 | 0.746 | 0.883 | 0.138 | 0.715 | 61.4% | 12.6% | 0.6% | 0.0%
GB      | 0.232 | 0.232 | 0.250 | 0.018 | 0.213 | 21.5% | 1.6%  | 0.1% | 0.0%

In this classification system, the two types of “Fliners” are somewhere between fly balls and line drives, and there’s no overlap between the classifications. Relating these to what you see on FanGraphs: IFFB, OFFB, and FlinerF are all counted towards FB, while FlinerL and LD are counted towards LD.

Here, OFFBs are the really high outfield flies which—if they don’t clear the fences—are going to be caught 95.1% of the time. But home runs do occur on 11.1% of these high outfield flies, so you can’t discount them. Remember that these numbers are just averages; for a powerless batter, OFFBs are likely going to be a really bad thing; for a power hitter, they might actually be good. And try not to be confused—in this article’s correlation tool, FlinerFs are included as part of “OFFB.” I’m just not sure if it’s alright for me to let the details of this system out of the bag, unfortunately.

OK, now forget I mentioned all that stuff about fliners, because I’m going to be referring to the standard FanGraphs batted ball classifications from now on.

Back to BABIP: the main point of it is not to directly value a player, but to be an indicator of how lucky the player was. Skill does come into play, however, especially in the case of batters. But let’s take a look at how batted ball types correlate with a bonus stat I added into the correlation tool: Hits/Batted Ball, (let’s call it H/BatBall for short) which are hits divided by the sum of fly balls, line drives, and ground balls.

H/BatBall Correlations
Statistic | H/BatBall, Year 0 (Same Year) | H/BatBall, Year 1 (Next Year) | YTY Correlation (with itself) | BABIP, Year 0 (Same Year) | BABIP, Year 1 (Next Year)
PU%       | -0.343 | -0.265 | 0.638 | -0.468 | -0.386
OFFB%     | 0.006  | 0.030  | 0.754 | -0.262 | -0.213
LD%       | 0.289  | 0.104  | 0.366 | 0.418  | 0.187
GB%       | -0.034 | 0.004  | 0.788 | 0.192  | 0.226
FB%       | -0.089 | -0.046 | 0.789 | -0.356 | -0.288
IFFB%     | -0.370 | -0.296 | 0.555 | -0.416 | -0.350
BABIP     | 0.894  | 0.315  | 0.370 | 1.000  | 0.370
H/BatBall | 1.000  | 0.420  | 0.420 | 0.894  | 0.315
HR/FB     | 0.466  | 0.323  | 0.706 | 0.075  | 0.038
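
To see why H/BatBall and BABIP can diverge for a power hitter, here is a small Python sketch of the two rates. The BABIP formula is the standard one, (H - HR) / (AB - K - HR + SF); the H/BatBall function follows the definition given above; the season line is hypothetical.

```python
def babip(h, hr, ab, k, sf):
    """Standard BABIP: non-homer hits per ball in play."""
    return (h - hr) / (ab - k - hr + sf)


def h_per_batted_ball(h, fb, ld, gb):
    """H/BatBall as defined above: hits per fly ball, line drive, or ground ball."""
    return h / (fb + ld + gb)


# Hypothetical power hitter: 160 H, 30 HR, 550 AB, 110 K, 5 SF,
# with 160 FB + 95 LD + 190 GB = 445 batted balls.
season_babip = babip(h=160, hr=30, ab=550, k=110, sf=5)
season_hbb = h_per_batted_ball(h=160, fb=160, ld=95, gb=190)
```

The 30 home runs count as hits in H/BatBall but are removed from both sides of BABIP, which is why HR/FB matters so much more to the former.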
HR/FB 0.466 0.323 0.706 0.075 0.038

So, with home runs back in the equation, most of the predictiveness of the batted ball types (when it comes to the chance of getting a hit on a batted ball) completely disappears, except for popups and maybe line drives (a little bit). Also notice that HR/FB, while apparently useless for BABIP, is an important predictor of next-year H/BatBall. Not surprisingly, HR/FB is also a good predictor of wOBA (0.444 YTY correlation).

There are some interesting interactions here that take a multiple regression to weed out, though. Remember how I just said HR/FB is apparently useless for BABIP? Regression begs to differ; it outputs a formula for expected next-year BABIP of:

xBABIP = 0.083*HR/FB + 0.1*LD% – 0.55*PU% – 0.013*OFFB% + 0.007*Spd*GB% + 0.283

This formula has a 0.437 correlation with next-season BABIP, and 0.573 with same-season BABIP. More details on the factors:

Predictive Factors Of BABIP
Statistic | Coefficient | Std Error | t Stat  | P-value   | 50% Lower | 50% Upper | 95% Lower | 95% Upper
Intercept | 0.283       | 0.011     | 24.758  | 6.50E-110 | 0.275     | 0.290     | 0.260     | 0.305
LD%       | 0.100       | 0.034     | 2.932   | 0.003432  | 0.077     | 0.123     | 0.033     | 0.167
PU%       | -0.546      | 0.053     | -10.325 | 5.14E-24  | -0.582    | -0.510    | -0.650    | -0.442
Spd*GB%   | 0.007       | 0.001     | 6.373   | 2.63E-10  | 0.007     | 0.008     | 0.005     | 0.010
OFFB%     | -0.013      | 0.019     | -0.666  | 0.505428  | -0.025    | 0.000     | -0.050    | 0.025
HR/FB     | 0.083       | 0.017     | 4.866   | 1.29E-06  | 0.072     | 0.095     | 0.050     | 0.117
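
The interval columns in the table can be roughly reproduced from the coefficient and standard error alone. The Python sketch below uses a normal approximation, which is my assumption (the regression output presumably used a t distribution, nearly identical at this sample size), and works from the rounded values printed in the table, so the last digit can differ slightly.

```python
from statistics import NormalDist


def coefficient_interval(coef, std_err, confidence):
    """Normal-approximation confidence interval: coef +/- z * std_err."""
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return coef - crit * std_err, coef + crit * std_err


# PU% row, using the rounded table values.
pu_coef, pu_se = -0.546, 0.053
pu_t = pu_coef / pu_se  # roughly -10.3; the table's -10.325 uses unrounded inputs
pu_50 = coefficient_interval(pu_coef, pu_se, 0.50)
pu_95 = coefficient_interval(pu_coef, pu_se, 0.95)

# OFFB% row: the 95% interval straddles zero, so its sign is uncertain.
offb_95 = coefficient_interval(-0.013, 0.019, 0.95)
```

An interval that includes zero, as for OFFB%, is what a large P-value looks like in interval form.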

Translation: OFFB% probably doesn’t matter, but the other factors pretty certainly do, especially PU%, followed by Spd*GB% (well, Spd itself works almost as well, leaving GB% out entirely), then HR/FB, then LD%. So, you can cut out OFFB% to make:

xBABIP = 0.08*HR/FB + 0.1*LD% – 0.56*PU% + 0.008*Spd*GB% + 0.278

…which is practically equally good, with a 0.436 correlation to next-year BABIP.
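
For anyone who wants to plug numbers in, the full and reduced formulas above can be written as two small Python functions. The function names are mine, and the units are an assumption: rate stats are entered as fractions (0.20 for 20%) and Spd on its usual FanGraphs scale. The input line is hypothetical, not a real player.

```python
def xbabip_full(hr_fb, ld, pu, offb, spd, gb):
    """Full formula, including the OFFB% term."""
    return (0.083 * hr_fb + 0.1 * ld - 0.55 * pu
            - 0.013 * offb + 0.007 * spd * gb + 0.283)


def xbabip_reduced(hr_fb, ld, pu, spd, gb):
    """Reduced formula with OFFB% dropped."""
    return 0.08 * hr_fb + 0.1 * ld - 0.56 * pu + 0.008 * spd * gb + 0.278


# Hypothetical batter: 10% HR/FB, 20% LD, 3% PU, 30% OFFB, Spd 4.0, 45% GB.
full = xbabip_full(hr_fb=0.10, ld=0.20, pu=0.03, offb=0.30, spd=4.0, gb=0.45)
reduced = xbabip_reduced(hr_fb=0.10, ld=0.20, pu=0.03, spd=4.0, gb=0.45)
```

For this made-up line the two versions land within a point of each other (about .3035 and .3036), in keeping with how little the OFFB% term contributes.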

It might also be a good idea to add current BABIP itself to the equation, to possibly help capture that certain je ne sais quoi about a batter’s BABIP, if simply predicting the next year is the goal. Handedness is likely significant as well. But I’ll save that for another time.

Well, hopefully I’ve given you all enough to play with and to think about for today. Tell us in the comments if you find out something interesting from your experiments!


Comments

  1. Vincent Jones said...

    “Good God” = what a non-rocket scientist says after hearing a rocket scientist speak. LOL

    I’m really happy I stumbled upon this. The examples you provided are EXACTLY what I was gathering info on to figure out the past couple of days. I expected I’d have to do the work myself, but you did a lot of it for me and gave me a tool to do the rest. It made me say Good God too, but I’ll get all of what you said there figured out eventually. Thanks! :)

  2. bob said...

    Plotting OBP vs Age, same year, I don’t see any evidence that performance declines with age. The highest OBP on the chart is someone over 40. What stat should I be looking at to see the alleged decline with age that people seem to always talk about?

    • said...

      That point you’re referring to is Barry Bonds in 2007, FYI — 42, with a 0.480 OBP. So, a bit of an outlier there.

      Yes, you’re right — there appears to be little to no overall correlation between age and OBP. But you have to keep survivorship bias in mind; the players still getting a significant number of PAs in their late 30s and beyond tend to be doing so because they’re still pretty good; all the players who had to retire or get too few PAs to qualify aren’t accounted for here.

      What is probably a more appropriate basis for that sort of analysis with this tool is to look at OBP in year 0 vs. OBP in all the surrounding years, while using age as a filter; that way, you’re looking at the performance of individual batters over time, rather than the overall characteristics of all the players in a given year. It’s kind of subtle, but if you set the age filter to, say, 30-50, then you’ll see the slope of the bottom left regression equation on the graph will be less than one for future years (meaning future OBP will be lower) and greater than one for previous years (meaning OBP was greater in previous years).

      • said...

        Better yet would be to download the spreadsheet, insert a column next to OBP in the ‘Data’ sheet, and divide the player’s OBP by the league OBP in that season (by creating a table that contains the league OBP in each year and doing a VLOOKUP on it). OBP relative to league average would make a better basis for the comparisons than OBP itself, as OBP has been on the decline since 2007 (probably due to increasing K rates, mainly).

  3. Ben Denissen said...

    This is gold. Absolute gold. After building financial models and regression tools for my firm for the last 2 years I’ve wanted so bad to have one at my disposal for baseball but didn’t have the dataset/willpower after 5pm to get it done. THANK YOU for doing this. I’m already overturning my previous-thought assumptions.

  4. birdwatcher said...

    Steve,

    Phenomenal work – a clarification please. Is HR/FB based on all FBs or just OFFB (so, excluding IFFB). Also, are total season speed scores published anywhere for all players ?
    Thanks.

  5. birdwatcher said...

    Thanks for the clarification. I agree – it should be HR/OFFB. Using total FB probably double counts infield flies since their negative impact should already be accounted for in the PU category. OK, so a new homework for you ??

    • said...

      See the tab at the bottom — between ‘Main’ and ‘Calcs’? That’s where you can add whatever stats you want, and they’ll then show up as options in the drop-down lists on the ‘Main’ tab.

  6. Daniel Brim said...

    Is there any way to include Jeff Zimmerman’s fly ball distance (from baseballheatmaps.com) into these correlation tools (both pitcher and hitter)?

    • said...

      Great idea! I stuck ‘FB Distance’ and ‘FB Angle’ in this one just now, with what I could gather from Jeff’s site. There’s some missing data there, but Jeff is going to send me his data when he gets the chance, and I’ll update it.

      • said...

        OK, Jeff was kind enough to send me his official fly ball distance data for batters today. I’ve added the following:

        FB Distance: the average distance of the batter’s fly balls and home runs

        FB Dist +1.5 stdev: 1.5 of the batter’s standard deviations above their mean fly ball distance, a.k.a. the 93.3rd percentile of their fly ball distance (assuming normal distribution). In other words, this is the theoretical borderline past which 6.7% of their fly balls should be hit further than this. It’s kind of arbitrary, but is an indicator of how far the ball might go when they really hit it well.

        FB Angle: the average angle of the batter’s fly balls, with -45 being the left field line, 0 being dead center, and 45 being the right field line.

        FB Angle (abs): the average of the absolute values of a batter’s FB Angle. A batter low in this stat tends to hit fly balls more towards center field.

        http://www.baseballheatmaps.com/graph/battedballdist.php
        http://www.baseballheatmaps.com/graph/leaderboard.php

  7. Grandpa Boog said...

    Interesting, but at age 88 I do not comprehend it. The most important stat to me is the Game-Winning RBI, the one that put the team ahead to stay. Or it could be broken down to “RBI’s That Put His Team in a Tie or Ahead.”

    –Stay tuned.

  8. Chris B said...

    Nice article! Stat question though ; when you created these formulas at the end did you do any kind of forward or backwards selection? Meaning were all these predictor variables reasonably independent of the other predictors? I was curious why HR/FB would be included in the model if before it showed a very little correlation. Would it be correct to assume that while the correlation was low, it accounted for a unique part of the xBABIP variance? Thanks!

    • said...

      Thank you! Great question. Well, my decision to try HR/FB out in the regression was largely based on the intuition that power would probably make a difference to BABIP, with HR/FB being one of the better proxies for power that are available (though I might want to try one of the newly-added fly ball distance stats instead).

      I believe the explanation for why HR/FB apparently is predictive of BABIP in the multiple regression despite no direct correlation is this:

      HR/FB’s correlation vs. next year’s…
      GB%: -0.32
      Spd: -0.29
      LD%: -0.15
      FB%: 0.36

      So, it’s the fact that the big home run hitters tend to be slower and hit more low-BABIP-type batted balls (more FB, fewer GB and LD) that hides the fact that power in and of itself is actually beneficial to BABIP. If you can combine speed and power (e.g. Mike Trout), your BABIP will likely be pretty high.

  9. said...

    Stephen,
    I ran your xBABIPs on last year’s data and they all seem very low. For example, Miggy was at .300 even though his BABIP last year was .356. The old xBABIP formula (listed below) is much closer at .344. If I add .330 at the end of your equation (vs. .283/.278 depending on which equation in your post) it outputs .347, which is much closer to his actual BABIP and old xBABIP (as well as everyone else’s).

    Am I missing something or did you come up with the same results?

    Thanks!

    old xBABIP formula:
    xBABIP = 0.392 + (LD% x 0.287709436) + ((GB% – (GB% * IFH%)) x -0.152 ) + ((FB% – (FB% x HR/FB%) – (FB% x IFFB%)) x -0.188) + ((IFFB% * FB%) x -0.835) + ((IFH% * GB%) x 0.500)

  10. Jesse said...

    Well, I’m not sure if I am happy or not that I stumbled across this site. If staying on a site for more than three hours at a time is good, then I’m guilty.

    Just a question I hope someone can answer: In terms of all the stats presented here and on fangraphs, which 10 or 15 (from first to 15th) are the most accurate predictors on how a hitter will do from day to day? I am doing research and have come across various answers. Some say success against a certain pitcher is very important, others say it’s too small a sample size (even with >50 at bats) to rely on.

    What I plan to do is once I figure out the most to least important stats for determining “success” of a batter, is to multiply the stats by weighted factors. The top influential stats will bear more heavily into the final factor, and the lower rungs have less input.

    Any response would be greatly appreciated. Thanks for a great resource!
    J
