Without further ado, here’s the tool:
The white cells you see in the tool are the ones that you should be playing with. The Statistic and Year cells can be changed either by drop-down lists or by typing the name of the statistic directly (in the web app, it should help you narrow your choices when you start typing). Data should be entered directly into the other white cells.
As for the filters, the default setting considers a batter’s season only if they have 300 or more plate appearances in that season; you can set that as low as 100 PA, or higher if you’d like. The default year range is 2007-2013, which can also be changed. Keep in mind, though, that these years set the range of Year 0s, so you should have Stat 1 set to Year 0, or else you’ll be excluding some data you probably didn’t mean to. “Year 0” means the present season, “Year 1” the next season, and “Year -1” the previous season. The three filter categories at the bottom each have drop-down lists, allowing you to filter by three extra statistics of your choosing at once.
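If it helps to see the Year 0 / Year 1 idea spelled out, here is a rough Python sketch of how such a pairing could work. It is not the tool’s actual code: the `pair_seasons` helper, the `seasons` records, and their field names are all made up for illustration, and I’m assuming the PA minimum applies to both seasons in a pair.

```python
def pair_seasons(seasons, stat_x, stat_y, offset=1, min_pa=300,
                 first_year=2007, last_year=2013):
    """Pair each player's Year 0 value of stat_x with Year `offset` stat_y.

    `seasons` is a list of dicts with hypothetical keys "player", "year",
    "pa", plus one key per statistic.
    """
    # Keep only seasons that meet the PA minimum (assumed to apply to both
    # seasons of a pair), indexed by (player, year) for quick lookup.
    by_key = {(s["player"], s["year"]): s for s in seasons if s["pa"] >= min_pa}

    pairs = []
    for (player, year), year0 in by_key.items():
        if not first_year <= year <= last_year:
            continue  # the year range restricts Year 0 only
        other = by_key.get((player, year + offset))
        if other is not None:
            pairs.append((year0[stat_x], other[stat_y]))
    return pairs
```

With `offset=0` this gives same-year pairs; with `offset=1` it gives the “this year vs. next year” pairs used for the predictive correlations.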
A quick refresher on correlations: they range between -1 and 1. A correlation of 1 means that when one stat goes up, so does the other, in a straight line on a graph like the type you see above. Correlate a stat to itself in the same year and you’ll see a correlation of 1; for something more useful, try to correlate same-year OPS and wOBA – it should be 0.993, and pretty dang close to a straight line.
A correlation of -1 should also appear as a straight line, except sloping downward instead of upward; the two stats move in opposite directions. You’d get this if you correlated a stat to the negative of itself, for some strange reason. For something more practical, try K% vs. Contact% in the same year, which should come in at a very strong -0.888.
A correlation of 0 suggests that there’s probably no linear relationship between the two stats, although it is possible for there to be an interesting relationship that escapes the correlation calculation. The graph will be harder to fool, however, so you may want to keep an eye out for strange patterns you see on it.
The Confidence Level box can also be changed. By default, it’s set to provide the estimated boundaries within which the true correlation is 95% likely to lie. You’ll see these below it.
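For the curious, here is a minimal Python sketch of the two calculations described above: the correlation itself and an approximate confidence interval around it. I’m assuming a plain Pearson correlation and the standard Fisher z-transformation for the interval; I can’t promise the tool does it exactly this way, and the function names are my own.

```python
import math
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_interval(r, n, confidence=0.95):
    """Approximate interval for the true correlation via Fisher's z.

    Requires n > 3 and -1 < r < 1.
    """
    z = math.atanh(r)                 # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on that scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)
```

Note that the interval narrows as the sample grows, which is one reason loosening the PA filter (more seasons, but noisier ones) is a trade-off.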
An Exercise in Batted Ball and BABIP Correlational Analysis
By default, you’ll see a comparison on the tool between batters’ PU% in one year and their BABIP in the next. PU%, if you’re confused, is Pop-Up percentage, my unofficial name for infield fly balls per batted ball (batted ball being defined as FB+LD+GB), as opposed to the official stat IFFB%, which is infield fly balls per fly ball. What you’ll notice is that PU% does indeed appear to be fairly predictive of BABIP, in that batters who pop the ball up a lot in one year will tend to have a low BABIP in the next (the correlation is -0.386 in the default sample). Makes sense, right? Of course, it helps a lot that PU% is a fairly predictable stat, with a year-to-year correlation around 0.638, as you can see. For comparison, LD%—line drives per batted ball—has only a 0.366 YTY correlation, while BABIP’s is 0.370. To summarize:
|Correlation with BABIP in Year:|
|Statistic||0 (Same Year)||1 (Next Year)||YTY Correlation (with itself)|
So, although LD% is a significant factor in same-season BABIP, its relative unpredictability makes it a much less reliable indicator of true-talent BABIP skills than PU%. This is also the case with pitchers, whose BABIPs are of course even less predictable.
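Since PU% and IFFB% are easy to mix up, here is a small Python sketch of the two definitions given above. The function names and the sample counts are invented for illustration; the only assumption is that the FB count already includes infield flies, as it does in the standard FanGraphs classification.

```python
def pu_pct(iffb, fb, ld, gb):
    """Pop-ups per batted ball, with batted balls defined as FB + LD + GB."""
    return iffb / (fb + ld + gb)


def iffb_pct(iffb, fb):
    """Infield fly balls per fly ball (the official FanGraphs stat)."""
    return iffb / fb


# Hypothetical batter: 20 infield flies among 150 fly balls,
# plus 90 line drives and 210 ground balls.
example_pu = pu_pct(20, 150, 90, 210)
example_iffb = iffb_pct(20, 150)
```

Put another way, PU% is just IFFB% multiplied by FB%, so a batter with an ugly IFFB% can still have a modest PU% if he rarely hits the ball in the air.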
If you’re curious, here are 2013’s relevant facts for each basic type of batted ball, straight from the league splits on FanGraphs:
|Batted Ball Statistics, 2013|
The low BABIP of fly balls in general might lead you to believe they are less desirable for a hitter than a ground ball. Don’t forget, though, that home runs are excluded from consideration in BABIP, meaning the batting average of a power-hitting fly ball hitter probably isn’t going to suffer as much as you might think. Clearly line drives get the best results, being low-risk with very high rewards. Meanwhile, ground balls are medium-risk, low-reward, and fly balls are high-risk, high-reward; on average, though, FBs are preferable to GBs, as wOBA demonstrates. That’s not even taking into account the increased risk of double plays that comes with ground balls.
As a little bonus, here’s something I queried off of FanGraphs’ top-secret database: a more in-depth breakdown that uses more distinct batted ball types:
|Batted Ball Statistics, 2013|
In this classification system, the two types of “Fliners” are somewhere between fly balls and line drives, and there’s no overlap between the classifications. Relating these to what you see on FanGraphs: IFFB, OFFB, and FlinerF are all counted towards FB, while FlinerL and LD are counted towards LD.
Here, OFFBs are the really high outfield flies which—if they don’t clear the fences—are going to be caught 95.1% of the time. But home runs do occur on 11.1% of these high outfield flies, so you can’t discount them. Remember that these numbers are just averages; for a powerless batter, OFFBs are likely going to be a really bad thing; for a power hitter, they might actually be good. And try not to be confused—in this article’s correlation tool, FlinerFs are included as part of “OFFB.” I’m just not sure if it’s alright for me to let the details of this system out of the bag, unfortunately.
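To put the two OFFB numbers above together, here is a quick back-of-the-envelope calculation in Python. It assumes that every in-park OFFB that isn’t caught goes for a hit (ignoring errors), which is a simplification on my part, and the function name is made up.

```python
def offb_hit_rate(hr_rate=0.111, caught_rate=0.951):
    """Share of high outfield flies that become hits, home runs included.

    Assumes any in-park fly that is not caught is a hit (errors ignored).
    """
    in_park = 1.0 - hr_rate                      # flies that stay in the park
    in_park_hits = in_park * (1.0 - caught_rate)  # the few that drop in
    return hr_rate + in_park_hits
```

With the league numbers, that works out to about 15.5% of high outfield flies becoming hits once home runs are counted, compared with only 4.9% of the ones that stay in the park.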
OK, now forget I mentioned all that stuff about fliners, because I’m going to be referring to the standard FanGraphs batted ball classifications from now on.
Back to BABIP: the main point of it is not to directly value a player, but to be an indicator of how lucky the player was. Skill does come into play, however, especially in the case of batters. But let’s take a look at how batted ball types correlate with a bonus stat I added into the correlation tool: Hits/Batted Ball (let’s call it H/BatBall for short), which is hits divided by the sum of fly balls, line drives, and ground balls.
|Correlation with H/BatBall||Correlation with BABIP|
|Statistic||0 (Same Year)||1 (Next Year)||YTY Correlation (with itself)||0 (Same Year)||1 (Next Year)|
So, with home runs back in the equation, most of the batted ball types’ predictiveness, when it comes to the chance of getting a hit on a batted ball, completely disappears. The exceptions are popups and maybe line drives (a little bit). Also notice that HR/FB, while apparently useless for BABIP, is an important predictor of next-year H/BatBall. Not surprisingly, HR/FB is also a good predictor of wOBA (0.444 YTY correlation).
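In case the difference between the two stats is fuzzy, here is a short Python sketch of both. BABIP uses the standard formula, (H - HR) / (AB - K - HR + SF), and H/BatBall follows the definition given earlier; the function names and the sample stat line are hypothetical.

```python
def babip(h, hr, ab, k, sf):
    """Standard BABIP: home runs are removed from both hits and chances."""
    return (h - hr) / (ab - k - hr + sf)


def h_per_batted_ball(h, fb, ld, gb):
    """Hits per batted ball, home runs included (batted balls = FB + LD + GB)."""
    return h / (fb + ld + gb)


# Hypothetical power hitter: 160 H, 25 HR, 550 AB, 110 K, 5 SF,
# with 160 FB, 90 LD, and 190 GB.
example_babip = babip(160, 25, 550, 110, 5)
example_h_batball = h_per_batted_ball(160, 160, 90, 190)
```

The gap between the two numbers is mostly the home runs, which is why HR/FB matters so much more for H/BatBall than it appears to for BABIP.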
There are some interesting interactions here that take a multiple regression to weed out, though. Remember how I just said HR/FB is apparently useless for BABIP? Regression begs to differ; it outputs a formula for expected next-year BABIP of:
xBABIP = 0.083*HR/FB + 0.1*LD% – 0.55*PU% – 0.013*OFFB% + 0.007*Spd*GB% + 0.283
This formula has a 0.437 correlation with next-season BABIP, and 0.573 with same-season BABIP. More details on the factors:
|Predictive Factors Of BABIP|
|50% Values||95% Values|
|Statistic||Coefficients||Std Error||t Stat||P-value||Lower||Upper||Lower||Upper|
Translation: OFFB% probably doesn’t matter, but the other factors pretty certainly do, especially PU%, followed by Spd*GB% (well, Spd itself works almost as well, leaving GB% out entirely), then HR/FB, then LD%. So, you can cut out OFFB% to make:
xBABIP = 0.08*HR/FB + 0.1*LD% – 0.56*PU% + 0.008*Spd*GB% + 0.278
…which is practically equally good, with a 0.436 correlation to next-year BABIP.
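If you’d like to plug numbers into these yourself, here is a Python sketch of both versions of the formula. The coefficients come straight from the equations above; the function and argument names are mine, and I’m assuming the rate stats go in as fractions (0.21 rather than 21) with OFFB% measured per batted ball, which is what makes the outputs land in a sensible BABIP range.

```python
def xbabip_full(hr_fb, ld, pu, offb, spd, gb):
    """Expected next-year BABIP, full regression (rates as fractions)."""
    return (0.083 * hr_fb + 0.1 * ld - 0.55 * pu
            - 0.013 * offb + 0.007 * spd * gb + 0.283)


def xbabip_reduced(hr_fb, ld, pu, spd, gb):
    """Expected next-year BABIP with the OFFB% term dropped."""
    return (0.08 * hr_fb + 0.1 * ld - 0.56 * pu
            + 0.008 * spd * gb + 0.278)


# Hypothetical batter: 10% HR/FB, 21% LD, 3.5% PU, 30% OFFB, 4.5 Spd, 44% GB.
full = xbabip_full(0.10, 0.21, 0.035, 0.30, 4.5, 0.44)
reduced = xbabip_reduced(0.10, 0.21, 0.035, 4.5, 0.44)
```

For this made-up batter, both versions come out at roughly .303, which fits the point that dropping OFFB% costs next to nothing.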
It might also be a good idea to add current BABIP itself to the equation, to possibly help capture that certain je ne sais quoi about a batter’s BABIP, if simply predicting the next year is the goal. Handedness is likely significant as well. But I’ll save that for another time.
Well, hopefully I’ve given you all enough to play with and to think about for today. Tell us in the comments if you find out something interesting from your experiments!