A couple weeks ago, I wrote an article titled “The control hitters have over LD%,” examining why it’s a bad idea to use single-year line drive rates in any discussion of a hitter’s underlying skills. Afterward, I received an e-mail from a reader who wanted me to go a step further:

Hi Derek,

I really enjoyed your post on the stability of LD% over time. It was very helpful to have the GB% correlation (.65) as a comparison. I want to encourage you to do a post at some point on the stability of a variety of common conventional and sabermetric stats; I fully understand the concept of looking for stable, repeatable skills but I have little idea what is stable and repeatable! For example, how stable is a player’s walk rate? Strikeout rate? HR/FB rate? Just a table of 20 of these stats would be really cool for perspective.

With that, here we go…

### The results

As I said last time, this is far, far from a comprehensive study. For comparative purposes, though, it can be quite useful. Anyway, I looked at all hitters from 2004 through 2008 who amassed at least 350 at-bats in each of two adjacent seasons (and played on the same team both years, to eliminate some park-to-park biases). What you’re seeing is the R-squared results for each stat, which essentially tells us how much of the variation in Year 2 can be explained by the Year 1 figure.

+---------------------------+------+
| STAT                      | R2   |
+---------------------------+------+
| Batting Average           | 0.18 |
| On-Base Percentage        | 0.36 |
| Slugging Percentage       | 0.37 |
| OPS                       | 0.35 |
| ISO Power                 | 0.52 |
| ISO Discipline            | 0.60 |
| Batting Average with RISP | 0.06 |
+---------------------------+------+
| Contact (K) Rate          | 0.76 |
| Walk Rate                 | 0.61 |
| HBP Rate                  | 0.37 |
| Pitches per PA            | 0.61 |
+---------------------------+------+
| BABIP                     | 0.15 |
| 1B per BIP                | 0.21 |
| 2B per BIP                | 0.16 |
| 3B per BIP                | 0.26 |
| AB/HR                     | 0.42 |
| HR/FB                     | 0.59 |
| GIDP Rate                 | 0.13 |
+---------------------------+------+
| LD%                       | 0.09 |
| GB%                       | 0.60 |
| OF FB%                    | 0.52 |
| IF FB%                    | 0.43 |
+---------------------------+------+
| SBO%                      | 0.33 |
| SBA%                      | 0.80 |
| SB%                       | 0.10 |
+---------------------------+------+
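For readers who want to try something similar, here is a minimal sketch of the kind of calculation described above. It is not the exact code behind the table. The row layout (dicts with `player`, `year`, `team`, `ab` keys plus one key per stat) and the function names are assumptions made for illustration, and it assumes one row per player per season.

```python
def paired_seasons(rows, stat, min_ab=350):
    """Collect (Year 1, Year 2) values of `stat` for qualifying hitters.

    A hitter qualifies when he has rows for two adjacent seasons, stayed
    on the same team, and had at least `min_ab` at-bats in both seasons.
    """
    by_key = {(r["player"], r["year"]): r for r in rows}
    year1, year2 = [], []
    for (player, year), first in by_key.items():
        second = by_key.get((player, year + 1))
        if second is None:
            continue  # no adjacent season on record
        if first["team"] != second["team"]:
            continue  # changed teams, so park effects differ
        if first["ab"] < min_ab or second["ab"] < min_ab:
            continue  # not enough playing time in one of the seasons
        year1.append(first[stat])
        year2.append(second[stat])
    return year1, year2


def r_squared(year1, year2):
    """Squared Pearson correlation between paired Year 1 and Year 2 values."""
    if len(year1) != len(year2) or len(year1) < 2:
        raise ValueError("need two equal-length lists with at least two players")
    n = len(year1)
    mean1 = sum(year1) / n
    mean2 = sum(year2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(year1, year2))
    var1 = sum((a - mean1) ** 2 for a in year1)
    var2 = sum((b - mean2) ** 2 for b in year2)
    if var1 == 0 or var2 == 0:
        raise ValueError("a stat with no variation has no defined correlation")
    return cov * cov / (var1 * var2)
```

Calling `r_squared(*paired_seasons(rows, "gb_pct"))` on a full set of player-seasons would give one entry of a table like the one above, where `"gb_pct"` is a hypothetical column name.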

### Quick takeaways

As we always stress here at THT Fantasy, stats like batting average and BABIP are poor indicators of a player’s actual skill. It’s much better to focus on component skills like contact rate, which is one of the most stable stats around. Home runs are relatively stable, which might surprise some but really shouldn’t—after all, Juan Pierre isn’t going to start posting 30-home run seasons, nor is Ryan Howard going to hit only five home runs.

As we saw last time, line drive rate is very unstable, while the other batted ball stats are much more stable. And for those who like to blame hitters for being “unclutch” with runners in scoring position (I hear far too much of this from fellow Mets fans), check out no. 7 on the list: batting average with RISP, at just 0.06.

### Quick glossary

**EDIT**: I’m adding this late per request. Sorry for some things being a little unclear to begin with.

**ISO Power**: SLG-AVG

**ISO Discipline**: OBP-AVG

**Contact (K) Rate**: Contact rate on a per AB basis (not a per pitch basis). Calculated as (AB-K)/AB

**HR/FB**: Home runs per outfield fly ball

**GIDP Rate**: GIDP/BIP

**LD%**: Line drives as a percentage of all non-bunt balls in play

**GB%**: Groundballs as a percentage of all non-bunt balls in play

**OF FB%**: Outfield flies as a percentage of all non-bunt balls in play

**IF FB%**: Infield flies as a percentage of all non-bunt balls in play

**SBO%**: Stolen base opportunity rate. The percentage of times a hitter reaches first and thus is in position to attempt a steal. Calculated as (1B+BB+HBP-IBB)/TPA.

**SBA%**: Stolen base attempt rate. The percentage of times a hitter attempts a steal given that he is on first base. Calculated as (SB+CS)/(1B+BB+HBP-IBB).

**SB%**: Stolen base success rate. The percentage of times a hitter is successful on a steal attempt. Calculated as SB/(SB+CS).
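For anyone who wants to compute these at home, here is a minimal sketch of the glossary formulas that are fully spelled out above. The `Season` class and its field names are assumptions for illustration. The batted-ball stats, HR/FB, and GIDP Rate are left out because their denominators depend on how your data source counts balls in play.

```python
from dataclasses import dataclass


@dataclass
class Season:
    """One hitter-season of inputs (field names are illustrative)."""
    ab: int
    k: int
    tpa: int
    singles: int
    bb: int
    ibb: int
    hbp: int
    sb: int
    cs: int
    avg: float
    obp: float
    slg: float


def glossary_rates(s):
    """Apply the glossary formulas to one season."""
    on_first = s.singles + s.bb + s.hbp - s.ibb   # steal opportunities
    attempts = s.sb + s.cs
    return {
        "iso_power": s.slg - s.avg,                # SLG-AVG
        "iso_discipline": s.obp - s.avg,           # OBP-AVG
        "contact_rate": (s.ab - s.k) / s.ab,       # (AB-K)/AB
        "sbo_pct": on_first / s.tpa,               # (1B+BB+HBP-IBB)/TPA
        # Rates are undefined without opportunities or attempts.
        "sba_pct": attempts / on_first if on_first else None,
        "sb_pct": s.sb / attempts if attempts else None,
    }
```

Returning `None` for a hitter with no attempts is a design choice in this sketch; a real data pipeline might drop those hitters instead.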

### Concluding thoughts

That’s all for today. Any questions, feel free to comment or e-mail me!

The Real Neal said...

“What you’re seeing is the R-squared results for each stat, which essentially tells us how much of the variation in Year 2 can be explained by the Year 1 figure.”

Huh? I am sure you’ve done some nice math here, but that sentence makes no sense. Let me give you a concrete example to illustrate.

Year BA

1 .278

2 .302

What you’re seeing is the R-squared results for each stat, which essentially tells us how much of the .024 can be explained by the .278.

Dave Studeman said...

I’m not sure what your example means, but the R squared measures how much of the variation among all players in Year 2 can be attributed to the variation among those same players in Year 1.

Seth said...

Brilliant idea for a piece. When doing research for my fantasy team next season, I will be sure to look up guys with high contact rates who have underachieved this season…could be another article even.

ThankYouMichaelLewis said...

I’m new to THT and it is fantastic, so bear with me if I can’t make as sophisticated inferences.

If LD% is so unstable, yet it has one of the strongest correlations with batting average/offensive success (retrofitted), then is it the secret weapon in fantasy baseball drafting/projections?

In other words, if we see a player far off the mean LD% of 19%, could that be used as a primary indication as to how the player will perform the following season?

It’s almost as if it’s an anti-correlation in that it can be used to project performance in Year 2 if Year 1 is an outlier.

Thanks in advance for any clarifications.

Note: I’m not even a fantasy baseball player, but I figured it was an easy example of putting future projections in use.

Detroit Michael said...

“Pitchers per PA” is close to 1.0 for everyone in the league. I’m sure you mean “pitches per PA”.

I would guess that Batting Average with RISP appears to be more unstable from year to year than just Batting Average simply because the sample size, the number of PA we are using, for each season is smaller.

Derek Carty said...

Sorry for the confusion, Dave. I added a quick glossary. As to all the other studies, I’m sure there have been loads of them, so I knew I’d miss a whole bunch if I tried (if you have some links handy, though, I’d be happy to add them). This isn’t anything new, just a quick reference for the readers who were looking for one.

The Real Neal,

Dave nailed it. It’s a statistical tool that tells us how much of the variance for the player pool overall can be predicted by the one half of the data. If you’d like a longer explanation, just let me know.

Thanks, Seth (Also, please note that the contact rate I’m referring to is on a per-AB basis, not a per-pitch basis)

David Rasmussen said...

On statistics that are more luck based than skill based (low R-sq), like BABIP or LD%, the way to use them predictively is as follows. Someone has a high BABIP? His batting average for the rest of the year will likely decrease. Likewise, if someone has a high LD%, assume their rate stats will go down. If you are interested in an individual player, compare BABIP and LD% to previous years to learn whether what they are doing may be sustainable.

Example:

Jason Bartlett: BA .332—not sustainable since BABIP is .383 versus career BABIP of .328. His LD% in 2009 is also not sustainable at 26.3%. Previous years are 20.7, 20.1, 22.2, 18.7. (Obviously, Jason’s good year is mostly luck based, but they stuck him on the All Star team, so it must not be obvious to everyone.)

Derek Carty said...

ThankYouMichaelLewis,

Glad to hear you’re enjoying THT. I’m always willing to help people who want to learn, so feel free to ask away whenever you have a question.

As to this specific question, David Rasmussen pretty much nailed it. LD% is a big driver of BABIP, but because it is so unstable, a LD% too far from league average is likely just good/bad luck itself. While it tells us *something* about the hitter, if we were to try to predict his LD%, we’d need to include a heavy proportion of league average, so a guy like Bartlett’s projected LD% going forward might only be 20-21% or so.

We do need to note, though, that for pitchers, BABIP will generally regress to .300. For hitters, everyone regresses to their own unique number (not necessarily .300!), so things become a little trickier to analyze. This is a very important point to remember that many analysts still don’t understand.
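The "heavy proportion of league average" idea can be sketched as simple regression to the mean. This is a rough illustration and not the projection method used at THT. It assumes the 19% league-average LD% mentioned earlier in the comments, and it uses the square root of the table's LD% R-squared as the weight on the observed figure, which is a textbook one-season shortcut.

```python
from math import sqrt


def regress_to_mean(observed, baseline, r):
    """Shrink an observed rate toward a baseline.

    `r` is the year-to-year correlation: the share of the gap between
    the observed rate and the baseline that is expected to persist.
    """
    return baseline + r * (observed - baseline)


LEAGUE_LD_PCT = 19.0   # assumed league average, taken from the comments
R_LD = sqrt(0.09)      # correlation implied by the table's LD% R-squared

# Bartlett's 2009 LD% of 26.3, regressed toward league average.
projected = regress_to_mean(26.3, LEAGUE_LD_PCT, R_LD)
print(round(projected, 1))
```

With these inputs the projection comes out at about 21%, in the same neighborhood as the estimate above. For a stat like hitter BABIP, where each hitter regresses toward his own number, the `baseline` argument would be that hitter's established level in place of the league average.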

Derek Carty said...

Detroit Michael,

You’re right; I changed “pitchers per PA” to “pitches”. Good catch.

You’re absolutely right on BA with RISP as well. If we’re looking at players with 350 ABs for the year, they might only have 150 ABs or so with RISP, so the number is much more unstable. If we were to look at all batters with exactly 350 regular ABs and all batters with exactly 350 ABs with RISP (given a large, fictional, perfectly-constructed-for-our-needs data set), the correlations would probably be almost identical.
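The sample-size point can be sketched with a simple model: treat each at-bat as a coin flip weighted by the hitter's true talent, so a season's batting average is true talent plus binomial noise. Under that assumption, the expected correlation between two seasons of equal length is the talent variance divided by talent variance plus noise variance. The talent spread and league average below are made-up values for illustration, not estimates from the data.

```python
def expected_year_to_year_r(talent_sd, league_rate, at_bats):
    """Expected correlation between two seasons of a binomial rate stat.

    Each season is modeled as true talent plus binomial sampling noise,
    with the same number of at-bats in both seasons.
    """
    talent_var = talent_sd ** 2
    noise_var = league_rate * (1 - league_rate) / at_bats
    return talent_var / (talent_var + noise_var)


TALENT_SD = 0.020    # hypothetical spread of true batting-average talent
LEAGUE_AVG = 0.270   # hypothetical league batting average

for ab in (150, 350):
    r = expected_year_to_year_r(TALENT_SD, LEAGUE_AVG, ab)
    print(ab, "AB: expected R-squared", round(r ** 2, 2))
```

The same talent spread yields a much lower expected R-squared at 150 AB than at 350 AB, which is the pattern the table shows for batting average with RISP against overall batting average.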

Jonathan said...

Derek –

Great question that you ask here.

From what you’ve got here, I’m guessing you did an auto-regression with 1 lag estimated using OLS, no?

If so (and perhaps even if not):

R-squared isn’t exactly the metric that we want for measuring repeatability. For example, you can have a high R-squared (meaning that the explanatory variables capture a lot of the explained data’s variance) and still have the coefficient on the lagged variable near zero (which means that next year’s statistic is likely to be near the league average even if this year’s wasn’t). In this case (high Rsq, low coefficient), the regression captures well that the stat is not repeatable.
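The distinction between R-squared and the coefficient on the lagged variable can be sketched with a one-lag ordinary least squares fit. This is an illustration with made-up numbers; the function name is hypothetical, and nothing here comes from the article's data.

```python
def fit_lag(year1, year2):
    """Fit year2 = intercept + slope * year1 by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    if len(year1) != len(year2) or len(year1) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    n = len(year1)
    mean1 = sum(year1) / n
    mean2 = sum(year2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(year1, year2))
    var1 = sum((a - mean1) ** 2 for a in year1)
    var2 = sum((b - mean2) ** 2 for b in year2)
    if var1 == 0 or var2 == 0:
        raise ValueError("both seasons need some variation")
    slope = cov / var1
    intercept = mean2 - slope * mean1
    return slope, intercept, cov * cov / (var1 * var2)


# Made-up data: Year 2 keeps only a tenth of each Year 1 gap,
# yet the fit is perfect.
slope, intercept, r2 = fit_lag([10, 20, 30], [11, 12, 13])
print(slope, intercept, r2)
```

In this toy case the points fall exactly on a line, so R-squared is at its maximum, while the small slope says that very little of a Year 1 difference carries into Year 2. Reporting both numbers separates "how predictable" from "how much carries over".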

Dave Studeman said...

I guess I’d make a few points here. One is that there are many ways to calculate something like this, as Jonathan pointed out. In the 2007 THT Annual (which you can read for free at Wowio), David Gassko used a binomial correlation in addition to the year-to-year correlation and found a higher figure (.32 vs .13 for line drives, for instance, which is what he and JC got from year-to-year correlation).

Over a career, or a “significant” amount of time, you will find differences between batters. Freddy Sanchez is a line drive hitter. Jason Giambi isn’t. That’s obvious, but it’s worth repeating I think.

Lastly, remember this analysis (and virtually all analyses like it) has been conducted for established major league players by necessity. They’re the ones we have the data for. If you were to expand the sample to include minor leaguers, or players with cups of coffee, you’d find that line drive hitting (and virtually all the other measures) is more predictable than these results indicate.

Derek Carty said...

Yeah, Jonathan, as I said, there are much better ways to do this sort of thing. This is far, far from perfect or comprehensive or flawless. All this is is a simple reference guide for those who haven’t seen anything like this yet. There are definitely flaws, but I wasn’t looking to be super precise. For comparative purposes, all I’m trying to do here is say “BA is unstable, contact rate is stable. BABIP is unstable, HRs are somewhat stable. LD% is unstable, GB% is stable. etc, etc.” The results this produces are roughly in line with what we get from a more complex study, which suits what I was going for.

ThankYouMichaelLewis said...

Thank you Dave and Derek.

When defining an offensive player’s lucky season, am I correct in assuming that LD% is the single biggest determinant, since it is what causes an abnormally inflated/deflated BABIP?

Also, what about defensive luck for positional players? I ask this because I still have a hard time with UZR due to its annual fluctuation.

Could a pitcher’s unusually high LD% or BABIP cause a fielder to have a significantly lower UZR?

I think I’m mostly hung up on UZR because a guy like Teixeira grades negatively, yet I see him make game-saving plays every single night (but that’s for another article).

Colin Wyers said...

It should be remembered that all of these correlations are artificially high due to the 350 AB cutoffs used – that substantially reduces the variance and therefore increases the correlation. This is why a weighted correlation is preferable.
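One way to read the weighted-correlation suggestion is to keep every hitter and let playing time set each hitter's influence, in place of a hard cutoff. The comment does not say which weights to use, so the choice of weights (for example, at-bats) and the function name below are assumptions for illustration.

```python
from math import sqrt


def weighted_correlation(x, y, weights):
    """Pearson correlation where each pair counts in proportion to its weight."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y, and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    cov = sum(w * (a - mean_x) * (b - mean_y)
              for w, a, b in zip(weights, x, y))
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x))
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, y))
    if var_x == 0 or var_y == 0:
        raise ValueError("both variables need some weighted variation")
    return cov / sqrt(var_x * var_y)
```

With equal weights this reduces to the ordinary correlation, so the two approaches can be compared directly on the same set of hitters.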

Jonathan said...

Gotcha on keeping things simple. I’d probably just report the coefficient on the lagged variable. Under the same assumptions you’re using, it would be just as informative. Under less restrictive assumptions, it would be more informative. Of course, your articles are in any case also extremely informative.

Dave Studeman said...

Are these stats defined anywhere? For instance, is OF FB% a percentage of all balls hit that are outfield flies, or a percentage of fly balls that are outfield flies? And what is SBO% and the other SB stats?

Last point: it would be nice to see references and comparisons with the many other studies of this that have been done in the past.