Statistical shenanigans (part 1)
by John Beamer
March 02, 2009
I apologize in advance for an overtly sabermetric article. I have to vent. It won’t happen again, I promise—except in part 2.
Correlation and regression coefficients are perhaps the two most abused statistical measures by (baseball) analysts. How often do you see a baseball study quoting a correlation of 0.2, or an R squared of 0.49, and being told that the result is meaningful? Quite often I’d posit. Is it? Some studies say a correlation of 0.3 is strong, others dismiss it. Who is right?
As you probably imagine the answer isn’t simple. Regression and correlation are two very useful tools, no question. But one must be clear about their limitations before drawing meaningful insight from data. Today I want to share with you three lessons that I urge you to heed the next time you come across a study relying on correlation or regression. In part 2 we'll spend more time specifically on interpreting regression analyses—this article focuses mostly on R and R squared metrics.
(Note: Nothing I’m saying in this article is new. Others like MGL, Tango and Phil Birnbaum have written extensively and more lucidly than I ever could on this topic. What I want to do is to use real data to show some statistical watch-outs.)
Before starting let’s be clear on definitions. Correlation is defined as the degree of relationship between two data sets (technically it is the amount of shared variance between two data sets). A correlation is a unitless number ranging from 1 to -1. It is denoted by the letter R. An R of 1 implies a perfect relationship—if you were to plot the two variables you'd be able to draw a straight line through all the points. If you take a ton of towns and cities and plot the distance between them in meters on the x-axis and yards on the y-axis the R is 1 (clear relationship). If you plot meters on the x-axis and, say, height above sea-level on the y-axis, R is close to zero (absolutely no relationship).
R squared is a frequently used statistic. This is simply R, well, squared. If R = 0.5, R squared = 0.25, and so on. R squared is a common output from a regression analysis and measures the share of the variance in one variable that is accounted for by the other. We'll expand on these definitions later. At this point all you need to know is that correlation and regression are intimately related.
Lesson 1: Understand the context
That’s enough on definitions.
Let’s dive into data and have a look at some correlations. A common technique to determine whether a team/hitter/pitcher has any talent is to run a year-to-year correlation. This works on the premise that if the talent you are trying to measure is a skill then a player who shows more of that talent in, say, 2008 will repeat in 2009.
We can do this for batting average. To overcome sample size limitations I took all player seasons going back to 1980 and created paired samples by using even and odd years. For instance, Bonds 1990 and 1991 is one paired sample; his 1992 and 1993 is another; and so on. This gives us 7384 paired samples—which should be enough to get a reliable correlation.
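For readers who want to try this at home, here is a minimal sketch of the pairing step and the correlation itself. The data layout (a dict keyed by player and year), the function names and the sample seasons are all made up for illustration, and the at-bat minimum is treated as "at least this many in both years", which is an assumption about how the cut-off is applied.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def paired_seasons(seasons, min_ab):
    """Pair each even year with the following odd year for the same player.

    `seasons` maps (player, year) -> (at_bats, hits). A pair is kept only
    if both seasons have at least `min_ab` at-bats.
    """
    pairs = []
    for (player, year), (ab, hits) in seasons.items():
        if year % 2 != 0:
            continue  # pairs always start on an even year, e.g. 1990-1991
        following = seasons.get((player, year + 1))
        if following is None:
            continue
        ab2, hits2 = following
        if ab >= min_ab and ab2 >= min_ab:
            pairs.append((hits / ab, hits2 / ab2))
    return pairs


# Made-up seasons, for illustration only.
seasons = {
    ("A", 1990): (500, 150), ("A", 1991): (520, 151),
    ("B", 1990): (450, 117), ("B", 1991): (400, 108),
    ("C", 1990): (300, 75), ("C", 1991): (90, 20),  # dropped at min_ab=100
    ("D", 1992): (600, 186), ("D", 1993): (580, 168),
}
pairs = paired_seasons(seasons, min_ab=100)
r = pearson_r([even for even, _ in pairs], [odd for _, odd in pairs])
print(len(pairs), round(r, 2))
```

With real data the only thing that changes is the `seasons` dict; three pairs is obviously far too few to say anything.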
Below is the graph we get if we plot odd against even years assuming a cut-off of 100 at-bats in both years. (We’ll return to this assumption later.) This leaves 3,602 paired samples.
We can see a relationship and our calculator reveals an R of 0.37. We can interpret this as follows: If a player’s batting average is one standard deviation above the mean in year one then in year two it’ll be 0.37 standard deviations above the mean. To put some numbers to that: if a player hits .300 in year one, and the mean batting average is .267 with a standard deviation of .033, then we expect that player to bat roughly .280 in year two.
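The arithmetic behind that estimate can be sketched in a few lines. The function name is mine, and the sketch assumes the mean and standard deviation are the same in both years, which the text implies but does not state.

```python
def predict_year_two(ba_year_one, mean, sd, r):
    """Shrink the year-one z-score by r, then convert back to an average.

    Assumes the league mean and standard deviation are the same in both years.
    """
    z_year_one = (ba_year_one - mean) / sd
    return mean + r * z_year_one * sd


predicted = predict_year_two(0.300, mean=0.267, sd=0.033, r=0.37)
print(round(predicted, 3))  # 0.279, in line with the roughly .280 quoted above
```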
Would you be prepared to take that at face value? Is an R of 0.37 a lot? What does it really mean? Is batting average a repeatable skill?
To probe these questions we must understand the context of the study. The first concern is the at-bat cut-off. By having a lower limit of 100 at-bats we’ll have a bunch of bench players and pitchers in our sample that could bias the data. After all, our intuition tells us that some players have more batting talent than others.
What happens if we push up the at-bat cut-off to 300 at-bats? R rises to 0.46. 400 at-bats? R increases again, this time to 0.50. If we keep increasing the at-bat limit we find a startling relationship: the more at-bats players have the stronger the year-to-year relationship.
This actually isn’t too surprising. Correlation is dependent on two factors. One is the spread of talent as, obviously, the more diverse the talent base the greater the likelihood that a relationship will show—think about it, if everyone bats .280 R will be zero. Two is the number of trials in each sample (number of at-bats) as this reduces the uncertainty in our measurement.
So by upping the at-bat limit the correlation improves as the error around each player sample decreases. The implication is profound though: we can get vastly different Rs just by manipulating the data differently. At an extreme, if we had infinite trials our R would be 1, a perfect relationship! When looking at a year-to-year correlation it is important to understand the context.
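A quick simulation makes the point about trials. This is a sketch on invented players, not the real data: every player gets a fixed "true" batting average drawn from a normal curve (the mean of .267 comes from the text, the talent spread of .025 is purely my assumption), and each at-bat is an independent coin flip at that rate. The talent pool is the same sort of pool each time; only the number of at-bats changes.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_r(at_bats, players=1000, talent_mean=0.267, talent_sd=0.025, seed=1):
    """Year-to-year R for simulated players who all get `at_bats` per season.

    talent_sd is an assumed spread of true talent, not a measured one.
    """
    rng = random.Random(seed)
    year_one, year_two = [], []
    for _ in range(players):
        talent = min(max(rng.gauss(talent_mean, talent_sd), 0.0), 1.0)
        for season in (year_one, year_two):
            hits = sum(rng.random() < talent for _ in range(at_bats))
            season.append(hits / at_bats)
    return pearson_r(year_one, year_two)


results = {ab: simulate_r(ab) for ab in (50, 200, 600)}
for ab, r in results.items():
    print(ab, round(r, 2))
```

Under these assumptions R climbs steadily as the at-bats go up, even though nobody's talent has changed.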
Let me show you something else surprising. Reduce the at-bat cut-off to 30. What R do we get? Unbelievably it shoots up to 0.53. In fact, here is the correlation across a whole range of different at-bat cut-offs.
Cut-off   # batters   R
1         6996        0.33
10        5890        0.50
20        5420        0.54
30        5072        0.54
50        4506        0.51
70        4034        0.40
100       3602        0.37
200       2550        0.43
300       1801        0.46
400       1141        0.50
500       556         0.46
550       321         0.41
600       90          0.34
That doesn’t fit with the theory above. The issue here is deeper than the limit we place on the at-bats. There is something odd going on with the other determinant of correlation: the spread in talent.
It turns out that we are at the mercy of selective sampling. Think about it: by definition players with fewer at-bats are likely to be worse performers. Between 30 and 300 at-bats we are adding a ton of low-quality hitters, which shifts the shape of the talent curve and makes the correlation appear stronger. In our original lingo, the spread in talent has increased dramatically.
This confuses our conclusions. There is a danger that by using a cut-off of 30 at-bats we’d conclude that batting average is a stronger talent than it really is. To prove the point, the R between batting average and at-bats is 0.55, suggesting, rightly, that better players get more playing time. The old aphorism that correlation doesn't imply causation is certainly true here.
Lesson 2: Interpret correctly
Another correlation debate doing the rounds in stat circles was the conclusion by the authors of a book called Wages of Wins that payroll in baseball isn’t strongly linked to wins. To prove this they run a regression between payroll and wins and report an R squared of 0.18.
Their argument is that the R squared is quite small. In statistical lingo the variance in payroll only explains 18 percent of the variance in wins. Other factors such as luck, strength of the farm, weather, and God only knows what else—we are not told—account for the remaining 82 percent.
The issue is that we have no idea whether an R squared of 0.18 is meaningful or not.
The first test is to look at the study and apply lesson one: understand the context. If we only take the first two weeks of the season what will the data show? Not a lot I’d guess. It doesn’t need a post doc to work out that two weeks is far too short a period in which to measure talent. As we learned from the batting average study, the greater the number of trials the higher the R squared. Over a couple of weeks an R squared of 0.18 is a lot more impressive than if it were for two seasons.
As it happens the data spans 162 games or a season—does that make the 0.18 impressive? Bear with me ... but we simply can’t tell. Let’s revisit the spread in talent argument to work out how best to interpret the results.
Imagine the quite ridiculous situation where each team has the same payroll. Now even if all teams were of equal talent they wouldn’t all win the same number of games. Some would be lucky, others less so. Either way the R would be 0. Suppose one team added $7m to its payroll and wins a few more games as a result. An R squared of 0.18 in this case is quite impressive—after all just one team has accounted for all this variance.
On the other hand, each team could have vastly different payrolls but with a much looser association to wins. An R squared of 0.18, the SAME as above, in this context would be much less impressive.
The point is that we have two effects counteracting each other. The 18 percent could be caused by a really strong link between payroll and wins but little spread in payroll among teams, or could be caused by a large spread in payroll but only a slight link. We can’t tell by looking at the R squared alone! Let me repeat that. An R squared of 0.18 tells you absolutely nothing except that there is some sort of relationship.
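To see the two effects trade off, here is a small sketch using the simplest possible model: wins = slope × payroll + noise, with the noise independent of payroll. Every number in it is hypothetical (the payroll spreads of $5m and $40m and the 10 wins of noise are mine, not from the study); the only thing held fixed is the R squared of 0.18.

```python
from math import sqrt


def r_squared(slope, sd_payroll, sd_noise):
    """R squared for wins = slope * payroll + noise (noise independent of payroll)."""
    explained_var = (slope * sd_payroll) ** 2
    return explained_var / (explained_var + sd_noise ** 2)


def slope_for_r_squared(target, sd_payroll, sd_noise):
    """Slope (wins per $1m) needed to hit a target R squared in the same model."""
    explained_sd = sd_noise * sqrt(target / (1 - target))
    return explained_sd / sd_payroll


SD_NOISE = 10.0  # hypothetical: wins of variation unrelated to payroll

# Scenario 1: payrolls bunched together (sd of $5m), strong link to wins.
tight = slope_for_r_squared(0.18, sd_payroll=5.0, sd_noise=SD_NOISE)
# Scenario 2: payrolls widely spread (sd of $40m), weak link to wins.
wide = slope_for_r_squared(0.18, sd_payroll=40.0, sd_noise=SD_NOISE)

print(round(tight, 3), round(r_squared(tight, 5.0, SD_NOISE), 2))
print(round(wide, 3), round(r_squared(wide, 40.0, SD_NOISE), 2))
```

Both scenarios report the same R squared, yet an extra million dollars buys eight times as many wins in the first as in the second.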
We can glean a bit more information if we dig a little further into the data. Behind every regression stands an equation that gives us more information and the regression coefficients tell us the size of the effect. Here it transpires that a win costs $5 million. Is this a lot? I’ll leave that for you to debate.
How does this change if we increase or reduce the number of trials? The short answer is it doesn’t. Even if we only use a week’s worth of data this $5 million stays the same. However, the uncertainty in our answer greatly increases. If we do a series of weekly correlations we’ll see that one week could give us $15 million a win while the next may give us $2 million a win. A longer time period will give us tighter confidence intervals, which means we are more certain of the result.
Surely 18 percent R squared tells us something?
Yes, it does. It allows us to answer the question: how important is payroll when trying to work out how many games a team will win in a year? The answer is that payroll variance accounts for 18 percent of the total variance. In math speak, if the standard deviation of wins in a season is 11 then the variance equals 121. Take away 18 percent and the remaining variance is 99, which is a standard deviation of about 10. Put in English, knowing a team's payroll allows us to reduce the error in our estimate by about one win.
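The same sum, spelled out as a sketch for anyone who wants to plug in a different R squared. The variable names are mine.

```python
from math import sqrt

sd_wins = 11.0    # standard deviation of team wins in a season, from the text
r_squared = 0.18  # share of win variance accounted for by payroll

total_var = sd_wins ** 2                       # 121
remaining_var = total_var * (1 - r_squared)    # about 99
remaining_sd = sqrt(remaining_var)             # just under 10 wins
error_reduction = sd_wins - remaining_sd       # about 1 win

print(round(remaining_var, 1), round(remaining_sd, 2), round(error_reduction, 2))
```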
What accounts for the rest of the variance?
Luck is probably the main factor. If we strip this out the maximum R squared we can feasibly get is about 0.5 (see note at end). In this context an R squared of 0.18 suddenly doesn’t look too shabby after all!
Lesson 3: Apply the results appropriately
The foibles of regression and correlation we discuss above illustrate perfectly why it is essential to understand regression to the mean when analyzing baseball statistics.
We saw above that the higher the number of trials the stronger the correlation coefficient. This has implications when we try to evaluate talent. If player A has hit .300 in 30 at-bats and player B has hit .280 in 300 at-bats, who is better?
To answer we must use regression to the mean.
The concept is straightforward but critically important. Dave Studeman wrote a very readable article on this a couple of years ago—it is, however, worth repeating. It is most simply illustrated by reverting to the batting average data we used earlier. If we chop the year one batting averages into performance quartiles and compare the average from year one to year two we should see a convergence towards the mean. (Note: we’re using a minimum of 300 at-bats.)
Quartile   Year 1 BA   Year 2 BA
1          0.242       0.260
2          0.268       0.270
3          0.287       0.279
4          0.314       0.293
The regression to the mean equation is simply:
R = Ave AB / (Ave AB + X)
So, for a 300 at-bat cut off we have Ave AB = 490 and R = 0.46. This allows us to work out X, which is 575. That means in order to estimate a player’s batting average in year 2 we have to add 575 “average” at-bats.
If we add 575 average at-bats at 0.277 we get:
Quartile   Year 1 BA   Year 2 BA   Year 2 BA (Theoretical)
1          0.242       0.260       0.260
2          0.268       0.270       0.273
3          0.287       0.279       0.282
4          0.314       0.293       0.295
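Here is a sketch of that calculation: first rearrange R = Ave AB / (Ave AB + X) to get X, then blend each quartile's year-one average with X at-bats of league-average hitting. The function names are mine, and I am assuming every quartile has the overall average of 490 at-bats, which will not be exactly true.

```python
def regression_constant(avg_ab, r):
    """Solve R = avg_ab / (avg_ab + X) for X."""
    return avg_ab * (1 - r) / r


def regress_to_mean(observed_ba, at_bats, mean_ba, x):
    """Blend the observed average with x at-bats of league-average hitting."""
    return (observed_ba * at_bats + mean_ba * x) / (at_bats + x)


x = regression_constant(avg_ab=490, r=0.46)
print(round(x))  # 575

year_one_by_quartile = [0.242, 0.268, 0.287, 0.314]
# Assumes each quartile averages 490 at-bats; uses the rounded X of 575.
estimates = [
    regress_to_mean(ba, at_bats=490, mean_ba=0.277, x=575)
    for ba in year_one_by_quartile
]
for ba, estimate in zip(year_one_by_quartile, estimates):
    print(ba, round(estimate, 3))
```

Run on the rounded figures from the table, this lands within about a point of batting average of the theoretical column; the small gaps presumably come from rounding in the published inputs.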
Hey, it comes out pretty close! Another test is to see how our correlation adjusts based on the number of at-bats in the sample. Below are the correlations we’d expect to see with our batting average data adjusting for sample size.
Cut-off   Count   R       Expected R
1         6996    0.332   0.297
10        5890    0.499   0.330
20        5420    0.543   0.345
30        5072    0.536   0.357
50        4506    0.506   0.377
70        4034    0.404   0.395
100       3602    0.369   0.411
150       3008    0.404   0.433
200       2550    0.427   0.450
250       2164    0.438   0.464
300       1801    0.465   0.476
350       1489    0.476   0.486
400       1141    0.500   0.497
450       829     0.474   0.508
500       556     0.456   0.517
550       321     0.412   0.527
600       90      0.336   0.539
We can see that although it works well in the 150-450 at-bat range, outside of this it breaks down. This is because above 500 at-bats the spread in talent becomes smaller (more elite hitters) and below 100 at-bats the spread in talent is wider (more quad-A players). We know that a hitter like Albert Pujols is going to be a lot better than Brad Ausmus. We’re not taking that into account as we’re just regressing to the overall mean.
This raises another important point, which is we must always regress to the most appropriate mean. There are a number of ways to approach this:
- Rather than use single-season plate appearances we can use career plate appearances. A player who has more career at-bats tends to be a better player so should be regressed to a higher mean
- Use any other available information when regressing, particularly if little at-bat information is available. Examples are: handedness, size, weight, line-up spot
I want to leave you with two more profound insights that regression to the mean leads to, this time on the pitching side of the ledger:
- After half a season bullpen performance regresses about 75 percent to the mean. In other words it is very hard to tell anything about bullpen talent based on half a year's worth of data
- After a full season the amount to regress a starter's ERA is about 70 percent whereas for a stat like FIP it is closer to 40 percent. That doesn't mean that Johan Santana is suddenly going to register an ERA of 4.50 next year—he's got a decade of pitching seasons to regress to. But it does mean that Tim Lincecum might not be quite as good as we thought (although his 2008 FIP was especially impressive)
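To make those percentages concrete, here is a sketch of what "regress 70 percent" does to a single season. The pitcher's 2.60 and the league mean of 4.30 are hypothetical numbers picked for illustration, and the sketch ignores any earlier seasons, which is exactly the extra information a veteran's track record would add.

```python
def regress(observed, mean, regress_fraction):
    """Move `observed` the given fraction of the way toward `mean`."""
    return observed + regress_fraction * (mean - observed)


LEAGUE_MEAN = 4.30  # hypothetical league-average ERA and FIP
OBSERVED = 2.60     # hypothetical full-season mark for a starter

era_estimate = regress(OBSERVED, LEAGUE_MEAN, 0.70)  # ERA: regress about 70 percent
fip_estimate = regress(OBSERVED, LEAGUE_MEAN, 0.40)  # FIP: regress about 40 percent

print(round(era_estimate, 2), round(fip_estimate, 2))  # 3.79 3.28
```

The same 2.60 tells us a good deal more when it is a FIP than when it is an ERA.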
Rounding it All Up
Today we’ve covered some basic but critically important statistical concepts. Apologies for the heavy reading and especially to those who are fluent in these concepts—these concepts are pretty fundamental to any baseball analysis.
Happy data crunching, folks!
NOTE: CALCULATING MAXIMUM R SQUARED FOR PAYROLL AND WINS
R = var(expected)/var(observed) = 85/121 = 0.7, and so R squared = 0.5. (The 121 is the observed variance in wins from Lesson 2: a standard deviation of 11, squared.)
There is also a more complicated method to regress to the mean outlined in The Book. This involves using mathematical gymnastics to compute the implied observed variance from each sample point (ie, each batter) and dividing the expected variance by the implied observed variance to get an R, from which you can work out the regression to the mean factor.
John is an unashamed glory supporter having followed the Atlanta Braves since 1991. He blogs the Braves at Chop-n-Change. He welcomes comments, criticisms and suggestions via e-mail