Sabermetrics: A Plea for Tolerance

Zack Cozart is a good fielder, but it’s harder to tolerate him at the plate. (via Patrick Reddick)

To get the disappointing news out of the way, this article does not contain the latest hot take on Yasiel Puig or the perceived death of baseball.

It does address a statistical topic that is becoming more widely appreciated, but not well understood. That topic is “multicollinearity”—an ugly-sounding term for something that can be a problem, but is not as big a problem as some people seem to think.

Multicollinearity arises in the context of statistical regression. Regression is one of the most popular methods used in baseball research, helping us reveal actual—rather than merely suspected—relationships between baseball statistics. With regression, we can demonstrate that on-base percentage predicts run-scoring better than batting average, confirm that strikeouts and walks are highly predictive of a pitcher’s earned run average, and so on.

There are many forms of regression, but in the baseball world, linear regression is often the most useful. There are many reasons for this: (1) many baseball statistics are continuous variables, and thus well-suited for linear regression; (2) linear regression allows us to “control” for variables that otherwise would hide important effects; (3) the simplicity of linear regression makes it easier to model future accomplishments; and (4) unlike more complex methods, linear regression allows us to use statistical significance to determine if perceived relationships are actually meaningful.

At the same time, linear regression makes a number of (generally reasonable) assumptions about the data being analyzed. It assumes that the “predictor” variables have a linear relationship with the “outcome” variable. It assumes that the data are normally distributed. It assumes that when a prediction is wrong, it will consistently be wrong in the same ways. Finally, and particularly relevant to our discussion today, linear regression assumes that the predictors work independently of one another, and are not trying to explain the same phenomenon—in statistical lingo, this is known as avoiding “multicollinearity.”

So let’s talk about multicollinearity. As outlets like FanGraphs and Baseball Prospectus, along with pioneers such as Pete Palmer and Bill James, have increased baseball literacy, they have also increased statistical literacy. More and more readers are thus aware of multicollinearity, at least in a general sense. They know we need to “be careful” with predictors that seem to analyze “the same thing.” What they often don’t appreciate is that this overlap usually doesn’t matter, or how to tell when it actually does. In defense of those readers, baseball writers rarely take the time to reassure readers that they have taken reasonable steps to control multicollinearity in their models.

Let’s create an example. Assume that I want to predict which factors affect a team’s winning percentage in one-run games. To do so, I compile the records of all major league teams in one-run games over multiple seasons, and propose to predict each team’s one-run game winning percentage (the “y” or “outcome” or “dependent” variable) from the following team statistics: the pitching staff’s Reliever SIERA and Quality Start Percentage, the quality of the team’s defense (by UZR), and the lineup’s Isolated Power and Clutch Hitting (each of these statistics is an “x” or “predictor” or “independent” variable).

If the resulting article receives any significant readership, then I can predict, even without a regression, that certain readers will worry about the inclusion of both Isolated Power and Clutch Hitting as predictors. Many clutch hits are in fact extra-base hits, so perhaps the two variables are trying to explain the same thing. Some readers would state this question politely; others would be less polite, branding the entire exercise “worthless” or something similarly uplifting.

Before we address this particular hypothetical, we first need to understand why multicollinearity is often not that big of a deal. That is because there are basically no completely independent or completely dependent variables, and as long as you are somewhere in the middle—which most pertinent variables are, in relation to each other—you are probably fine. In the big picture, multicollinearity is only problematic if the predictors do not overlap in a similar way when we apply the model to new data. At the same time, multicollinearity is still worth minimizing when possible: it makes models more volatile, can obscure the true causes of a phenomenon, and can make models more complex than they need to be. When it comes to modeling, simpler is usually better.

So, what do we do? From the writer’s perspective, we could follow the time-honored advice to “ignore the comments.” But most of us write to persuade, and if people are getting hung up in the middle of your article, they probably are not being persuaded. They may not even finish the article. When reporting statistical results, we customarily report the makeup of our samples, the methods we used, and the statistical significance of the results, and informed readers pay close attention to those disclosures. Isn’t it time we gave readers the same assurances when it comes to multicollinearity?

My recommended solution is to report the “tolerance” among the predictor variables when publishing the results of any linear regression. You don’t hear about Tolerance much, but it is both simple and highly useful. (Be sure not to confuse “tolerance” with “tolerance intervals,” which are something different.) Tolerance is defined as:

Tolerance_j = 1 − R_j^2

where R_j^2 is the coefficient of determination obtained by regressing predictor j on the remaining predictor variables, taken together. Tolerance allows both the author and reader to quickly assess the independence of the predictors from each other, with values being reported on the familiar 0 to 1 scale. In general, Tolerance values of .2 or less for a predictor indicate a multicollinearity issue, and a Tolerance value of .1 or less is a red alert. If you see Tolerance values around .1, you need to seriously consider some other predictors (unless the predictor in question serves solely as a control variable). Even at .2, you may want to make some adjustments.
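To make the definition concrete, here is a minimal R sketch of the calculation by hand; the data frame df and the predictor names x1 through x3 are hypothetical:

> # Tolerance of x1: regress it on the other predictors, take 1 - R^2
> aux.lm <- lm(x1 ~ x2 + x3, data = df)
> 1 - summary(aux.lm)$r.squared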

Instead of Tolerance, some of you may be more familiar with the so-called “variance inflation factor,” or VIF. The VIF is the reciprocal of the Tolerance:

VIF_j = 1 / Tolerance_j = 1 / (1 − R_j^2)

Because VIF is the reciprocal of Tolerance, its thresholds are reciprocal as well: instead of .2 as the level of possible concern, researchers use a VIF of 5 (1/.2), and instead of .1 as the red alert, a VIF of 10 or more (1/.1) signals serious multicollinearity.

Of the two expressions, I prefer Tolerance, although either one is certainly fine. Most people’s preference probably depends on what they are used to seeing. I like that Tolerance uses the familiar range of 0 to 1. But use whichever one you want; since each is so easily derived from the other, researchers might even wish to report both.

Like most statistical functions, both Tolerance and VIF can be calculated through common statistics programs. In R, multiple packages (such as “car”) will calculate the VIF for any linear model you’ve created. (As far as I can tell, Tolerance is not available as a pre-defined function in any R package, which is odd, but you can calculate it simply by taking 1/VIF.) If you prefer Microsoft Excel, you will need to download an add-in package. I recommend the free Real Statistics package, which adds Tolerance, VIF, and a host of other useful statistical functions to Excel’s formula list.
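By way of illustration, here is a minimal sketch, assuming the car package is installed and fit is an existing linear model object:

> library(car)
> vif(fit)      # variance inflation factor for each predictor
> 1 / vif(fit)  # Tolerance, as the reciprocal of VIF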

Now that we have defined our tool(s), let’s return to the original example and address the reader’s hypothetical concern. Are Clutch Hitting and Isolated Power collinear, such that they are not appropriately considered together in the model? Let’s find out.

You’ll recall that we were looking to predict a team’s winning percentage in one-run games (the “outcome” variable) from these “predictor” variables: (1) Reliever SIERA, (2) Team Defense (by UZR), (3) Quality Start %, and of course the lineup’s (4) Clutch Hitting and (5) Isolated Power. We’ll use R for this example.

Inside R, we create the following linear model:

> SS1.lm <- lm(WP.1R ~ R.SIERA + Def. + QS. + ISO + Clutch)
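One plausible way to obtain both diagnostics from this model (a sketch, again assuming the car package is loaded) is:

> round(1 / vif(SS1.lm), 2)  # Tolerance
> round(vif(SS1.lm), 2)      # VIF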

We then look at Tolerance and VIF, respectively, for the various predictors, and round the results to two decimal places:

Tolerance

R.SIERA   Def.   QS.    ISO    Clutch
0.91      0.88   0.89   0.96   0.95

VIF

R.SIERA   Def.   QS.    ISO    Clutch
1.10      1.14   1.13   1.04   1.05

As it turns out, these predictors are remarkably independent of one another. For Tolerance, remember that the best possible score for independence is 1.0, and all of these predictors are at .88 or above. With .2 as the minimum for reasonable independence, we have no concerns about this model in that respect. Similarly, for VIF, we are looking to avoid values of 5 and higher, and these are nowhere close. Contrary to our hypothetical reader’s concern, we have no multicollinearity problem here. The perceived measurement overlap between Clutch and ISO does not really exist; the two statistics measure very different things, and this model is worth exploring further.

Let’s try a different example that demonstrates an actual multicollinearity problem. This time, we’ll use Excel with the Real Statistics add-in and the following table, which attempts to predict weighted on-base average (wOBA). (These 2014 wOBA values from FanGraphs incorporate the derived coefficients described on their Guts! page.)

wOBA Predictors

Name               HR     BB%    ISO     BABIP   AVG     OBP     wOBA
Andrew McCutchen   25     13%    0.228   0.355   0.314   0.410   0.412
Zack Cozart         4      5%    0.079   0.255   0.221   0.268   0.254
Predictor No.       1      2     3       4       5       6
TOLERANCE          0.09   0.05   0.09    0.36    0.04    0.02
VIF                10.9   22.15  11.1    2.75    24.49   43.78

I’ve truncated the table for legibility, but the missing rows contain the remaining qualified hitters from the 2014 season, in between the best (McCutchen) and worst (Cozart). For all qualified hitters, I’ve chosen various statistics from the FanGraphs dashboard that address overlapping areas of batting offense. For example, home runs and walk percentage contribute substantially to OBP, and BABIP contributes heavily to batting average. Moving down the rows, I’ve numbered each predictor variable, and then used the “Tolerance” and “VIF” functions from the Real Statistics Excel package to calculate the value for each predictor column.

As you can see from the “Tolerance” row, we have a series of values that, with the exception of BABIP, are less than ideal as fellow predictors in the same model. The scores are consistently below the .20 threshold of Tolerance concern, and even below the .10 level. Similarly, with five being the level of concern for VIF and 10 being rejection-worthy, these same statistics are, as we would expect, unacceptable from that perspective too. This doesn’t mean we can’t regress these predictors against wOBA, but it does mean we should cull a few of them to reduce the overlap. The nice thing about doing so in Excel is that we can delete and restore columns as desired (through Excel’s Undo/Redo functions) and watch our Tolerance and VIF values move dynamically toward acceptable ranges. For those who prefer the visual layout of Excel, this is actually a great way to sift through predictors in a linear model before you begin the dirty work.
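For those working in R rather than Excel, the same sift can be scripted. Here is a rough sketch, not the exact steps used here: dat is a hypothetical data frame holding wOBA and the six candidate predictors, and we repeatedly drop the predictor with the worst VIF until everything clears the threshold of 5:

# assumes library(car) is loaded for vif()
fit <- lm(wOBA ~ ., data = dat)
v <- vif(fit)
while (length(v) > 2 && max(v) > 5) {
  worst <- names(which.max(v))        # predictor with the highest VIF
  dat <- dat[, names(dat) != worst]   # drop it from the data...
  fit <- lm(wOBA ~ ., data = dat)     # ...and refit the model
  v <- vif(fit)
}
round(1 / v, 2)  # Tolerance of the surviving predictors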

I’ve made a few changes, and let’s look at our values now:

Updated wOBA Predictors

Name               BB%    ISO    BABIP   wOBA
Andrew McCutchen   13%    .228   .355    .412
Zack Cozart         5%    .079   .255    .254
Predictor No.       1     2      3
TOLERANCE          .80    .81    .99
VIF                1.25   1.24   1.01

This is a much better look. We removed three columns that were duplicating other statistics—home runs, batting average, and on-base percentage. The Tolerance values for what remains—BB%, ISO, and BABIP—are all above .8 (with VIF values just over 1), suggesting excellent independence among these predictors. This is a regression worth running, and after running it, we can report to our readers that we have controlled for multicollinearity.

This brings us to our last topic: having settled on a method to diagnose and avoid multicollinearity, how do we communicate that success to the reader? As is standard practice with statistical significance, I recommend a combination of narrative explanation and a parenthetical providing the precise details for those who truly care.

For the original example in R (Clutch / ISO issue), we could say the following:

The predictor variables were verified to assure reasonable independence (tol. >= .88, VIF <= 1.14).

To explain why we had to redo the Excel model (wOBA predictors), we could say this:

We evaluated the predictor variables and found unacceptable overlap among all variables other than BABIP (tol. as low as .02, VIF as high as 43.78).

As the title of this article suggests, the concept of “tolerance” needs to extend beyond the statistical measure itself. In educating readers to be more tolerant of statistical baseball analysis, writers need to give them the reassurance that the research we are producing is meaningful and accurate. If researchers report their results accordingly, readers may increasingly tolerate the notion that researchers actually know what they are doing. Shall we see how it goes?


Jonathan Judge has a degree in piano performance, but is now a product liability lawyer. He has written for Disciples of Uecker and Baseball Prospectus. Follow him on Twitter @bachlaw.
14 Comments
Jim S
9 years ago

Nice article. Always happy to see some of the nuances of statistics explained well. Just a couple of minor quibbles from early in your article. Linear regression does not require that the outcome be a linear function of the predictors. The ‘linear’ part refers to the coefficients. That is, the outcome must be a linear function of the beta coefficients. You can include as many quadratic and cubic terms as predictors as you like. Second, the data does not need to be normally distributed. The error terms are often assumed to be normally distributed, but even that is not strictly necessary. Of course, if the data is not normal, there is a risk that the results will be subject to certain data points with high influence. But, I will leave that for you to explain in another article. Again, nice job.

Jonathan Judge
9 years ago
Reply to  Jim S

Jim, agreed with you on the first point. I was trying to explain things as basically as possible, but probably could have been more precise there. On the other aspects, I agree and probably should have mentioned that these assumptions need not be entirely correct either for the results to be meaningful. Thanks for reading and for the nice feedback.

Doug Lampert
9 years ago
Reply to  Jonathan Judge

Calculations of statistical significance usually do assume normality, even with data that blatantly is not normal (most baseball statistics, for instance, are bounded; BA must be in the [0.000, 1.000] interval and thus clearly isn’t truly normal). Statisticians seem willing to accept this in most cases, and in many cases even have proofs that with non-normal data the error will be in the “correct” direction (they’ll overstate the error and understate the significance).

On another note (for the readers; I assume you know): If you really want to include the difference between strongly correlated x1 and x2 as a factor, then you can construct an x2′, which tries to remove the influence of x1 from x2. I tend to think of ISO as exactly this sort of thing: you’re taking slugging and removing the BA component so the two won’t correlate as significantly, and you can then study “power” and “batting average” as two independent inputs rather than having two statistics so closely related that you can’t be sure which influence you’re seeing.

There’s a field called Principal Component Analysis (PCA) which finds “the best” independent and significant factors. Unfortunately, the statistics it finds don’t lend themselves to easy human narratives/understanding.

Marc Schneider
9 years ago

You lost me. When I saw the title, I thought this might be raising the idea of co-existence between fans who dig sabermetrics and those who don’t. But, no, this is just an arcane analysis of statistical methodology completely inaccessible to anyone but stat people (I avoided saying nerds because it’s not nice).

Jonathan Judge
9 years ago
Reply to  Marc Schneider

Marc — Indeed; I had some fun with the title. Glad you read, even if it wasn’t exactly what you were expecting.

Brad Johnson
9 years ago
Reply to  Marc Schneider

Actually, I was impressed with the accessibility. I remember scouring the internet back in my college days for a plain explanation of VIF (my notes just said to “use VIF” which meant nothing to me). Even now, a google search reveals mostly gibberish.

I think what you’re saying is you found the topic boring. The explanation was quite clear, but I can understand disinterest if you don’t work with baseball data on a regular basis.

Russell Carleton
9 years ago

This was wonderful.

Mr Punch
9 years ago

Non-baseball example of multicollinearity: Multivariate analysis of students’ performance appears to show that mothers’ educational attainment matters much more than fathers’ — but it turns out this is because fathers’ education correlates much more strongly with a third factor, household income.

studes
9 years ago

I loved the title too. BTW, I think OBP and SLG are good examples of multicollinearity. I don’t know how they meet your tolerance tests, but I’ve been in the middle of some recent debates in which people tried to suss out the two, and I’m not sure that’s really possible via linear regression.

Jonathan Judge
9 years ago
Reply to  studes

Studes, thanks very much.

The tolerance between OBP and SLG for qualified hitters in 2014 was .56. So, definitely some overlap, but they can still be considered together if desired. They are certainly more independent than OBP and wOBA for that group (tolerance: .23).

The scatterplot for the OBP vs. SLG for those same hitters looks very linear to me, but there certainly could be something going on between them within their respective components.

I discussed the essential predictors of OBP last week here: BB% + ISO + BABIP. http://disciplesofuecker.com/the-brewers-on-base-problem-and-solution-part-i/21680. That’s probably not much of a surprise to the initiated.

Jonathan Judge
9 years ago
Reply to  studes

Studes, the tolerance factor between OBP and SLG for qualified hitters is about ~.57 and that is consistent over the last few years. So, some overlap but more independent than not.

The relationship between OBP and SLG for qualified hitters over the past four years looks pretty linear to me (r-squared = .43, p < .001), except for some outliers at either end. You can get a slightly better fit to SLG if you insert a spline for OBP at .349 and above, but the benefit is very small and probably not worth the trouble.

Nathan Lazarus
9 years ago

Anyone else see Andrew McCutchen and Zack Cozart and think that it was A-Z by first names? I feel stupid. This will definitely be a great post to link to when explaining minimizing multicollinearity. Thanks for writing this and titling it so cleverly!

grf
9 years ago

I find this discussion of multicollinearity in a sabermetric context very interesting. As has been mentioned, there is also some appreciation of what it means when distributions violate the normality assumption. An assumption that I haven’t heard mentioned is that distributions are stationary. Is anyone aware of this having been addressed in a sabermetric context?

George
9 years ago

While I like VIF, I’ve come to rely on addressing multicollinearity BEFORE beginning the model. The way to do this is through a variable reduction procedure such as Proc Varclus, which selects the variable set maximizing orthogonality (is that a word?).

There’s a lot of room for growth in statistical methods in sabermetrics.