The one about sample size

by Colin Wyers
June 4, 2009

Okay, first a status update. Due partially to other demands on my time and the sheer amount of work involved in some of what I’m currently doing with it, I don’t have anything worth showing off yet as an update to last week’s article on Simple Zone Rating. For those of you anticipating such, apologies.

Here’s a little diversion that should prove interesting. Sabermetricians often refer to sample size, and they’ve (we’ve?) gotten so insistent that you’ll hear fans on message boards and television announcers mentioning it. But what does it all mean? Let’s roll up our sleeves and get a little dirty.

True Score Theory

The basic principle of statistics underlying our concern with sample size is True Score Theory. It says that whenever we measure something, our measurement comes from two things: the value of what we are measuring and the amount of error in our measurement. The more often we measure something, the lower our measurement error and the more confident we can be that our measurement actually reflects what we’re measuring.

Be careful, though: there are two kinds of measurement error, random and systematic. Random error eventually will dissipate over enough observations. Systematic error (bias) will not; no matter how many observations we have, a systematic error will persist.
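To make the distinction concrete, here is a minimal sketch that averages repeated measurements with and without a constant bias. The true value, noise level and bias are made-up numbers chosen only for illustration, and the function name is hypothetical.

```python
import random


def average_measurement(true_value, noise_sd, bias, n, rng):
    """Average n measurements, each modeled as true value + random noise + constant bias."""
    total = 0.0
    for _ in range(n):
        total += true_value + rng.gauss(0.0, noise_sd) + bias
    return total / n


# Hypothetical numbers, not taken from any real data set.
TRUE_VALUE = 0.300
NOISE_SD = 0.100
BIAS = 0.020

rng = random.Random(2009)  # fixed seed so the run is repeatable
for n in (10, 100, 10_000):
    no_bias = average_measurement(TRUE_VALUE, NOISE_SD, 0.0, n, rng)
    with_bias = average_measurement(TRUE_VALUE, NOISE_SD, BIAS, n, rng)
    print(f"n={n:>6}  random error only: {no_bias:.4f}  with bias: {with_bias:.4f}")
```

As n grows, the first average should settle near the true value, while the second settles near the true value plus the bias, no matter how many measurements are taken.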

Random errors in measurement are often called noise. In baseball, they have acquired another name as well: “luck.” Luck is a loaded word that a lot of people have problems with. Since I do not get paid a commission for the use of the word luck, I see no reason to require its use and will avoid it here.

A little bit of math

There are some common tools used to evaluate things like sample size in statistical terms. We’ll start off with a few.

I’m sure you know what an average is; I just want to note that what is commonly called average is more precisely referred to as the “arithmetic mean” in formal terms.

There are several ways we can describe the distance between a value and the mean. For right now, we’ll stick with the standard deviation, also called a “sigma.” Here’s how it works:

Take each value and subtract the mean.
Square the result.
Take the average of the squares. (For those curious, at this point we’ve calculated the variance.)
Find the square root of the variance.

This puts a range on how far values are from the mean; in a normal distribution 68 percent of values are within one SD of the mean, and 95 percent of the values are within two SDs of the mean.
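The four steps above can be sketched in a few lines of Python. The function name and the sample values are made up for illustration; the `sample` flag is an addition that covers the adjustment described in the note at the end of this article (subtracting one from the denominator).

```python
import math


def standard_deviation(values, sample=False):
    """Standard deviation following the four steps in the text."""
    n = len(values)
    mean = sum(values) / n
    # Steps 1 and 2: subtract the mean from each value, then square the result.
    squared_diffs = [(v - mean) ** 2 for v in values]
    # Step 3: average the squares to get the variance.
    # For a sample standard deviation, divide by n - 1 instead of n.
    denominator = n - 1 if sample else n
    variance = sum(squared_diffs) / denominator
    # Step 4: take the square root of the variance.
    return math.sqrt(variance)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers with a mean of 5
print(standard_deviation(values))               # population SD
print(standard_deviation(values, sample=True))  # sample SD, slightly larger
```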

One more concept is correlation. Correlation measures how closely two variables follow a straight-line relationship; it equals the slope of the best-fit line when both variables are expressed in standard deviations from their own means. A correlation of 1 means that the two are perfectly positively correlated: When one rises by a standard deviation, the other rises by a standard deviation as well. A correlation of -1 means that the two are perfectly negatively correlated: When one rises by a standard deviation, the other falls by a standard deviation.

To break that down, a correlation of 0.2 (or -0.2) means the two are weakly related; a correlation of 0.7 (or -0.7) means the two are strongly related. There is no objective breakdown of what a “strong” or “weak” correlation is; there are guidelines one can use but they’re as much a matter of taste as anything else.

If we want to measure how well the true score is reflected by a measure, we can use correlation to test it. What we look at is how well a value correlates with itself. We can look at year-to-year correlation (how well a measure correlates between years for the same player), or intraclass correlation (how well a measure correlates between a player’s performance in split halves, like taking the even and odd numbered events and grouping them).
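Here is a rough simulation of the split-half idea, offered as a sketch rather than a real study. Every number in it is an assumption: each simulated player gets a true home run rate drawn from a made-up normal distribution (mean .030, SD .015), each plate appearance is an independent coin flip at that rate, and the odd and even numbered plate appearances are then correlated across players.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_half_correlation(n_players, pa_per_player, rng):
    """Correlate odd-numbered and even-numbered PA rates across simulated players."""
    odd_rates, even_rates = [], []
    for _ in range(n_players):
        # Hypothetical talent distribution, clipped so the rate stays a valid probability.
        true_rate = min(max(rng.gauss(0.030, 0.015), 0.0), 1.0)
        events = [1 if rng.random() < true_rate else 0 for _ in range(pa_per_player)]
        odd = events[0::2]   # 1st, 3rd, 5th, ... plate appearances
        even = events[1::2]  # 2nd, 4th, 6th, ... plate appearances
        odd_rates.append(sum(odd) / len(odd))
        even_rates.append(sum(even) / len(even))
    return pearson(odd_rates, even_rates)


rng = random.Random(42)  # fixed seed so the run is repeatable
for pa in (50, 500, 2000):
    r = split_half_correlation(300, pa, rng)
    print(f"{pa:>5} PA per player: split-half correlation = {r:.2f}")
```

Under these assumptions the correlation should climb as the number of plate appearances per player grows, because each half reflects more of the true rate and less of the noise.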

April is the cruelest month

This is why April is the cruelest month for a baseball statistician; we know a lot of things are going on that are interesting and exciting and meaningful, but we simply don’t have the tools to suss out what’s true and what’s simply noise. All we are really left to do is throw up our hands and say, “Call us in June and we’ll see what we can do.”

There are a few tools you can use, though, if you’re not particularly concerned about being correct. The biggest one is confirmation bias. In other words, a small sample of something is valid if it says what you were already thinking to begin with. This is true to the extent that you were correct to begin with; the additional “evidence” presented by a guy getting off to a hot or cold start to April doesn’t add much to your argument. (Now, of course, a good player is more likely to have a hot start and a bad player is more likely to have a cold one, but not to the extent that a hot or cold start can tell us who is a good or bad player.)

There is, of course, another issue, that of the magnitude: The hotter or colder the start, the more likely it is to be true and not noise. But—but!—there’s something we have to remember about our measurement of magnitude. Recall that standard deviation is the square root of variance. And our basic formula for a measurement:

Measurement = True + Random + Bias

And the more observations we have, the smaller the value of random should be. And as randomness increases or decreases, so does our measurement of distance between a value and the mean. To see what I mean, look at these standard deviations of home runs per plate appearances, 1993-2008, grouped by number of plate appearances:

MIN_PA	MAX_PA	SD
0	9	0.040
10	19	0.028
20	29	0.021
30	39	0.020
40	49	0.022
50	59	0.020
60	69	0.017
70	79	0.017
80	89	0.016
90	99	0.014
100	109	0.016

Note the right-hand column: The standard deviation goes down with plate appearances. (There is still some “noise” there which could be smoothed out; consider this an illustration, rather than an actual solution.) So for someone with 100 to 109 PAs, a home run rate of .08 above average (in other words, about the rate Barry Bonds hit home runs in 2001) is five standard deviations away from the mean (.08 divided by .016). We should expect to see that in only one out of every 1,744,278 cases, assuming home run rates are normally distributed. But for a player with nine or fewer plate appearances, a home run rate of .08 above average is only two standard deviations away from the mean (.08 divided by .040), which we should expect to see in about one in every 22 cases.
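Those “one in N” figures can be reproduced from the normal distribution. The sketch below assumes, as the article does, that rates are normally distributed, and it counts both tails (values that far above or below the mean). The function names are made up for illustration.

```python
import math


def two_tailed_probability(z):
    """Chance that a normally distributed value lands at least z SDs from the mean, in either direction."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def one_in(z):
    """Express the two-tailed probability as 'one in N cases'."""
    return 1.0 / two_tailed_probability(z)


GAP = 0.08  # home run rate above average, from the text
for label, sd in (("100-109 PA", 0.016), ("0-9 PA", 0.040)):
    z = GAP / sd  # distance from the mean in standard deviations
    print(f"{label}: {z:.1f} SDs from the mean, about one in {one_in(z):,.0f} cases")
```

This should come out to roughly one in 1.7 million for five standard deviations and roughly one in 22 for two.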

So for an observation to count as extreme in a small sample, it has to be more distant from the mean than it would need to be in a larger sample. This is especially important to bear in mind when dealing with splits data: batting in certain lineup spots, for instance, or batter versus pitcher matchups.
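One way to see why the spread shrinks is to treat each plate appearance as a coin flip. Under that assumption (a binomial model, which is not something the table itself tests), the variance of observed rates is approximately the variance of true rates plus p(1 - p)/n, where p is the average rate and n is the number of plate appearances. The talent spread and average rate below are hypothetical, so the output illustrates the shape of the curve and is not a fit to the table.

```python
import math


def observed_sd(pa, mean_rate, talent_sd):
    """Approximate SD of observed rates: true-talent spread plus binomial noise."""
    noise_variance = mean_rate * (1.0 - mean_rate) / pa
    return math.sqrt(talent_sd ** 2 + noise_variance)


MEAN_RATE = 0.030  # hypothetical average home run rate per PA
TALENT_SD = 0.012  # hypothetical spread in true home run rates

for pa in (5, 25, 55, 105, 600):
    print(f"{pa:>4} PA: expected SD of observed rates = {observed_sd(pa, MEAN_RATE, TALENT_SD):.3f}")
```

The noise term fades as plate appearances pile up, and the observed spread approaches the spread in true talent.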

And the sky full of stars

Okay, but what if we find something dramatic – something three or four standard deviations away from the mean? That doesn’t tell us anything unless we know how many cases are under observation. From 1993 to 2008, there were:

3,487 hitters
2,267 pitchers
611,547 unique batter-pitcher match-ups
14,676 player seasons for hitters (10,079 excluding pitchers hitting)
61,673 player months for hitters (48,138 excluding pitchers hitting)

Especially once you start splitting the data extremely fine, you should expect to see a lot of things beyond three standard deviations. The more specific the split, the more extreme cases you should expect to see.
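To put rough numbers on that, the sketch below multiplies each of the counts above by the chance of landing three or more standard deviations from the mean. It assumes a normal distribution and independent cases, which is a simplification; the function name is made up for illustration.

```python
import math


def expected_beyond(n_cases, z):
    """Expected number of cases at least z SDs from the mean (either direction), assuming normality."""
    return n_cases * math.erfc(z / math.sqrt(2.0))


# Counts for 1993-2008, taken from the list above.
counts = {
    "hitters": 3_487,
    "pitchers": 2_267,
    "batter-pitcher match-ups": 611_547,
    "player seasons for hitters": 14_676,
    "player months for hitters": 61_673,
}

for label, n in counts.items():
    print(f"{label}: about {expected_beyond(n, 3.0):,.0f} cases beyond three SDs")
```

Even though any single three-SD result is rare, the finest splits should produce hundreds or thousands of them by chance alone.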

Regression to the mean

As our number of observations increases, the noise goes down, and observations tend to become closer to the center of the distribution. That’s called “regression to the mean.” How much regression should we expect?

That depends on how much noise we pick up with our observations. We can measure that with our correlations, either year-to-year, intraclass, or some other way of testing self against self. The higher the correlation, the less regression to the mean we need to apply.
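One common way to turn a correlation into an estimate, sketched here as an assumption and not as the method of any particular analyst, is to keep a fraction r of the gap between the observation and the mean, where r is the self-correlation. The rates and correlations below are hypothetical.

```python
def regress_to_mean(observed, mean, correlation):
    """Shrink an observation toward the mean, keeping a fraction equal to the correlation."""
    return mean + correlation * (observed - mean)


LEAGUE_RATE = 0.030  # hypothetical league-average home run rate per PA
HOT_START = 0.080    # hypothetical observed rate over a small sample

for r in (0.1, 0.5, 0.9):
    estimate = regress_to_mean(HOT_START, LEAGUE_RATE, r)
    print(f"correlation {r:.1f}: estimated true rate = {estimate:.3f}")
```

With a correlation of 1 the observation is taken at face value; with a correlation of 0 the estimate is simply the mean.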

But the best answer is to simply use more data. Why should we regress Albert Pujols’ April stats to the mean? We have more than 5,600 PAs that tell us that Pujols is a very good hitter; we should deny ourselves the advantage of all that extra data only when we have a very, very good reason to suspect it doesn’t matter.

References & Resources
The procedure for calculating standard deviation in the article is for population standard deviation; all standard deviations listed are actually sample standard deviations. Simply subtract one from the denominator when figuring the average to get sample standard deviation.

Sal tells us how to regress to the mean. So does Eli. Pizza Cutter gives us correlations for a lot of common batter and pitcher stats.

For the purposes of the article, I’ve assumed a normal distribution. In the past I’ve written several articles expressing some reservations about that notion, although that shouldn’t affect the broader points.
