Who was better: Walter Johnson or Roger Clemens?

Johnson won 417 games; Clemens 348. Johnson had a 146 ERA+; Clemens 144. Johnson played from 1907 to 1927; Clemens from 1984 to present.

Is that last fact relevant? I would say, extremely.

How can we compare Johnson, who pitched in an eight-team league that excluded blacks, Asians, and South and Central Americans, to Clemens, who has played in a league twice as large, and faced Hall of Famers like Eddie Murray, Kirby Puckett, and Dave Winfield, none of whom would have even had a shot at the major leagues when Johnson was pitching?

In *The Hardball Times Annual 2007*, I tried to answer just that question by comparing the number of players in the major leagues over time to the population supplying baseball players to professional baseball. Basically, my thought was that the larger the pool of *potential* baseball players compared to the actual number that make it, the higher the quality of competition will be. Why is that?

Let’s say that we need 750 players to build a full major league. Let’s say that we decide to place one thousand randomly selected people on an island, and start a 30-team league on that island. How great will the quality of competition be? Obviously, not very high. Now let’s say another thousand people move to that island. The games will get better, because the best players among those thousand will now be signed to the major leagues, while the worst of the original major league players will drop out.
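For readers who like to see it in code, the island thought experiment can be sketched as a small simulation. This is only an illustration under assumptions that are not from the *Annual*: talent is drawn from a standard normal distribution, the league simply signs the 750 most talented residents, and the names `sign_best` and `ROSTER_SPOTS` are made up for the example.

```python
import random
import statistics

ROSTER_SPOTS = 750  # 30 teams x 25 players, as in the island example


def sign_best(residents):
    """Return the talent scores of the 750 best residents."""
    return sorted(residents, reverse=True)[:ROSTER_SPOTS]


rng = random.Random(2007)  # fixed seed so the run is repeatable

# Assumption: talent is standard normal. Purely illustrative.
residents = [rng.gauss(0.0, 1.0) for _ in range(1_000)]
league_before = sign_best(residents)

# Another thousand people move to the island.
residents += [rng.gauss(0.0, 1.0) for _ in range(1_000)]
league_after = sign_best(residents)

print("mean talent, 1,000 residents:", round(statistics.mean(league_before), 2))
print("mean talent, 2,000 residents:", round(statistics.mean(league_after), 2))
```

Because the second league is picked from a pool that contains the first, its average talent cannot go down; in practice it rises noticeably, since most of the original bottom-tier players are replaced.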

The effect can be shown on a more mathematical level, but I think it is obvious enough anyway. In the *Annual*, I showed that the ratio of players to the baseball-playing population has decreased steadily throughout the 20th century, and then with some mathematical trickery, built a timeline adjustment.

To be honest, I selected that method because it had not been done before, and I was curious what kind of results it would produce. But it is certainly not the only way to adjust for era; there are at least three distinct methods with which I am familiar.

The first comes from Bill James, in the *New Historical Baseball Abstract*. In the Bob Lemon comment, James lays out a list of “about a dozen” (actually 16) indicators to evaluate league quality, including hitting by pitchers, fielding percentage, and the average distance from .500. “From 1876 to the present,” he writes, “all of these indicia, without exception, have advanced steadily.”

In *Baseball’s All-Time Best Sluggers*, Michael Schell takes an idea from Stephen Jay Gould to adjust for era. In an essay titled, “Why No One Hits .400 Anymore,” Gould argued that the .400 hitter has disappeared because of rising quality of competition. His thought was that as league quality increases, the standard deviation, or variability, in statistics decreases.

Again, think about the thousand-person island. Of those 1,000 people, maybe a dozen are high school baseball players, a few dozen little leaguers, a couple hundred are of baseball age but no longer play baseball, and the rest don’t care for the sport at all. With that kind of distribution, how do you think the high school players will hit? .900? What about the worst players? They probably won’t hit at all.

Now what if another thousand people come to the island? Well now, instead of competing against 80-year-olds and people with no interest in the sport, the high school players are playing pretty much only against people with some baseball experience. They’re still probably better than anyone else on the island, but maybe they hit only .400 or .500, and maybe the worst players hit .100 instead of .000. The gap between players has gone down as the quality of players has gone up. The smaller that gap becomes, the harder it becomes to hit for an extraordinary average; if it is hard enough, it may become nearly impossible to hit .400.
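The shrinking gap can be checked with a rough simulation of the same island. As before, this is a sketch built on assumptions of my own choosing (standard normal talent, a league that signs the best 750 residents, the hypothetical helper `signed_players`), not a reproduction of Gould’s or Schell’s calculations.

```python
import random
import statistics

ROSTER_SPOTS = 750  # size of the island league


def signed_players(residents):
    """Talent scores of the 750 best residents."""
    return sorted(residents, reverse=True)[:ROSTER_SPOTS]


rng = random.Random(400)  # fixed seed so the run is repeatable

# Assumption: talent is standard normal. Purely illustrative.
residents = [rng.gauss(0.0, 1.0) for _ in range(1_000)]
spread_before = statistics.pstdev(signed_players(residents))

# A second thousand arrives; the league re-signs the best 750 overall.
residents += [rng.gauss(0.0, 1.0) for _ in range(1_000)]
spread_after = statistics.pstdev(signed_players(residents))

print("talent spread with 1,000 residents:", round(spread_before, 2))
print("talent spread with 2,000 residents:", round(spread_after, 2))
```

The league drawn from the larger pool is a thinner slice of the right-hand tail, so the standard deviation of talent among its players comes out smaller, which is the pattern Gould described.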

Gould (and Schell) found that the standard deviation of every event in baseball has indeed decreased as time has gone on. But, as Schell points out, this method does not necessarily control for quality of competition, for it assumes that a 95th percentile player in 1907 is equal to a 95th percentile player in 2007, which might not be true. Today’s 95th percentile player could simply be better, and we’ll later look into whether or not that is the case.

But today, we will examine a third method, one introduced by Dick Cramer, and used extensively by Clay Davenport of Baseball Prospectus. That method is based on a simple premise: if we compare how players perform in one season with how they performed the previous year, we can tell whether the league has gotten better or worse based on how their performance has changed. In other words, if the players are collectively worse the next year, it must mean they are playing against better competition; if they are better, the competition must be worse.

Using this method, Davenport has found that the quality of competition has indeed increased over time. In fact, it has increased so much that based on Davenport’s research, Nate Silver writes in *Baseball Between the Numbers* that today, Honus Wagner would be a “good-field, no-hit shortstop,” and Babe Ruth “merely a good player.”

Why is that? Let’s take a look, using the following methodology: We will remove pitchers, and then look at each hitter who had at least one plate appearance in consecutive seasons. We will weight his numbers in those seasons by the lesser number of plate appearances in those two seasons, and we will use weighted on-base average (wOBA) as our statistic of choice because it is all-encompassing and easy to use.
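The matched-seasons comparison described above might look something like the following. It is a minimal sketch rather than the code behind the numbers in this article: the `Season` record, the function name `difficulty_ratio`, and the three sample hitters are all invented for illustration, and the input is assumed to have pitchers already removed.

```python
from collections import namedtuple

# Hypothetical record type: one hitter's line in a single season.
Season = namedtuple("Season", ["player", "year", "pa", "woba"])


def difficulty_ratio(seasons, year):
    """Weighted wOBA in `year + 1` divided by weighted wOBA in `year`.

    Only hitters who batted in both seasons count, and each hitter is
    weighted by the lesser of his two plate-appearance totals.
    """
    by_key = {(s.player, s.year): s for s in seasons}
    first_sum = second_sum = 0.0
    for (player, yr), first in by_key.items():
        if yr != year:
            continue
        second = by_key.get((player, year + 1))
        if second is None:
            continue  # did not bat in the following season
        weight = min(first.pa, second.pa)
        first_sum += weight * first.woba
        second_sum += weight * second.woba
    if first_sum == 0:
        raise ValueError("no hitters appear in both seasons")
    return second_sum / first_sum


# Invented sample data, not real player lines.
sample = [
    Season("A", 1907, 600, 0.400), Season("A", 1908, 500, 0.380),
    Season("B", 1907, 200, 0.300), Season("B", 1908, 400, 0.300),
    Season("C", 1907, 450, 0.350),  # no 1908 season, so he is skipped
]
ratio = difficulty_ratio(sample, 1907)
print(round(ratio, 4))
```

A ratio below 1 means the matched hitters did worse in the second season, which this method reads as a tougher league.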

If we look at each season in baseball history from 1871 to 2005, and set the quality of competition in 2005 to “1,” here is what the evolution of the game has looked like:

Put in words, this graph shows that a baseball player in 1871 would have seen his wOBA reduced 70% in 2005! It’s no wonder, then, that Silver concludes that Wagner would have been a replacement-level hitter today. Excepting the drop during World War II, baseball players have been getting better year after year after year.

But is this the case? Can we really say that Wagner was no better than Neifi Perez? I say no, and surprisingly, the math is on my side.

The issue with such a measuring scheme is that it does not take into account an important mathematical phenomenon: regression to the mean. What regression to the mean tells us is that players’ performances tend to drift back toward the average; a .330 hitter one year will probably hit closer to .300 the next, while a .230 hitter will probably hit around .250.

But what happens in baseball? The .330 hitter indeed hits .300 the next season, but the .230 hitter either goes to the bench or to the minor leagues. In other words, it is an almost guaranteed fact that hitters, as a group, will perform worse in one season than they did the year before, regardless of league quality!
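A toy simulation makes the bias visible. Everything in it is an assumption chosen for illustration: the talent and luck distributions, the hard .300 cutoff standing in for "goes to the bench or to the minor leagues," and the constant names. The key point is that every hitter has the same true talent in both years and the league never changes.

```python
import random
import statistics

rng = random.Random(1908)  # fixed seed so the run is repeatable

# All four constants are assumed values, chosen only for illustration.
TALENT_MEAN = 0.320
TALENT_SD = 0.025
LUCK_SD = 0.030
CUTOFF = 0.300  # hitters below this in year one lose their jobs

year_one, year_two = [], []
for _ in range(5_000):
    talent = rng.gauss(TALENT_MEAN, TALENT_SD)
    first = talent + rng.gauss(0.0, LUCK_SD)
    second = talent + rng.gauss(0.0, LUCK_SD)  # same talent, same league
    if first >= CUTOFF:  # only these hitters show up in both seasons
        year_one.append(first)
        year_two.append(second)

ratio = statistics.mean(year_two) / statistics.mean(year_one)
print("year-two wOBA relative to year one:", round(ratio, 3))
```

The survivors were selected partly on good luck in year one, and that luck does not carry over, so the ratio lands below 1 even though league quality is identical by construction.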

Luckily, we can correct for that issue by incorporating regression to the mean. If we regress each hitter’s wOBA in both years, the difficulty adjustment changes remarkably:

(A note for the math-inclined types: At first, I believed the correct method was to regress only the first year, but after some experimentation I have realized that is incorrect. The reason is that the player’s plate appearances in year n+1 are not just dependent on his play in the first year, but also the second. For the non-math types, it’s okay to start reading again.)
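The article does not spell out the regression formula, so the following is one common way to do it rather than the exact calculation used here: add a fixed number of league-average plate appearances to each hitter’s line. The function name and the `ballast_pa` value of 200 are hypothetical placeholders.

```python
def regress_to_mean(observed, pa, league_mean, ballast_pa=200):
    """Shrink an observed rate toward the league mean.

    `ballast_pa` is an assumed constant: the number of league-average
    plate appearances blended in. Larger values regress more heavily.
    """
    return (observed * pa + league_mean * ballast_pa) / (pa + ballast_pa)


# The .330 and .230 hitters from earlier, with an assumed .270 league mean
# and 600 plate appearances apiece.
hot_hitter = regress_to_mean(0.330, 600, 0.270)
cold_hitter = regress_to_mean(0.230, 600, 0.270)
print(round(hot_hitter, 3), round(cold_hitter, 3))
```

Under this sketch, both the year-n and year-n+1 lines would be passed through `regress_to_mean` before the matched-seasons comparison is made, in keeping with the note above about regressing both years.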

The difference between the two methods is huge! In 1908, Wagner posted a .446 wOBA (adjusted so that league average is .316). Using the first method, his wOBA would drop to .196, which is simply awful. The latter method, however, adjusts it to .339, nowhere near as great as Wagner had looked before, but still well above average.

(Actually, what we want to use is an additive approach instead of multiplicative. That would adjust Wagner’s wOBA to .269 using the first method, and to .370 using the second. The first method still pins Wagner as a below-replacement player, while the latter method says he’s an All-Star.)
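To make the two approaches concrete, here is a short worked example. The quality factors are not published anywhere in this article; they are back-solved from the multiplicative results quoted above (.196 and .339), and the additive formula is an assumed reading that happens to reproduce the .269 and .370 figures. The function names are hypothetical.

```python
WAGNER_1908 = 0.446  # Wagner's wOBA, from the text
LEAGUE_AVG = 0.316   # league-average wOBA, from the text


def multiplicative(woba, quality):
    """Scale the hitter's whole wOBA by the quality factor."""
    return woba * quality


def additive(woba, quality, league_avg=LEAGUE_AVG):
    """Subtract the drop that a league-average hitter would suffer.

    Assumed formula: it matches the article's figures but is not stated there.
    """
    return woba - league_avg * (1.0 - quality)


# Implied quality factors, back-solved from the quoted results.
quality_first = 0.196 / WAGNER_1908   # unregressed method
quality_second = 0.339 / WAGNER_1908  # regressed method

wagner_first = additive(WAGNER_1908, quality_first)
wagner_second = additive(WAGNER_1908, quality_second)
print(round(wagner_first, 3), round(wagner_second, 3))
```

The additive version docks every hitter the same number of points, so a great hitter keeps more of his edge over the league than he does when his whole line is scaled down.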

In fact, the latter method still shows league quality getting better over time; it just shows a much less pronounced rise. By 1925, the average player was 80% as good as today’s average player; by 1973, he was 90% as good. By contrast, the first method says that the average player was not 80% as good as today’s until 1982, and not 90% as good until 1994.

Clearly, we can objectively adjust for quality of competition, and we now know how. What’s left is to actually do it and observe the impact. Next week, we will look at individual league adjustments and witness the impact that adjusting for quality of competition has on historical player ratings.