Ten Things I Learned From A Book
by Dave Studeman
May 05, 2005
Michael Schell, a professor of Biostatistics at the University of North Carolina, has published a book ranking the best batters of all time. His previous book, Baseball's All-Time Best Hitters, was a detailed mathematical analysis of which hitters hit best for average, given their leagues, ballpark and general comparison to peers. This book, entitled Baseball's All-Time Best Sluggers, is the same sort of analysis, but Schell now ranks the best overall batters of all time.
I've spent the last week perusing and digesting Schell's tome, and I must say I have thoroughly enjoyed it. There is a lot to learn from this book.
I have three confessions to make, however:
- I didn't read his previous book, so I may be missing some of the nuance of his analysis.
- I'm not a mathematician, so I can't present a valid critique of his methodology.
- I don't really care how all the batters in baseball history rank.
But to derive his rankings, Schell undertook a very thorough mathematical analysis of many trends over the history of major league baseball, and he freely shares the results of his analyses in the book. And that makes it a goldmine to an avid baseball sponge like me.
Schell based his approach on one simple premise:
After adjusting for ballpark effects, an nth percentile player in one year is equal in ability to an nth percentile player in another year for each basic offensive event.
In a very small nutshell, here is what Schell did:
- He classified all batters as regulars, non-regulars and pitchers, based upon the playing time patterns in vogue during each player's career.
- He established average levels of production for all regular players for a number of batting events, including singles, doubles, triples, home runs, walks, strikeouts, stolen bases, yada yada, as well as runs and RBI's and BA, OBP and SLG.
- He compared each player's stats against the average levels of production during that player's playing career.
- He adjusted each player's stats by the ballpark the player played in, which means that he derived ballpark adjustments for each type of batting event throughout baseball history.
- He calculated the distribution pattern of each batting event around the mean, and adjusted the distributions to more closely resemble normal distributions. I don't believe any baseball analyst has done this before, but it is an important step.
- He adjusted for aging patterns, so that late-career declines don't diminish the player's standing.
- He normalized these rates to the most stable period in baseball history, the National League from 1977-1992.
- He weighted each event, using a linear weights approach, and then added up the adjusted totals to establish each batter's single season and career totals, which he calls Career Batting Rating (CBR). This is the basis of his batter rankings.
- As an added bonus, he compared each player to the average CBR of all those who played that player's position during the time in which he played, to present a position-adjusted CBR ranking.
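For the code-minded, here is a rough sketch of the core idea behind those steps: figure out how far a player stood from his peers, then re-express that standing on the scale of a reference period. This is not Schell's actual procedure. The function name `translate_rate`, the sample rates and the reference mean and spread are all made up for illustration, and the sketch skips the park, aging and distribution adjustments entirely. Note that matching standard scores is the same as matching percentiles only when both distributions are roughly normal, which may be why the normalizing step matters.

```python
from statistics import mean, pstdev


def translate_rate(rate, era_rates, ref_mean, ref_sd):
    """Re-express one player's rate on the scale of a reference period.

    era_rates: rates of all regulars in the player's own season (hypothetical).
    ref_mean, ref_sd: mean and spread in the reference period (hypothetical).
    """
    era_mean = mean(era_rates)
    era_sd = pstdev(era_rates)
    # How many standard deviations above or below his peers was the player?
    z = (rate - era_mean) / era_sd
    # Place him the same number of standard deviations from the reference mean.
    return ref_mean + z * ref_sd


# Hypothetical home-run rates (HR per at bat) for the regulars in one season.
season_rates = [0.010, 0.020, 0.030, 0.040, 0.050]
adjusted = translate_rate(0.050, season_rates, ref_mean=0.025, ref_sd=0.012)
print(round(adjusted, 4))
```

A player who is exactly average in his own season comes out exactly average in the reference period, whatever the two spreads are.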
You might say that, through his analysis, Schell has compiled the definitive list of well-adjusted batters. Or you might not like really funny puns. Whatever, here are some of the fascinating things I encountered while reading his book:
There have been six fundamental offensive eras in baseball since 1901.
Everyone has their own approach to defining baseball's eras. Schell based his upon something called piecewise linear regression, which identifies the most meaningful changes in the rate of various batting events. And that's all the explanation you're going to get from me.
The eras are:
- Deadball Era (1901-1919)
- Lively Ball Era (1920-1946)
- Post-World War II Era (1947-1962)
- Big Strike Zone Era (1963-1968)
- Designated Hitter Era (1969-1992)
- Power Era (1993-2003+)
The relative batting prowess of the average centerfielder has been generally declining since the beginning of Major League baseball.
Schell presents a table of the average seasonal batting record by position. Many of these positions follow a pattern I would have expected, except for centerfield. From 1876 to 1910, the average centerfielder was 5.8 batting runs above the league average. That figure rose to 8.5 in the 1910's, then started a relatively steady decline. There was a slight increase to 3.1 during the glory years of the 50's, but then the decline resumed. From 1993 to 2003, the average centerfielder was 2.6 runs BELOW the league average.
Whatever happened to the great centerfielders?
Bill Nicholson was as good a home run hitter as Reggie Jackson.
Talk about the power of adjustments... I admit that I had to look up Bill Nicholson in Baseball Reference. Nicholson hit 235 home runs in 5,546 at bats (21 per 500 at bats) from 1936 to 1953. For comparison, Reggie Jackson hit 563 home runs in 9,864 at bats (28 per 500 at bats).
However, Nicholson played in Wrigley Field (a terrible park for lefthanded home run hitters at the time) and his prime years were the war years of 1943 and 1944. He led the NL with 29 home runs in 1943, which doesn't sound like a lot until you look up the runner-up, who hit 18. In 1944, he again led the league with 33 home runs, and the next highest total was 26.
Once you apply Schell's adjustments, both Nicholson and Jackson averaged 32 home runs per 500 at bats. Because Jackson played longer than Nicholson, he is sixth on the all-time adjusted home run hitter list. Nicholson is 53rd.
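The raw rates quoted above are simple arithmetic, and a few lines of Python will check them. This covers only the unadjusted numbers; the function name is mine, and nothing here reproduces Schell's adjustments, which are what lift both players to 32.

```python
def per_500_at_bats(home_runs, at_bats):
    """Home runs scaled to a 500 at-bat season."""
    return home_runs / at_bats * 500


# Career totals as quoted in the article.
nicholson = per_500_at_bats(235, 5546)
jackson = per_500_at_bats(563, 9864)
print(f"Nicholson: {nicholson:.1f}, Jackson: {jackson:.1f}")
```

The results come out to about 21.2 and 28.5, in line with the figures quoted above.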
How every major league park affected every batting event.
Most baseball analysts (myself included) use general park factors, based on runs scored, to evaluate hitters and pitchers in different ballparks. If we want to add more detail, we might throw in home run park factors.
In order to properly evaluate each batter's effectiveness, Schell took things further by analyzing the impact of each ballpark on each type of batting event listed above. Because this data doesn't exist for much of baseball history, he computed these park effects in a roundabout manner. Someone more mathematical than I might take issue with his approach, but I'll leave that to them.
Taking it one step further, he identified major changes in each ballpark's impact by using something called "multiple changepoint regression." This means that he not only calculated, say, the impact of Comiskey Park on triples throughout its history, but he also identified that from 1934-1935, Comiskey's triple park factor was 69, though it was 137 from 1974 to 1990. Stuff like that. And he listed this info in all its glorious detail in an appendix, so you can see it for yourself.
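To get a feel for what a park factor of 69 or 137 does to a season total, here is a minimal sketch. It rests on assumptions that may not match the book: that 100 means a neutral park, that the factor describes the home park only, and that a player gets half his chances at home. The function name `neutralize` and the ten-triple season are hypothetical.

```python
def neutralize(count, park_factor, home_share=0.5):
    """Estimate what a season total might look like in a neutral park.

    park_factor: 100 is neutral; 69 means the park yields 69% of the
    neutral rate for that event (assumed definition).
    home_share: assumed fraction of the player's chances that came at home.
    """
    # Blend the home park's effect with neutral conditions on the road.
    effective = home_share * (park_factor / 100) + (1 - home_share)
    return count / effective


# A hypothetical ten-triple season under each Comiskey factor quoted above.
print(round(neutralize(10, 69), 1))
print(round(neutralize(10, 137), 1))
```

Under these assumptions, ten triples in the 69 environment are worth more than ten in a neutral park, and ten triples in the 137 environment are worth fewer.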
Some tidbits: the most "stable" park environment of recent times is Shea Stadium's, which has not significantly changed for ANY batting event since 1969. In other words, Shea is the most boring stadium in baseball today. The third-worst batting average environment ever is Dodger Stadium's from 1994 to present. And the all-time best ballpark for triples is the Bank One Ballpark in Phoenix.
Remember, when he says "best" or "worst", Schell is referring to figures that have been adjusted for each era.
The player whose ballpark most hurt his home run totals was George Brett.
Among the many great lists in the book are lists of the home run hitters most helped and hurt by their ballparks. Atop the list for those most hurt by their ballparks are George Brett, Roberto Clemente and Bill Nicholson. Those most helped by their ballpark were Mel Ott (by a wide margin), Ken Williams and Johnny Mize.
With those additional home runs, Brett would rank 58th on the all-time home run list, instead of his current 90th spot. Brett ranks as the 30th best batter of all time.
Home runs aren't the best predictors of runs scored or RBI's: Doubles and Triples are.
Schell examined the correlation between batting events and Runs/RBI's, and found that he could achieve an "R Squared" of .98 or better by regressing the events against Runs/RBI's. That's amazingly good. Along the way, he found out two other things:
- There are three eras in which the weights assigned to each event should change. They are 1893-1919, 1920-1946 and 1947-2003. You might say that these are the three most fundamental eras in baseball history.
- Across all eras, you can achieve an R Squared of about .90 by just regressing doubles/triples against runs or RBI's.
The all-time best doubles/triples hitters were Stan Musial and Honus Wagner.
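If you're wondering what an R Squared actually measures, it is the share of the variation in one number (runs, say) that a fitted line through another number (doubles plus triples, say) accounts for. Here is a minimal sketch with a single predictor and invented team totals. Schell's regressions used multiple batting events and real data, so treat this only as an illustration of the statistic itself.

```python
def r_squared(xs, ys):
    """R Squared of a one-variable least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after the fitted line, versus total variation.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical doubles-plus-triples and runs for five team seasons.
extra_base = [250, 270, 290, 310, 330]
runs = [640, 700, 730, 790, 820]
print(round(r_squared(extra_base, runs), 3))
```

A value of 1 would mean the line explains everything and 0 would mean it explains nothing.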
The standard deviation of Runs Scored and RBI's has been higher in the National League ever since the 1950's.
Standard deviation, roughly the typical distance between a regular player and the league average in each batting category, is very important in Schell's analysis -- one of his chapters documents the performance spread of each batting event in each league. The Bill Nicholson case is a classic example; the typical performance spread, or standard deviation, among home run hitters was .065 to .070 during World War II, but it was between .05 and .06 every year thereafter.
The performance spread for many batting events has declined over the years, meaning that each category has gotten more competitive. However, the performance spread for stolen bases and triples has increased since the early years of baseball, as those skills now seem to fall only to a chosen few.
But when I looked at the graphs in the chapter (one reason I like the book is that it has LOTS of graphs), I was struck by a fundamental difference between the leagues. The AL's performance spread has been 12% lower than the NL's in runs scored since 1957, and 10% lower in RBI's since 1954. In other words, the NL's leaders in these two categories tend to stand out from the pack by a wider margin. The DH partly explains this difference, but there must be something else going on. Schell hypothesizes that the difference lies in league differences of style; this could be a fascinating study.
Too bad Ichiro didn't start his American major league career earlier.
According to Schell's analysis, the all-time best hitter for average is Tony Gwynn, with Ty Cobb coming in second and Rod Carew third. This is the same ranking he derived in his previous book -- the biggest difference is that Stan Musial is now fourth on the list, whereas he was 8th previously.
For adjusted batting average, Ichiro's 2004 ranks fourth on the all-time list and his 2001 ranks 21st. Even with his late start, Ichiro projects to land in the top 100 batters of all time if he follows a typical aging pattern.
Shawon Dunston was the most impatient batter in baseball history.
Schell leaves no stone unturned. In one of his chapters, he compares each player's walk rate to his expected walk rate, given the era in which he played. Shawon Dunston wins the crown of most impatient batter, with a ratio of 0.3 walks to expected walks. The most patient batter of all time is Max Bishop, with a ratio of 2.1.
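The ratio itself is easy to picture. A minimal sketch, assuming "expected walks" means the league's walk rate applied to the player's own plate appearances (the book may define it differently), with hypothetical numbers:

```python
def walk_ratio(walks, plate_appearances, league_walk_rate):
    """Actual walks divided by the walks an average batter would draw."""
    expected = league_walk_rate * plate_appearances
    return walks / expected


# Hypothetical: 15 walks in 500 plate appearances, in a league that
# walks in 9% of plate appearances.
print(round(walk_ratio(15, 500, 0.09), 2))
```

A ratio of 1.0 is an average batter; that hypothetical hitter, at about a third of his expected walks, is in Dunston territory.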
The toughest player to strike out was Tony Gwynn, who had a 0.3 ratio of strikeouts to expected strikeouts. The easiest player to strike out was Gary Pettis.
For those of you wondering, Ozzie Guillen ranks among the top ten most impatient batters. Which means Chicago was indeed the Windy City for the ten years they both played there.
By the way, the best batting eye of all time, according to Schell's rankings, belonged to Yankee third baseman Joe Sewell. In 1932, Sewell walked 56 times but only struck out three times. Think about that.
Luis Aparicio was the second-greatest basestealer of all time.
While we're on the subject of stolen bases, Luis Aparicio deserves a mention. Bill James has written about this before, but Aparicio was a great basestealer for his time, underappreciated because his stolen base totals appear low compared to later totals.
But Aparicio led the American League in stolen bases nine straight years, and Schell's rankings place him second all-time behind Rickey Henderson.
At his current rate, Albert Pujols will be the eighth-greatest batter of all time.
Just behind Willie Mays and Lou Gehrig, and just ahead of Stan Musial and Ty Cobb.
I know. That was eleven things I learned from the book. Which tells you what a rich book it is.
Now, there are definitely issues to pick on in Schell's approach, and he admits to many of them himself. For instance:
- His approach to standard deviation probably isn't the right answer in the long term, because it doesn't capture the general increase in overall skill of batters, but no other methodology suggests itself.
- In his assessment of the number of regulars per team, why doesn't the DH have more of an impact?
- The aging adjustment is kind of funky.
- He devotes a chapter to clutch hitting which, frankly, feels pretty weak to me.
- It actually might make more sense to assign linear weights within the context of the playing environment, using something like Base Runs, instead of normalizing everything to 1977-1992 and then applying the linear weights adjustment.
However, given how much rich information there is in this book, I'd recommend it to any mathematically inclined baseball junkie, and I plan to put it on my bookshelf next to Curveball by Albert and Bennett, and the classic The Hidden Game of Baseball by Palmer and Thorn.
References and Resources
Nothing to do with baseball, but I thought you'd like this list of the Top Ten Things Learned by the Hubble Telescope.
Dave was called a "national treasure" by Rob Neyer. Seriously. Comments about this article can be sent to him through the miracle of e-mail.