Measuring greatness (part 2)
by Mike Carminati
April 13, 2009
In part 1, we looked at two stats devised by Bill James, Win Shares and the Hall of Fame Monitor, and found that they generally hold up as standards for the Hall of Fame. As for the Black Ink Test and Similarity Scores, I think they are valuable, but there are ways to augment them. The Black Ink Test has an inherent era bias built into it, which James alluded to when he published his study: “Of course, it is harder to lead the league in multiple categories now, when there are fourteen [or 16] teams in a league, than it was in 1935, but hey, nothing’s perfect.” (p. 67)
Sure, nothing is perfect, but now that there are twice as many teams in the National League as there were prior to expansion, we can try to make the playing field a bit more even.
Note how the averages, even among the best candidates (based on Win Shares Grade), have been falling in the Black Ink Test since expansion:
|WS Grade||Decade||Black Ink Avg|
I propose that we weight the Black Ink Test to compensate for league size. However, if we base it simply on the number of teams, we end up crediting a league leader today twice as much as someone prior to expansion, which overly devalues the earlier players’ feats. Instead, we will weight the additional teams above eight by a factor of 0.5. (Also, given the disparity in player and team quality in the early days of the game, I have kept the weighting factor for 19th century league leaders at one. Otherwise, the handful of players who led the early leagues got too healthy a bump. Why should they benefit because the Altoona Mountain City club happened to field a team in their league for a month or so?)
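To make the weighting concrete, here is a minimal sketch of one way to read it. The exact formula is my assumption, not something spelled out above: I treat each team beyond eight as half a team and divide by the eight-team baseline. The function name, the parameter names, and the 1900 cutoff for "19th century" are all hypothetical.

```python
def league_size_weight(teams: int, year: int,
                       base_teams: int = 8,
                       extra_factor: float = 0.5) -> float:
    """Weight applied to one league-leading season.

    Assumed reading: teams beyond the eight-team baseline count at half
    value, and the total is scaled by the baseline. Seasons before 1900
    (an assumed cutoff) keep a weight of one.
    """
    if year < 1900:
        return 1.0
    extra_teams = max(teams - base_teams, 0)
    return 1.0 + extra_factor * extra_teams / base_teams


if __name__ == "__main__":
    for teams, year in [(8, 1935), (12, 1974), (16, 2008), (12, 1884)]:
        print(teams, year, league_size_weight(teams, year))
```

Under this reading a leader in a 16-team league gets a 1.5 multiplier rather than 2, which is the half-credit compromise described above.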
Also, reviewing the system James used to award points for individual stats, I had a few questions. First, why were OBP and OPS left off the list? Over the last 15 years, they have become probably the two most important common stats for measuring batter performance. Perhaps James felt that the voters were not so advanced as to understand the concepts. I will give him the benefit of the doubt and add them to the mix. Even the NFL Channel, er, I mean ESPN, uses them now.
Second, I think that the point system can be revamped so that the points assigned align more closely with the value of each stat, at least in the voters' eyes. I ran a correlation between getting into the Hall and the total number of times leading the league in a category. The numbers were pretty close to James’ as it turned out, but there were some minor changes. (Points for leading in batting average went from 4 to 3.5, RBI from 4 to 3.25, HR from 4 to 3, R from 3 to 3.5, H remained at 3, SLUG went from 3 to 3.25, 2B from 2 to 2.75, BB remained at 2, SB went from 2 to 1.5, G from 1 to 2.25, AB from 1 to 1.5, 3B from 1 to 2.25, OBP from 0 to 3.25, and OPS from 0 to 3.5.)
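For anyone who wants to play along at home, the two batting point tables can be written down and applied as in the sketch below. The point values are the ones listed above. The function name, the `(stat, weight)` input shape, and the sample player are hypothetical, and multiplying each title by a league-size weight is my assumption about how the pieces combine.

```python
# Points per league-leading season, as listed in the text.
ORIGINAL_POINTS = {
    "AVG": 4, "RBI": 4, "HR": 4, "R": 3, "H": 3, "SLG": 3, "2B": 2,
    "BB": 2, "SB": 2, "G": 1, "AB": 1, "3B": 1, "OBP": 0, "OPS": 0,
}
MODIFIED_POINTS = {
    "AVG": 3.5, "RBI": 3.25, "HR": 3, "R": 3.5, "H": 3, "SLG": 3.25,
    "2B": 2.75, "BB": 2, "SB": 1.5, "G": 2.25, "AB": 1.5, "3B": 2.25,
    "OBP": 3.25, "OPS": 3.5,
}


def black_ink_score(titles, points):
    """Sum the points for each league-leading season.

    titles: list of (stat, weight) pairs, one per title won, where
    weight is the league-size multiplier for that season (use 1.0 for
    the unweighted original test).
    """
    return sum(points[stat] * weight for stat, weight in titles)


if __name__ == "__main__":
    # Hypothetical player: two HR titles and one OBP title.
    unweighted = [("HR", 1.0), ("HR", 1.0), ("OBP", 1.0)]
    weighted = [("HR", 1.0), ("HR", 1.5), ("OBP", 1.5)]
    print(black_ink_score(unweighted, ORIGINAL_POINTS))
    print(black_ink_score(weighted, MODIFIED_POINTS))
```

The made-up player scores 8 on the original test (the OBP title counts for nothing) and 12.375 on the modified one.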
Here are the top 25 players for James’ Black Ink Test and for the Modified Black Ink Test:
|Black Ink Batting Leaders - Original||Black Ink Batting Leaders - Modified|
|Babe Ruth||158||Babe Ruth||223.10|
|Ty Cobb||146||Ty Cobb||201.25|
|Rogers Hornsby||128||Ted Williams||198.00|
|Ted Williams||126||Rogers Hornsby||195.25|
|Stan Musial||116||Barry Bonds||178.59|
|Honus Wagner||109||Stan Musial||178.35|
|Dan Brouthers||79||Honus Wagner||151.65|
|Hank Aaron||76||Dan Brouthers||130.70|
|Lou Gehrig||75||Mike Schmidt||115.31|
|Mike Schmidt||74||Lou Gehrig||108.65|
|Nap Lajoie||72||Alex Rodriguez||105.33|
|Barry Bonds||69||Pete Rose||101.97|
|Alex Rodriguez||68||Carl Yastrzemski||101.36|
|Pete Rose||64||Nap Lajoie||92.15|
|Jimmie Foxx||63||Hank Aaron||91.51|
|Mickey Mantle||62||Mickey Mantle||88.19|
|Harry Stovey||62||Wade Boggs||87.45|
|Chuck Klein||60||Jimmie Foxx||85.75|
|Ed Delahanty||59||Willie Mays||85.29|
|Ross Barnes||59||Ross Barnes||84.90|
|Willie Mays||57||Ed Delahanty||83.75|
|Carl Yastrzemski||55||George Brett||78.18|
|Tony Gwynn||53||Rod Carew||75.79|
|Ralph Kiner||52||Chuck Klein||73.65|
|Cap Anson||52||Rickey Henderson||73.29|
Next, I looked at the point system for the pitchers. Using the same method, I found new values for the various league-leading stats. In addition, I added Strikeouts-to-Walks ratio and WHIP (Walks plus Hits per Innings Pitched) to James’ league leaders to perform the evaluation. The results were a bit farther off from James’ than the batting stats were. Also, the results confirmed that WHIP correlated more closely with being a Hall of Famer than its component stats, Hits per Innings Pitched and Walks per Innings Pitched. (Points for leading in wins went from 4 to 4.75, ERA remained at 4, Strikeouts went from 4 to 4.25, Innings Pitched from 3 to 4.5, Winning Percentage from 3 to 2.75, Saves from 3 to 1.5, Complete Games from 2 to 4.25, Games Pitched from 1 to 2, Games Started from 1 to 3.25, and Shutouts from 1 to 4.25. WHIP, at 4.25, replaced BB per IP and H per IP, which would have remained at 2 points and gone from 2 to 3.25 points, respectively. Also, I hate when someone throws in some idiosyncratic stat, but given that relievers are so undervalued, I threw in a stat I developed based on Bill James’ reliever theories and using runs created adjusted by era. It is called Relief Win and had a value of 1.75.)
Here are the top 25 in Black Ink pitching based on the original and my modified formula.
|Black Ink Pitching Leaders - Original||Black Ink Pitching Leaders - Modified|
|Walter Johnson||150||Walter Johnson||264.25|
|Pete Alexander||126||Pete Alexander||201.75|
|Lefty Grove||111||Lefty Grove||187.50|
|Roger Clemens||103||Cy Young||182.00|
|Warren Spahn||101||Warren Spahn||166.50|
|Cy Young||100||Roger Clemens||157.50|
|Randy Johnson||99||Christy Mathewson||155.50|
|Bob Feller||98||Bob Feller||151.75|
|Christy Mathewson||92||Greg Maddux||143.25|
|Greg Maddux||87||Randy Johnson||130.75|
|Nolan Ryan||84||Sandy Koufax||121.25|
|Sandy Koufax||81||Dazzy Vance||119.50|
|Al Spalding||67||Ed Walsh||119.00|
|Ed Walsh||67||Robin Roberts||118.00|
|Dazzy Vance||66||Steve Carlton||104.25|
|Steve Carlton||66||John Clarkson||98.25|
|Joe McGinnity||64||Joe McGinnity||96.50|
|Robin Roberts||64||Al Spalding||95.00|
|Tom Seaver||60||Carl Hubbell||94.75|
|John Clarkson||60||Pedro Martinez||94.75|
|Pedro Martinez||58||Tom Seaver||89.75|
|Tim Keefe||58||Nolan Ryan||88.50|
|Amos Rusie||52||Curt Schilling||86.50|
|Dizzy Dean||52||Tim Keefe||81.50|
|Carl Hubbell||51||Dizzy Dean||80.25|
Using the average Hall of Famer as a guide, the following players who are not in the Hall would meet the original Black Ink Test:
For the most part, these are 19th century players and active players who are obvious Hall choices.
Now, these are the players that meet the modified Black Ink Test standard:
I cannot say that I am overly pleased by Juan Pierre making the list, but there are a number of undervalued expansion-era players who also make it. A number of players in the continuing purgatory of the writers' ballot (Parker, Mattingly, Murphy) make an appearance, as do some on the Vets list (Magee, Allen).
By the way, Rice now passes this test with a 49.57.
Next, we move to Similarity Scores. Of this method, James says, “A left-handed hitter will tend to be paired with another left-handed hitter. A player from the 1920s will tend to be paired with another player from the 1920s. A player who has post-playing career as a manager will tend to be paired with another player who was also a manager…Size—players are almost always paired with another player of the same size.” (p. 95)
That sounds like a great justification for a system, but it also might point to era biases within the system that pick players with similar demographics perhaps because those demographics directly impact the stats.
I propose that instead of comparing by raw stats, we compare based on stats weighted against era bias. Instead of looking at home runs hit, for example, we look at the number of home runs hit above expectation for a given era. Might we find similarities that are deeper and yet not apparent from James’ Similarity Scores?
I created a set of expected values per stat per league. Then I took the total plate appearances for each player per league and year and calculated the amount by which he exceeded or fell short of what was expected of the average player. In this way, Mike Schmidt leading the league with 36 home runs, which he did twice (1974 & 1984), stands out a bit more than the three players who hit 36 home runs in the American League in 1996 (Alex Rodriguez, Geronimo Berroa, and, really, Ed Sprague) and ended up tied for 13th place in the league behind Mark McGwire’s 52 bombs.
Finally, the amount above/below expected was prorated per plate appearance over the player’s career. In this way, Babe Ruth exceeded the expected number of home runs for his career by .057 per plate appearance. He is followed by Mark McGwire (.051 per PA), Ryan Howard (.045), Jimmie Foxx (.041), Dave Kingman (.040), Ralph Kiner and Hank Greenberg (.039), Lou Gehrig (.038), and finally rounding out the top ten: Barry Bonds, Harmon Killebrew, and Mike Schmidt (.036). Ed Sprague is 785th.
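Here is a rough sketch of that calculation for home runs. It assumes the expected total is simply the league home run rate per plate appearance times the player's plate appearances, which is my reading of the method rather than a confirmed formula. The `Season` class, its field names, and the league totals in the example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Season:
    pa: int         # player's plate appearances in this league-season
    hr: int         # player's home runs
    league_pa: int  # total plate appearances in the league
    league_hr: int  # total home runs in the league


def hr_above_expected_per_pa(seasons):
    """Career home runs above expectation, prorated per plate appearance."""
    total_pa = sum(s.pa for s in seasons)
    surplus = sum(
        # Expected HR = player's PA times the league HR rate (assumed).
        s.hr - s.pa * s.league_hr / s.league_pa
        for s in seasons
    )
    return surplus / total_pa


if __name__ == "__main__":
    # Two invented 36-homer seasons: one in a low-homer league,
    # one in a league that hit twice as many.
    career = [
        Season(pa=600, hr=36, league_pa=75000, league_hr=1200),
        Season(pa=600, hr=36, league_pa=75000, league_hr=2400),
    ]
    print(round(hr_above_expected_per_pa(career), 3))
```

The same 36 home runs are worth 26.4 above expectation in the first invented season but only 16.8 in the second, which is the whole point of the adjustment.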
I dropped James’ penalties for games played and at-bats (assuming a minimum of 1000 plate appearances). I compared runs, hits, doubles, triples, home runs, runs batted in, walks, strikeouts, stolen bases, batting average, on-base percentage, slugging, and OPS, and kept James’ defensive position penalty (though I have outfielders truly divided by actual position, not just lumped into one bucket as is done at Baseball Reference and may have been done in James’ original study).
The results are that the top-10 comps to Hank Aaron change from:
1. Willie Mays (783) *
2. Barry Bonds (I) (748)
3. Frank Robinson (667) *
4. Stan Musial (663) *
5. Babe Ruth (647) *
6. Carl Yastrzemski (627) *
7. Rafael Palmeiro (I) (611)
8. Mel Ott (601) *
9. Eddie Murray (588) *
10. Ken Griffey (I) (588)
to:
1. Joe DiMaggio (855) *
2. Willie Mays (810) *
3. Frank Robinson (765) *
4. Johnny Mize (761) *
5. Larry Walker (I) (754)
6. Vladimir Guerrero (I) (712)
7. Sam Thompson (663)
8. Chuck Klein (657) *
9. Harry Heilmann (635) *
10. Lip Pike (635)
I cannot say that I am entirely happy with either one of those lists. I think each has its pluses and minuses, but I am willing to follow the methodology and see what results we get, given the democratic mindset, statistically speaking, that we have adopted.
By the way, the players with the lowest scores for their best comps are Babe Ruth and Ted Williams (both at 419) and the ever-execrable Bill Bergen (566), who was arguably the worst Hall-eligible batter of all time (sorry, Steve Jeltz does not qualify): he of the .170 career average, .228 OBP, .232 slugging average, and park-adjusted OPS 53 points below the league average. By James’ system, Bergen has a number of comparable players because his stats, though abysmal, fall within the range of a number of lesser batters, mostly from pitchers’ eras.
So where does this all leave us? If we look at the three players that started this blogging bloviation, the three men who are about to enter the Hall—Rickey Henderson, Jim Rice, and Joe Gordon—does this new methodology help to make their cases (or lack thereof) more clearly? To quote the estimable Mr. Owl, let’s find out.
As I stated earlier, Henderson passed all four standards, Rice passed all but the Hall of Fame Standards test (missing by 7.1 points), and Gordon passed not a one.
Of our new tests, Henderson passes three of four: both Win Shares Above Baseline tests (210.35 WSAB) and the Modified Black Ink test (73.3). However, he fails the Modified Similarity Scores test. Just one similar batter, Max Carey, is in the Hall, and most of the rest are Deadball-era leadoff outfielders with decent speed and good OPS’s, but Henderson’s uniqueness might not be properly captured there. Henderson falls from 100 percent passing to 88 percent.
Rice passes just the Modified Black Ink test (49.6). He has 96.34 WSAB, so he fails both Win Shares tests. He has just one similar batter (Tony Perez) in the Hall. Rice falls from 75 percent to 50 percent, which seems to capture his borderline status properly.
Gordon keeps his perfect goose-egg streak going, going 0-for-4 in the new tests. He has just 80.67 WSAB, failing those two tests. He scores a 4.5 in the Modified Black Ink test. He has just one similar batter in the Hall (Bobby Doerr). Gordon remains at zero percent.
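The percentages for the three candidates are just the share of all eight tests passed, four original and four new. A quick sketch of the arithmetic, using the pass counts reported above (the function name is mine):

```python
def pass_rate(original_passed: int, new_passed: int,
              tests_per_set: int = 4) -> float:
    """Percentage of all tests passed across the original and new sets."""
    return 100 * (original_passed + new_passed) / (2 * tests_per_set)


if __name__ == "__main__":
    # (original tests passed, new tests passed), as reported in the text.
    candidates = {"Henderson": (4, 3), "Rice": (3, 1), "Gordon": (0, 0)}
    for name, (orig, new) in candidates.items():
        print(name, pass_rate(orig, new))
```

Henderson's seven of eight comes to 87.5 percent, which rounds to the 88 quoted above.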
I think that these new tests are in the spirit of James’ “democratic” view of Hall of Fame candidates. The Modified Similarity Scores might need a little more tweaking, and that is not surprising given the complexity and the scale of the calculations involved. At the other end of the spectrum, Win Shares Above Baseline is a vast improvement over the Fibonacci Win Scores that James originally proposed. In addition, having eight rather than four tests (five if you count the Fibonacci test) helps define the borderline cases more clearly, as Rice demonstrated.