Season similarity scores

by Zach Waters
November 21, 2008

In 2001, Ichiro Suzuki showed America a new style of baseball, a speedy, high-average, low-walk style of play not seen since…

Al Wingo?

Wingo had an interesting career. An alumnus of Oglethorpe University, he appeared in 15 games for Philadelphia as a 21-year-old in 1919. He acquitted himself well, with an OPS+ of 115, but he would make his next appearances as an outfielder in 78 games for Detroit five years later. As a 27-year-old in 130 games, the most playing time he would ever see, he batted .370/.456/.527, good for an OPS+ of 150 and a 12th-place MVP finish.

Together, those three seasons make Wingo into Ichiro’s most similar age-27 player on the incomparable Baseball Reference. The problem is, the two players aren’t particularly similar:

Player         From  To   Yrs   G  AB   R   H  2B  3B  HR RBI  BB  SO   BA  OBP  SLG  SB  CS  OPS+
Ichiro Suzuki  2001-2001    1 157 692 127 242  34   8   8  69  30  53 .350 .381 .457  56  14   126
Al Wingo       1919-1925    3 223 649 134 224  47  15   6  96  94  56 .345 .428 .492  16  18   136

In a season’s worth of at bats, Ichiro had 18 more hits, 13 fewer doubles, seven fewer triples, 64 fewer walks and 40 more stolen bases. The two players aren’t actually similar at all.

The alert reader probably saw this coming. Ichiro, after all, was an internationally famous superstar the first day he stepped onto a major league playing field. That day occurred when he was 27, because he had spent much of the previous decade playing in Japan. Paradoxically, any player truly comparable to Ichiro should have so much playing time by age 27 that his career numbers wouldn’t be comparable to Ichiro’s at all. It’s not really a coincidence that his most similar player at age 27 was an outfielder who had a career year in his first real opportunity for playing time at precisely the correct age.

And yet, the question remains. When was the last time America saw a player similar to the 2001 Ichiro? Has any player ever had a truly similar year?

Deconstructing Bill

Similarity scores were introduced by Bill James in The Politics of Glory, a book examining the Hall of Fame selection process. James sought to bring order to a common Hall of Fame argument: If Player A is similar to Player B, who is in the Hall of Fame, then Player A should also be elected. In a characteristically insightful approach, James realized that what was needed was a way to fairly compare a player to every other player, find the most similar players, and describe how similar they were. If you can say that Player A is similar to Players B, C, D and E, all of whom are in the Hall of Fame, you’re starting to make a very strong case for Player A’s election.

Aside from their original purpose, Similarity Scores give an element of vivid detail to baseball statistics. Whenever I want to learn about a player I’ve never heard of, the first thing I do is look at his list of most similar players. Finding someone I already know about makes the player I’m investigating come to life. The point of re-examining Similarity Scores isn’t that the current system doesn’t work; it’s that the idea of Similarity Scores is such a good one that it’s worth improving as much as we can.

As employed on Baseball-Reference.com, similarity scores are calculated by starting at 1,000 points and subtracting…
One point for each difference of 20 games played.
One point for each difference of 75 at bats.
One point for each difference of 10 runs scored.
One point for each difference of 15 hits.
One point for each difference of 5 doubles.
One point for each difference of 4 triples.
One point for each difference of 2 home runs.
One point for each difference of 10 RBI.
One point for each difference of 25 walks.
One point for each difference of 150 strikeouts.
One point for each difference of 20 stolen bases.
One point for each difference of .001 in batting average.
One point for each difference of .002 in slugging percentage.

In addition, there’s a positional adjustment to account for players who spent their careers at different positions. In this essay, I will focus on batting similarity scores only.
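As a rough sketch, the point schedule above can be written as a function that adds up the points separating two batting lines. The dictionary keys and the function name are illustrative labels rather than anything from James or Baseball-Reference, the positional adjustment is left out, and the sketch assumes points accrue fractionally (a 13-double gap costs 2.6 points) rather than in whole steps. The two batting lines are the Ichiro and Wingo rows from the table near the top of the article.

```python
# Size of the difference that costs one point, per the list above.
# Key names are illustrative labels, not an official schema.
POINT_STEPS = {
    "G": 20, "AB": 75, "R": 10, "H": 15, "2B": 5, "3B": 4, "HR": 2,
    "RBI": 10, "BB": 25, "SO": 150, "SB": 20, "BA": 0.001, "SLG": 0.002,
}


def similarity_points(a, b):
    """Total points separating two batting lines.

    Assumption: points are fractional, and no positional adjustment
    is applied. A traditional score would be 1000 minus this total.
    """
    return sum(abs(a[k] - b[k]) / step for k, step in POINT_STEPS.items())


# Rows copied from the Ichiro/Wingo comparison table in the article.
ichiro = {"G": 157, "AB": 692, "R": 127, "H": 242, "2B": 34, "3B": 8,
          "HR": 8, "RBI": 69, "BB": 30, "SO": 53, "SB": 56,
          "BA": 0.350, "SLG": 0.457}
wingo = {"G": 223, "AB": 649, "R": 134, "H": 224, "2B": 47, "3B": 15,
         "HR": 6, "RBI": 96, "BB": 94, "SO": 56, "SB": 16,
         "BA": 0.345, "SLG": 0.492}

print(round(similarity_points(ichiro, wingo), 1))
```

Under these assumptions the two lines come out roughly 41 points apart, and more than half of that total comes from the batting average and slugging percentage terms alone.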

So what happens if we use James’ system, but look at individual seasons instead of entire careers? That’s easy enough to program. Instead of starting at 1,000 and subtracting points, though, let’s calculate a “similarity distance” by starting at zero and adding points according to James’ system. If we do this, the 10 most similar seasons to Ichiro’s 2001 are:

First     Last       Year   BB   1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns   SSDist
Ichiro    Suzuki     2001   30  192   34    8    8  56   450  .350  .377  .457   46.380     0.0
Sam       Rice       1930   55  158   35   13    1  13   386  .349  .404  .457   38.820    14.3
Eddie     Collins    1924   89  154   27    7    6  42   362  .349  .439  .455   57.320    16.4
Jack      Glasscock  1889   31  155   40    3    7  57   377  .352  .385  .467   46.430    18.0
Bill      Terry      1934   60  169   30    6    8   0   389  .354  .412  .464   39.230    19.4
Sam       Rice       1925   37  182   31   13    1  26   422  .350  .385  .442   35.580    19.9
Rod       Carew      1973   62  156   30   11    6  41   377  .350  .413  .471   51.850    20.0
Buddy     Myer       1935   96  163   36   11    5   7   401  .349  .437  .468   53.200    21.4
Eddie     Collins    1913   85  145   23   13    3  55   350  .345  .435  .453   58.010    22.2
Carson    Bigbee     1922   56  166   29   15    5  24   399  .350  .405  .471   45.930    22.3
Charlie   Jamieson   1923   80  172   36   12    2  18   422  .345  .417  .447   46.880    22.4

Kind of an unexciting list, isn’t it? None of these seasons leaps out as a great match for Ichiro. Looking through the list, we see that Ichiro stole 56 bases in 2001. In what’s supposedly the most comparable season in history, Sam Rice stole only 13, and drew 55 walks compared to Ichiro’s 30! Looking through these columns, we can see that there’s a lot of variation in every column except two: All of the top 10 seasons are near-perfect matches in batting average and slugging percentage.

The problem here is that Similarity Scores are designed to compare long careers to one another—the kind of careers that might make it into a discussion about the Hall of Fame. For a career like that, it might be reasonable to give the same number of points to a single point of batting average as you do to five doubles. But over one season, five doubles is a lot, and a single point of batting average is nothing. For finding similar seasons, James’ system is unbalanced toward batting average and slugging percentage. It almost always will find seasons which are perfect matches in these categories, with large variations in all other criteria. If we want to devise a similarity score that works well for a single season, we’ll have to do something new.

What’s the point?

If we want to devise a new system for similarity scores, we have to look at the idea of a point with a critical eye. In the last section, we saw that James’ point system becomes unbalanced if we dramatically alter the length of the periods we’re comparing. Presumably, the same kinds of problems would arise if we were trying to find comparable players to a player whose career was very short. Can we develop a system that works for any length of time?

Another aspect to consider is that the field of sabermetrics is a lot larger now than it was when James devised his original system. There are many more people using sabermetrics to answer many more questions. It would be nice to have a system that connects to the rest of what we know about sabermetrics. It’s a less obvious problem than having bad matches for single seasons, but I have to admit that, as long as I’ve enjoyed using them, I have no idea what a point means in a Similarity Score. Are Similarity Scores consistent with the rest of sabermetrics?

At the level of a single batting event, they don’t match up very well. Using Pete Palmer’s linear weights formula, the runs created by a particular batter can be estimated by

Linear Weights runs = .47*1B + .78*2B + 1.09*3B + 1.4*HR + .33*(BB+HB) + .3*SB - .52*CS - .26*(AB-H-GIDP) - .72*GIDP

We can now take the ratio of the run value of a single (.47 runs) and a double (.78 runs) to find that a single is roughly 60 percent as valuable as a double. But in James’ Similarity Scores formula, a single counts for 1/15 of a point, while a double counts for four times as much (1/5 point as an extra double plus 1/15 point as an extra hit).

Let’s make a table of the relative weights for the different batting events as compared to a single in Linear Weights vs Similarity scores:

Event           LW             SS
1B              1.00           1.00
2B              1.66           4.00
3B              2.31           4.75
HR              2.97           8.50
SB              0.64           0.75
CS              1.11           NA
GIDP            1.53           NA
out             0.55           0.2 (one plate appearance without a hit)

Pretty bad agreement!

Oddly, things become better if we compare the Similarity Scores ratio to the square of the linear weights ratio.

Event           LW^2           SS
1B              1.00           1.00
2B              2.76           4.00
3B              5.33           4.75
HR              8.82           8.50
SB              0.41           0.75
CS              1.23           NA
GIDP            2.34           NA
out             0.30           0.2

The agreement here is much better, but still not great. It appears that Similarity Scores match up with the square of the linear weights run value of particular offensive events. I suspect a lot of the disagreement comes from a desire on James’ part for a system that could be worked out easily by hand. In this age of ubiquitous computing, that’s no longer an important consideration.
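The two comparison tables can be reproduced with a few lines of code. The sketch below follows the same accounting used in the text: an extra hit costs 1/15 of a point, an extra-base hit adds its own category on top of that, and an out is counted as one extra at bat. The variable and function names are illustrative, and the printed values may differ from the tables in the last digit because of rounding.

```python
# Linear weights run values for the events compared in the tables.
LW = {"1B": 0.47, "2B": 0.78, "3B": 1.09, "HR": 1.40, "SB": 0.30, "out": 0.26}

# Similarity Score points charged for one extra event of each kind.
SS = {
    "1B": 1 / 15,           # one extra hit
    "2B": 1 / 5 + 1 / 15,   # one extra double plus one extra hit
    "3B": 1 / 4 + 1 / 15,   # one extra triple plus one extra hit
    "HR": 1 / 2 + 1 / 15,   # one extra home run plus one extra hit
    "SB": 1 / 20,           # one extra stolen base
    "out": 1 / 75,          # one extra at bat without a hit
}


def relative_to_single(weights):
    """Express every weight as a multiple of the weight of a single."""
    return {event: w / weights["1B"] for event, w in weights.items()}


lw_ratio = relative_to_single(LW)
ss_ratio = relative_to_single(SS)

print(f"{'Event':6s}{'LW':>7s}{'LW^2':>7s}{'SS':>7s}")
for event in LW:
    lw = lw_ratio[event]
    print(f"{event:6s}{lw:7.2f}{lw ** 2:7.2f}{ss_ratio[event]:7.2f}")
```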

Run distance

We would like to construct a new system of Similarity Scores that weights different offensive events in a way that is consistent with Linear Weights. Ideally, we would like this system to be easily adjustable to the different offensive contexts seen at different points in the history of baseball. Fortunately, nothing could be easier if we treat each season as a point in a space with one dimension per offensive category. To find the distance between two points, you simply take the square of the difference in each dimension, add them up, and take the square root.

The only catch is that we have to use the same units for distance in every dimension we use in the calculation. It doesn’t make any sense to add inches to seconds, even if inches are a perfectly reasonable unit of distance in space and seconds a perfectly reasonable unit of distance in time. Similarly, it doesn’t really make any sense to add singles in one dimension to doubles in another. We’d like to use some common system of units in which both a single and a double can be expressed in a meaningful way. This is exactly what Linear Weights does.
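Written out, the distance described above takes the following form. The symbols are my own notation rather than anything from the original formula: a_i and b_i are the two seasons' counts in category i (walks, singles, doubles and so on), and w_i is the linear weights run value of one event in that category, so every term inside the sum is measured in runs.

```latex
d(A, B) = \sqrt{ \sum_{i} \bigl( w_i \, (a_i - b_i) \bigr)^2 }
```

Because each difference is squared, the sign of a weight does not matter: an out counts against a hitter, but a gap of 10 outs contributes the same 2.6 runs of distance in either direction.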

If we use Linear Weights to convert Ichiro’s 2001 season to the number of runs he contributed with singles, doubles, etc., we find that he produced…

30*.33=9.9 runs from walks
192*.47=90.2 runs from singles 
34*.78=26.5 runs from doubles
8*1.09=8.72 runs from triples
8*1.4=11.2 runs from home runs
56*.3=16.8 runs from stolen bases

On the negative side, he lost…

14*.52=7.28 runs from being caught stealing
53*.26=13.78 runs from strikeouts
3*.72=2.16 runs from grounding into double plays
394*.26=102 runs from all other outs

It’s now easy to calculate a “run distance” using the distance formula given above. Because strikeouts, caught stealing, and grounding into double plays were not official statistics for all of baseball history, we’ll leave those categories out of the calculation. Although it would be easy to adjust for different levels of scoring in different years, at the moment we’ll just use the linear weights formula given above.
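Here is a minimal sketch of that calculation. It assumes seven categories (walks, singles, doubles, triples, home runs, stolen bases, and outs counted as at bats minus hits), which is my reading of the description above; the function and variable names are illustrative. The batting lines are copied from the tables in this article.

```python
import math

# Linear weights run values; "Outs" is AB - H, valued at .26 runs each.
RUN_VALUES = {"BB": 0.33, "1B": 0.47, "2B": 0.78, "3B": 1.09,
              "HR": 1.40, "SB": 0.30, "Outs": 0.26}


def run_distance(a, b):
    """Euclidean distance between two seasons, measured in runs."""
    return math.sqrt(sum((w * (a[c] - b[c])) ** 2
                         for c, w in RUN_VALUES.items()))


def nearest(target, seasons, n=10):
    """Return the n (label, distance) pairs closest to the target season."""
    ranked = sorted(((label, run_distance(target, line))
                     for label, line in seasons.items()),
                    key=lambda pair: pair[1])
    return ranked[:n]


ichiro_2001 = {"BB": 30, "1B": 192, "2B": 34, "3B": 8, "HR": 8,
               "SB": 56, "Outs": 450}
candidates = {
    "Willie Wilson 1980": {"BB": 28, "1B": 184, "2B": 28, "3B": 15,
                           "HR": 3, "SB": 79, "Outs": 475},
    "Juan Pierre 2004": {"BB": 45, "1B": 184, "2B": 22, "3B": 12,
                         "HR": 3, "SB": 45, "Outs": 457},
    "Ralph Garr 1971": {"BB": 30, "1B": 180, "2B": 24, "3B": 6,
                        "HR": 9, "SB": 30, "Outs": 420},
}

for label, dist in nearest(ichiro_2001, candidates):
    print(f"{label:20s}{dist:6.1f}")
```

With these three candidates the sketch ranks Pierre, Garr and Wilson in that order, at distances of about 14.4, 14.9 and 15.3 runs.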

We can now search for the player seasons with the smallest distance from Ichiro, as measured in runs. The new top 10 seasons are:

First     Last       Year   BB   1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns  SSDist
Ichiro    Suzuki     2001   30  192   34    8    8  56   450  .350  .377  .457    46.4     0.0
Juan      Pierre     2004   45  184   22   12    3  45   457  .326  .368  .407    30.5    14.4
Ichiro    Suzuki     2006   49  186   20    9    9  45   471  .322  .367  .416    32.6    14.5
Ralph     Garr       1971   30  180   24    6    9  30   420  .343  .372  .441    32.2    14.9
Willie    Wilson     1980   28  184   28   15    3  79   475  .326  .352  .421    38.3    15.3
Richie    Ashburn    1951   50  181   31    5    4  29   422  .344  .391  .426    35.8    15.4
Steve     Sax        1989   52  171   26    3    5  43   446  .315  .366  .387    25.0    15.9
Sam       Rice       1920   39  170   29    9    3  63   413  .338  .377  .428    40.9    16.7
Matty     Alou       1969   42  183   41    6    1  22   467  .331  .369  .411    25.0    17.0
Sam       Rice       1925   37  182   31   13    1  26   422  .350  .385  .442    35.6    17.1
Frankie   Frisch     1923   46  169   32   10   12  29   418  .348  .392  .485    47.3    17.8

These seasons match up much better with Ichiro’s 2001 than the earlier list. Ichiro himself even appears in a later incarnation. It’s a little distressing that eight of the 10 players drew more walks than Ichiro did in 2001 (only Ralph Garr and Willie Wilson did not), but that’s due more to Ichiro’s own unusualness than anything else: there just aren’t many 242-hit, 30-walk seasons to choose from. A related issue is that nearly all of these comparable seasons are distinctly worse than Ichiro’s 2001 by linear weights runs, with Frankie Frisch’s 1923 the lone exception; this is more a list of “poor man’s Ichiro” seasons than true equals to Ichiro’s 2001.

Barry Bonds also had an unusual season in 2001:

First     Last       Year  BB    1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns  SSDist
Barry     Bonds      2001  177   49   32    2   73  13   320  .328  .510  .863   131.5     0.0
Mark      McGwire    1998  162   61   21    0   70   1   357  .299  .468  .753   104.0    16.1
Mark      McGwire    1999  133   58   21    1   65   0   376  .278  .425  .697    81.9    25.6
Babe      Ruth       1920  150   73   36    9   54  14   285  .376  .531  .849   127.4    32.6
Babe      Ruth       1927  137   95   29    8   60   7   348  .356  .486  .772   116.8    32.8
Babe      Ruth       1921  145   85   44   16   59  17   336  .378  .510  .846   139.9    33.5
Sammy     Sosa       2001  116   86   34    5   64   0   388  .328  .440  .737    99.4    34.8
Babe      Ruth       1928  137   82   29    8   54   4   363  .323  .461  .709    97.5    36.1
Mark      McGwire    1996  116   59   21    0   52   0   291  .312  .460  .731    79.5    38.0
Jim       Thome      2002  122   73   19    2   52   1   334  .304  .445  .677    77.8    38.1
Hank      Greenberg  1938  119   90   23    4   58   7   381  .315  .436  .684    88.1    38.6

No surprises here. There are no particularly good matches to a 73-home run season. Let’s look at some particularly famous or unusual seasons.

Babe Ruth’s 1927 is surprisingly untouched by the steroid era:

First     Last       Year   BB   1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns  SSDist
Babe      Ruth       1927  137   95   29    8   60   7   348  .356  .486  .772   116.8     0.0
Babe      Ruth       1928  137   82   29    8   54   4   363  .323  .461  .709    97.5    11.1
Hank      Greenberg  1938  119   90   23    4   58   7   381  .315  .436  .684    88.1    12.8
Jimmie    Foxx       1932  116  113   33    9   58   3   372  .364  .469  .749   112.3    13.4
Mickey    Mantle     1961  126   87   16    6   54  12   351  .317  .452  .687    89.4    14.4
Sammy     Sosa       2001  116   86   34    5   64   0   388  .328  .440  .737    99.4    15.4
Ralph     Kiner      1949  117   92   19    5   54   6   379  .310  .431  .658    81.0    15.9
Babe      Ruth       1921  145   85   44   16   59  17   336  .378  .510  .846   139.9    16.2
Babe      Ruth       1930  136  100   28    9   49  10   332  .359  .492  .732   108.8    16.2
Mickey    Mantle     1956  112  109   22    5   52  10   345  .353  .465  .705    96.9    16.7
Hack      Wilson     1930  105  111   35    6   56   3   377  .356  .454  .723   101.9    16.9

Ted Williams 1941: the last .400 season

First     Last       Year   BB   1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns  SSDist
Ted       Williams   1941  147  112   33    3   37   2   271  .406  .551  .735   112.1     0.0
Mickey    Mantle     1957  146  105   28    6   34  16   301  .365  .515  .665   100.1    11.5
Ted       Williams   1957  119   96   28    1   38   0   257  .388  .523  .731    93.7    13.3
Ted       Williams   1942  145  111   34    5   36   3   336  .356  .496  .648    95.9    17.1
Babe      Ruth       1926  144  102   30    5   47  11   311  .372  .513  .737   112.6    18.6
Babe      Ruth       1932  130   97   13    5   41   2   301  .341  .487  .661    83.8    20.5
Ted       Williams   1946  156   93   37    8   38   0   338  .342  .496  .667    98.1    20.8
Babe      Ruth       1924  142  108   39    7   46   9   329  .378  .510  .739   117.2    20.9
Ted       Williams   1954  136   80   23    1   29   0   253  .345  .515  .635    76.3    21.3
Babe      Ruth       1923  170  106   45   13   41  17   317  .393  .542  .764   135.3    21.6
Jason     Giambi     2000  137   97   29    1   43   2   340  .333  .475  .647    86.9    21.6

Rickey Henderson 1982, 130 stolen bases:

First     Last       Year   BB   1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns  SSDist
Rickey    Henderson  1982  116  105   24    4   10 130   393  .267  .397  .383    61.5     0.0
Rickey    Henderson  1983  103  109   25    7    9 108   363  .292  .411  .421    63.0    11.8
Arlie     Latham     1891   74  108   20   10    7  87   388  .272  .361  .387    36.7    20.8
Jim       Fogarty    1887   82   83   26   12    8 102   366  .261  .366  .410    46.1    21.0
Hugh      Nicol      1887   86   81   18    2    1 138   373  .215  .335  .267    28.5    21.1
Rickey    Henderson  1980  117  144   22    4    9 100   412  .303  .418  .399    63.3    21.1
Rickey    Henderson  1988   82  131   30    2    6  93   385  .305  .395  .399    50.4    21.5
Billy     Hamilton   1889   87  129   17   12    3 111   373  .302  .399  .395    56.2    21.9
Hub       Collins    1890   85  100   32    7    3  85   368  .278  .382  .386    41.7    21.9
Tommy     Harper     1969   95  105   10    2    9  73   411  .235  .350  .311    18.3    22.1
Rickey    Henderson  1998  118   97   16    1   14  66   414  .236  .373  .347    29.9    22.2

I’ll bet you didn’t have Arlie Latham in the office pool. Hugh Nicol is a surprisingly good match in everything but home runs.

George Brett’s .390 season in 1980:

First     Last       Year   BB   1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns  SSDist
George    Brett      1980   58  109   33    9   24  15   274  .390  .460  .664    72.8     0.0
Harry     Heilmann   1922   58  104   27   10   21   8   293  .356  .429  .598    55.6     8.7
Bill      Dickey     1936   46   97   26    8   22   0   270  .362  .424  .617    50.4    10.4
Joe       DiMaggio   1939   52  108   32    6   30   3   286  .381  .444  .671    68.0    10.4
Goose     Goslin     1928   48  110   36   10   17  16   283  .379  .439  .614    61.5    10.9
Rogers    Hornsby    1923   55  104   32   10   17   3   261  .384  .455  .627    59.7    11.4
Mickey    Cochrane   1931   56  106   31    6   17   2   299  .349  .419  .553    45.7    13.0
Hal       Trosky     1939   52   90   31    4   25   2   298  .335  .404  .589    46.1    13.1
Mike      Sweeney    2002   61  104   31    1   24   9   311  .340  .415  .563    49.7    13.4
Moises    Alou       1994   42   85   31    5   22   7   279  .339  .399  .592    43.8    13.9
Rico      Carty      1964   43   96   28    4   22   1   305  .330  .388  .554    37.3    14.0

I’m fascinated that Mike Sweeney, universally described as the best hitter the Royals have had since George Brett, turned in a season so similar to Brett’s magnum opus.

For a season heavy in doubles, let’s take Todd Helton’s 2000:

First     Last       Year   BB   1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns  SSDist
Todd      Helton     2000  103  113   59    2   42   5   364  .372  .467  .698   101.0     0.0
Carlos    Delgado    2000  123   97   57    1   41   0   373  .345  .461  .664    92.2    10.7
Albert    Pujols     2003   79  117   51    1   43   5   379  .359  .434  .667    85.1    11.1
Hank      Greenberg  1940   93   96   50    8   41   6   378  .340  .432  .670    84.5    13.5
Frank     Thomas     2000  112  104   44    0   43   1   391  .328  .437  .625    79.0    14.9
Derrek    Lee        2005   85  100   50    3   46  15   395  .335  .418  .662    83.5    15.1
Albert    Pujols     2004   84   97   51    2   46   5   396  .331  .414  .657    78.2    15.3
Frank     Robinson   1962   76  116   51    2   39  18   401  .342  .415  .624    77.3    15.7
Lance     Berkman    2001   92   97   55    5   34   7   386  .331  .423  .621    73.6    15.8
Todd      Helton     2001   98   92   54    2   49   7   390  .336  .431  .685    89.2    16.0
Todd      Helton     2003  111  122   49    5   33   0   374  .359  .461  .630    86.6    16.3

Surprising how many of those seasons came in a five-year window, isn’t it?

Alex Rodriguez’s best home run year brings back some memories for Seattle fans:

First     Last       Year   BB   1B   2B   3B   HR   SB  Outs   BA   OBP   SLG   LWRuns  SSDist
Alex      Rodriguez  2002   87  101   27    2   57   9   437  .300  .385  .623    68.3     0.0
Ken       Griffey    1997   76   92   34    3   56  15   423  .304  .382  .646    71.0     9.0
Ken       Griffey    1998   76   88   33    3   56  20   453  .284  .361  .611    62.1    10.2
Sammy     Sosa       1999   78   91   24    2   63   7   445  .288  .367  .635    64.0    10.6
Alex      Rodriguez  2001   75  114   34    1   52  18   431  .318  .390  .622    72.1    12.0
Johnny    Mize       1947   74   98   26    2   51   2   409  .302  .380  .614    58.6    12.2
Luis      Gonzalez   2001  100   98   36    7   57   1   411  .325  .420  .688    88.0    12.3
Ryan      Howard     2006  108   98   25    1   58   0   399  .313  .421  .659    79.8    12.7
Shawn     Green      2001   72  100   31    4   49  20   435  .297  .371  .598    60.8    13.3
George    Foster     1977   61  112   31    2   52   6   418  .320  .382  .631    65.1    13.6
Ken       Griffey    1999   91   96   26    3   48  24   433  .286  .379  .576    60.5    13.8

Conclusions

Finding similarities between different players is one of the most interesting aspects of sabermetrics, but it has been sorely neglected as an area of research. In this essay, I have tried to put the concept of player similarity on more solid ground by introducing the idea of “run distance” between two different statistical records.

One advantage of looking at player similarity in this new way is that problems which were previously very difficult to address now become simple. For instance, a common complaint about Similarity Scores is that a mediocre player in a high offense era can show a superficial similarity to a much better player in a low offense era. It’s not at all clear how this problem could be corrected using the traditional formula, but simply dividing the run value in each category by the number of runs per game scored in a particular park or league naturally produces a historically corrected Similarity Score. It would be similarly easy to construct a rate-based Similarity Score, where each category is divided by plate appearances, to account for seasons with differing amounts of playing time.
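Both adjustments amount to dividing each season's run contributions by a scale factor before taking the distance. The sketch below is one possible way to do that, not a tested method. The `a_scale` and `b_scale` parameters are hypothetical names: they could hold the runs per game of each season's league or park (values the caller would have to supply), or each season's plate appearances. Plate appearances are approximated here as walks plus hits plus outs, which is an assumption that ignores hit batsmen and sacrifices.

```python
import math

RUN_VALUES = {"BB": 0.33, "1B": 0.47, "2B": 0.78, "3B": 1.09,
              "HR": 1.40, "SB": 0.30, "Outs": 0.26}


def scaled_run_distance(a, b, a_scale=1.0, b_scale=1.0):
    """Run distance after dividing each season by its own scale factor.

    Pass league or park runs per game for an era-adjusted distance,
    or plate appearances for a rate-based one. With both scales left
    at 1.0 this is the plain run distance.
    """
    return math.sqrt(sum((w * (a[c] / a_scale - b[c] / b_scale)) ** 2
                         for c, w in RUN_VALUES.items()))


def approx_plate_appearances(line):
    # Assumption: PA ~ BB + H + outs (no HBP or sacrifices in the data).
    return (line["BB"] + line["1B"] + line["2B"] + line["3B"]
            + line["HR"] + line["Outs"])


def rate_distance(a, b):
    """Distance in runs per plate appearance."""
    return scaled_run_distance(a, b,
                               approx_plate_appearances(a),
                               approx_plate_appearances(b))


ichiro_2001 = {"BB": 30, "1B": 192, "2B": 34, "3B": 8, "HR": 8,
               "SB": 56, "Outs": 450}
pierre_2004 = {"BB": 45, "1B": 184, "2B": 22, "3B": 12, "HR": 3,
               "SB": 45, "Outs": 457}
# A hypothetical half season at exactly the same rates as Ichiro's 2001.
half_season = {c: v / 2 for c, v in ichiro_2001.items()}

print(scaled_run_distance(ichiro_2001, half_season))  # large: raw totals differ
print(rate_distance(ichiro_2001, half_season))        # essentially zero
print(rate_distance(ichiro_2001, pierre_2004))
```

The half-season example shows the point of the rate-based version: a hitter who produces at identical rates in half the playing time is far away in raw run distance but sits at essentially zero distance per plate appearance.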

Improved Similarity Scores can help sharpen Hall of Fame debates by pointing out when a season is truly unique, or comparable to the greats of the past. Mostly, though, I hope that the improved Similarity Scores presented in this article will add to the enjoyment of baseball statistics by pointing out the unexpected similarities and parallels in baseball history.
