Introducing the Rate Similarity Score
by Kerry Whisnant
February 23, 2012
Bill James introduced the concept of similarity scores by comparing career totals in games played, at-bats, runs, hits, and other offensive stats. Most of the stats used are counting stats, and therefore players can be similar only if they are similar hitters and have similar numbers of plate appearances.
This is fine as far as it goes, but it doesn’t identify similar batters with very different career lengths. So what if you made a similarity score based on purely rate stats? It should identify similar offensive players, no matter how many games they played.
Enter the Rate Similarity Score (RSS), which compares the rates of singles, doubles, triples, home runs, walks, hit by pitches, and strikeouts per plate appearance, and the number of stolen bases divided by the number of singles plus walks plus hit by pitches. This last ratio is meant to crudely estimate how often a player attempted to steal a base when given the opportunity, without using play-by-play data, which is not available for every player.
The RSS is determined by first finding the differences between the rates of two players in the above categories, each divided by the typical maximum rate difference for that stat. Then a root mean square difference (RMSD) is found over the eight rate stat ratios; the RSS is then 1 minus the RMSD. Two identical players would have an RMSD of 0 and therefore an RSS of 1, while two extremely dissimilar players would have an RSS near 0, although in practice the minimum RSS is not too far below 0.400. The RMSD can be thought of as the distance between players in an eight-dimensional space of rate stats.
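Written out, the definition above amounts to the following. The symbols are notation introduced here for illustration and are not from the original calculation: r_i^A and r_i^B are the eight rates for players A and B, and d_i is the typical maximum rate difference used to normalize stat i.

```latex
\mathrm{RMSD}(A,B) = \sqrt{\frac{1}{8}\sum_{i=1}^{8}\left(\frac{r_i^{A}-r_i^{B}}{d_i}\right)^{2}},
\qquad
\mathrm{RSS}(A,B) = 1 - \mathrm{RMSD}(A,B)
```

If two players differ by exactly d_i in a single stat and are identical in the other seven, the RMSD is the square root of 1/8, about 0.354, and the RSS is about 0.646.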
Rates have been used to determine the similarity of players in The Hardball Times articles by Chris Jaffe on a season-by-season basis, and by Josh Kalk for pitchers using Pitch F/X data, and by Baseball Prospectus as part of their PECOTA rating system. I like RSS because it is relatively simple and uses career data, and provides a nice complement to the traditional similarity score.
The maximum rate differences used to normalize the rate differences between two players are: .160 for singles, .053 for doubles, .031 for triples, .076 for home runs, .180 for walks, .051 for HBP, and .330 for strikeouts (all using the rate per PA), and .460 for stolen bases divided by singles plus walks plus HBP. These values were taken from the maximum rate differences for all batters with at least 3000 PA, rounded to two significant figures.
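For readers who want to experiment, here is a minimal Python sketch of the calculation as described above. The normalizing constants are the ones quoted in the text. The dictionary layout, the function names `rates` and `rss`, and the two stat lines are assumptions made up for illustration; they are not real players and not the spreadsheet's actual implementation.

```python
from math import sqrt

# Typical maximum rate differences quoted in the article.
MAX_DIFF = {
    "1B": 0.160, "2B": 0.053, "3B": 0.031, "HR": 0.076,
    "BB": 0.180, "HBP": 0.051, "SO": 0.330, "SB": 0.460,
}

def rates(career):
    """Turn career counting stats (hypothetical dict layout) into the eight rates."""
    pa = career["PA"]
    r = {k: career[k] / pa for k in ("1B", "2B", "3B", "HR", "BB", "HBP", "SO")}
    # Stolen bases per single, walk, or HBP: a crude steal-opportunity rate.
    chances = career["1B"] + career["BB"] + career["HBP"]
    r["SB"] = career["SB"] / chances if chances else 0.0
    return r

def rss(a, b):
    """Rate Similarity Score: 1 minus the RMS of the normalized rate differences."""
    ra, rb = rates(a), rates(b)
    mean_sq = sum(((ra[k] - rb[k]) / MAX_DIFF[k]) ** 2 for k in MAX_DIFF) / len(MAX_DIFF)
    return 1.0 - sqrt(mean_sq)

# Invented stat lines: identical except for a 0.076 gap in home run rate.
player_a = {"PA": 1000, "1B": 150, "2B": 50, "3B": 5, "HR": 100,
            "BB": 100, "HBP": 10, "SO": 200, "SB": 26}
player_b = dict(player_a, HR=24)

print(round(rss(player_a, player_b), 3))  # about 0.646, i.e. 1 - sqrt(1/8)
```

Because every difference is divided by its own maximum, a gap of .076 in home run rate counts as much as a gap of .330 in strikeout rate.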
Using this definition of RSS, the most similar players among those with 5,000 PA or more through the 2011 season (915 total players) are (# represents the similarity rank):
| 1/2 | Milt Stock/Dave Cash | Dave Cash/Milt Stock | 0.97897 |
Double entries occur when a player is the best match for the person who is their best match.
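Finding these double entries is a matter of checking whether a player's best match points back at him. The sketch below assumes a symmetric scoring function such as RSS. The names `best_matches` and `mutual_pairs`, the placeholder players A through D, and the toy scores are all invented for illustration.

```python
def best_matches(names, score):
    """Map each player to the other player with the highest score(a, b)."""
    return {a: max((b for b in names if b != a), key=lambda b: score(a, b))
            for a in names}

def mutual_pairs(best):
    """Pairs in which each player is the other's best match (a double entry)."""
    return sorted({tuple(sorted((a, b))) for a, b in best.items() if best.get(b) == a})

# Toy symmetric scores for four placeholder players (not real data).
toy = {
    frozenset("AB"): 0.95, frozenset("AC"): 0.90, frozenset("AD"): 0.70,
    frozenset("BC"): 0.80, frozenset("BD"): 0.75, frozenset("CD"): 0.85,
}
toy_score = lambda a, b: toy[frozenset((a, b))]

best = best_matches("ABCD", toy_score)
print(best)                # C's best match is A, but A's best match is B
print(mutual_pairs(best))  # only A and B form a double entry
```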
Note that RSS doesn’t account for length of career, different eras (although in principle it could if you used neutralized stats), defensive position, or defensive ability. For example, Phil Rizzuto actually had fewer PA than Bluege, but amassed more WAR (30.8 versus 24.7 according to baseball-reference.com) in that shorter time, played in a less offensive era, and was also MVP one year; Rizzuto, of course, ended up in the Hall of Fame. The Alomar/Larkin pairing is interesting, since they are both middle infielders and both were inducted into the Hall of Fame. Their main differences are that Alomar had 1,343 more PA, and, as a compensating factor, Larkin played a more demanding position (shortstop versus second base).
Bobby Bonilla and Fred Lynn had very similar career lengths and similar slash stats (AVG/OBP/SLG): .279/.358/.472 and .283/.360/.484, respectively. Lynn was Bonilla’s most similar using the standard similarity score—baseball-reference.com version—and Bonilla was Lynn’s second most similar. If there were no positional points for the standard similarity score, Bonilla very likely would have been Lynn’s most similar. You could just use slash stats to determine a rate similarity, but RSS has more detail since it considers doubles, triples and home runs separately (rather than simply total bases), differentiates between walks and hit by pitches, and also includes stolen bases and strikeouts.
The above table has a number of other interesting comparisons. Most similar by RSS, Winfield and Smith were actually most similar by the standard similarity test at ages 27, 29 and 30, but Winfield went on to a much longer career (12358 PA versus 8051). Otherwise Smith compares favorably with Winfield, including being a much better defender. Stan Hack was most similar to Luke Appling, except that Appling had about 20 percent more PA. Recent Hall of Fame inductee Santo had 1958 more PA than Wertz, and played in an era with less offense.
The most distinctive player according to his rate stats (among the 915 players with at least 5000 PA) is Billy Hamilton; he had the lowest RSS for a most similar player (Eddie Collins, .856683). Others whose most similar player had a low RSS were Otis Nixon (Dave Collins, .87250), Barry Bonds (Babe Ruth, .87476), Ted Williams (Babe Ruth, .87730), Hughie Jennings (Tommy Tucker, .87831), Rickey Henderson (Joe Morgan, .88203), and Mark McGwire (Harmon Killebrew, .88644). For all but Jennings, Stovey and Greenberg, the player with the highest RSS for these players also shows up in their Top 10 most similar list, according to the standard similarity score.
Other interesting pairs who were most similar to each other using RSS:
In some of the cases shown here a most similar player is in the Hall of Fame, or often discussed as a possible candidate, while the other is not. Usually it is a difference in career length (number of PA) or fielding position that accounts for our different assessments of two players who were actually very similar offensively, as measured by their RSS. Although RSS does not include a positional factor (unlike the traditional similarity score), in many cases two most similar players did play the same or similar positions, as can be seen in the above tables. This should not be too surprising since a given field position is often manned by similar players. Also interesting is that in a few cases the most similar player was a teammate: Boyer/White, McGriff/Justice and Bagwell/Berkman.
The two most dissimilar players (lowest RSS) with 5000 or more PA are McGwire and Jennings, with an RSS of .37948. The next two smallest similarities also involve McGwire, with Willie Keeler (.38546) and Joe Jackson (.38624). Few home runs or strikeouts and many triples and stolen bases are the major distinguishing characteristics of these players when compared to McGwire.
The players who had the most players to which they were most dissimilar (least similar) are:
| Player | # most dissimilar to |
| Mark McGwire | 526 |
| Hughie Jennings | 293 |
| Vince Coleman | 63 |
| Willie Keeler | 23 |
| Buck Ewing | 8 |
| Billy Hamilton | 1 |
| Joe Jackson | 1 |

Clearly McGwire had a very unusual batting profile. Adam Dunn was the second most dissimilar to 395 players, and he was the third most similar to McGwire.
The player whose smallest RSS was the largest of anyone's was Brian Jordan, whose smallest RSS (.63576) was with Vince Coleman. Jordan might be considered the most average player, not too far from anyone else. A close second was Cy Williams, with a smallest RSS of .63390, also compared to Vince Coleman.
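Both tallies in this section, how often a player is someone's least similar and who has the largest smallest score, can be sketched in a few lines of Python. As before, the function names, the placeholder players A through D, and the toy scores are assumptions invented for illustration, not the actual data.

```python
from collections import Counter

def least_similar(names, score):
    """Map each player to the other player with the lowest score(a, b)."""
    return {a: min((b for b in names if b != a), key=lambda b: score(a, b))
            for a in names}

def dissimilar_counts(names, score):
    """Count how many players name each player as their least similar."""
    return Counter(least_similar(names, score).values())

def most_average(names, score):
    """Player whose smallest score against everyone else is the largest."""
    return max(names, key=lambda a: min(score(a, b) for b in names if b != a))

# Toy symmetric scores for four placeholder players (not real data).
toy = {
    frozenset("AB"): 0.95, frozenset("AC"): 0.90, frozenset("AD"): 0.70,
    frozenset("BC"): 0.80, frozenset("BD"): 0.75, frozenset("CD"): 0.85,
}
toy_score = lambda a, b: toy[frozenset((a, b))]

counts = dissimilar_counts("ABCD", toy_score)
print(counts.most_common())             # D is least similar to two players
print(most_average("ABCD", toy_score))  # C: its worst score, 0.80, beats everyone else's worst
```

Since every player has exactly one least similar player, the counts always add up to the number of players, just as the table's counts add up to 915.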
There is a lot more that hasn’t been shown here. If anyone wants to look at an Excel spreadsheet with the five most similar and five most dissimilar players for every hitter with at least 5000 PA through the 2011 season, the file may be found here. The spreadsheet also has the top five similars and dissimilars for all players with 3000 or more plate appearances.
You could include other stats such as runs and RBI in RSS, but those are more situational and are more affected by a player’s teammates. The number of PA/G could also be used, which would measure the extent to which a player was used as a starter or substitute, but in developing RSS I wanted something that compared players’ offensive performance on the field. I have calculated RSS values including runs, RBI and games, and in many cases the same names show up as most similar, but not always.
Of course you can define an RSS for pitchers as well, which will be the topic of a future article. The spreadsheet linked above also contains pitching similars and dissimilars. Another possibility I have been looking at is to neutralize stats for era before calculating RSS.
Kerry Whisnant is a professor of physics at Iowa State University during the day, avid St. Louis Cardinals fan at night, and occasionally writes articles for dugoutcentral.com.