Putting the scissor to defense (Part 1)

by Colin Wyers
May 28, 2009

Does the world need another defensive metric? Well, I needed one, at least. I will admit that it’s significantly inspired by what has preceded it, and as a system in its infancy it has some ground to make up to join the rest. I think there are a few little clever bits to the system that may interest you, though.

Moreover, I hope this is the last time anyone ever has to commence building such a thing from scratch, at least with the data at hand. Why? Because I’ll be providing full source code at the end. So if you’ve ever needed your own defensive metric as well, congratulations, you have one, too.

Two for the price of one

But wait, there’s more! Simple Zone Rating (pronounce it scissor, if you like) is not really one defensive evaluation system but two. SZR began with a simple premise: Try to devise the simplest play-by-play metric possible that would work with all years of Retrosheet data. (The availability of certain parts of the play-by-play data in Retrosheet varies greatly from season to season, particularly the further back in time you go.)

Then I got to wondering: How much of that system could I approximate using simply official fielding statistics? I then proceeded to cheat and start using official batting and pitching stats as well, but the result is still a fielding system that works without play-by-play data. I found this pretty exciting, to be honest.

Basic concepts

Either version of SZR works based upon the idea of judging a fielder’s efficacy by this formula:

Plays made/chances

Defining plays made is relatively trivial: We have a record of putouts and assists that provides a pretty good idea of when an infielder made a play on a batted ball (and that understanding can be enhanced by looking at the play-by-play data Retrosheet provides). Defining what a chance is proves to be much more bedeviling. So let’s break it down. A player’s fielding chances are:

His plays made.
A ball he would have made a play on, if not for an error.
A ball that was batted cleanly for a hit, but was fieldable by the player.

Assigning hits to fielders is the hard part. What we want to do, in an ideal world, is assign partial credit for a hit to those players most capable of fielding it. For instance, with a ground ball hit between the shortstop and third baseman, we’d assign partial credit to each of them. And then ideally we’d compare a fielder to how other fielders at his position would have done.

For an infielder, what we’d ideally like to know about a batted ball when assigning credit is:

Where it was hit—at the very least, whether it went to left field, right field or center field. Ideally we have some sort of a vector or zone breakdown.
What type of ball is it? Ground ball? Fly ball? Line drive? Ideally we’d just get a distance and hang time measurement and be left to our own devices here, but such is life.
Where an infielder was standing when the ball was hit. Nobody I’m aware of tracks this systematically, but it’d be very nice to have.

Unfortunately, we have almost none of that for the majority of the Retrosheet era. In a rather cruel twist (for our purposes, at least), the typical practice prior to about 1989 or so was to record batted ball type for outs but not hits. No, I don’t get the sense of that, either. We also don’t have location data, not even an idea of which outfielder fielded the ball. (Again, that was recorded for outs, but not for hits.) And we obviously don’t have that data for the years before Retrosheet. So what are we to do?

We estimate. In other words, we look at what we do know and see what it can tell us about what we don’t know. This usually works well, if only because our estimates are built on the typical relationship between what we do know and what we are trying to estimate. As we get to less typical cases, we of course lose accuracy in our estimates. The general hope is that over time, the inaccuracies in our estimates wash out. This won’t always be the case, of course, so it pays to be cautious when applying the results.

My belief is that the perfect is the enemy of the good. Yes, these results are flawed, but so long as we bear that in mind, they’re better than no results at all. That of course doesn’t absolve me, or anyone, of the duty to fix mistakes and make improvements when possible. Still, I think it’s worthwhile to try, even when we know that perfection isn’t possible.

Play-by-play data

For the Retrosheet years, a play is made when the fielder who originally handled the ball is awarded a putout or assist on a ground ball. That is pretty straightforward and requires no estimation on our part.

Here is how hits are charged to players in SZR for years in which we have play-by-play data. First, the assumption is made that responsibility for hits on balls in play (that is, excluding home runs) is distributed among the positions in the same proportions as the plays made at each position. This is of course not entirely correct, but any correction to this probably would introduce as much potential error into our figuring as it would remove.

So, for each season, the percentage of plays made at each position was calculated based upon the handedness of the batter and the pitcher—so the percentages with a lefty on the mound facing a righty are different from a lefty facing a lefty or a righty facing a righty. Then, each hit is credited to the fielders who were on the field at the time, based upon this percentage. Let’s look at the table from 1961, for instance.

BAT_HAND	PIT_HAND	FLD1	FLD2	FLD3	FLD4	FLD5	FLD6	FLD7	FLD8	FLD9
L	L	0.10	0.03	0.16	0.22	0.08	0.11	0.09	0.11	0.10
L	R	0.07	0.02	0.14	0.22	0.08	0.11	0.11	0.14	0.10
R	L	0.07	0.03	0.07	0.12	0.15	0.19	0.09	0.15	0.14
R	R	0.08	0.03	0.07	0.13	0.17	0.20	0.09	0.13	0.11

So if a left-handed hitter gets a hit off a right-handed pitcher, we assign 14 percent of the credit to the first baseman, 22 percent to the second baseman, 11 percent to the shortstop and 8 percent to the third baseman. Let’s call a player’s totals in this regard his partial hit credits.
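The 1961 lookup can be sketched in Python as follows. This is a minimal illustration rather than the system's actual code: the dictionary layout and the function name hit_credits are assumptions, while the numbers are copied from the table above.

```python
# Share of plays made by each position in 1961, keyed by
# (batter hand, pitcher hand). Values are copied from the table above;
# list order follows FLD1..FLD9 (1 = pitcher, ..., 9 = right field).
SHARES_1961 = {
    ("L", "L"): [0.10, 0.03, 0.16, 0.22, 0.08, 0.11, 0.09, 0.11, 0.10],
    ("L", "R"): [0.07, 0.02, 0.14, 0.22, 0.08, 0.11, 0.11, 0.14, 0.10],
    ("R", "L"): [0.07, 0.03, 0.07, 0.12, 0.15, 0.19, 0.09, 0.15, 0.14],
    ("R", "R"): [0.08, 0.03, 0.07, 0.13, 0.17, 0.20, 0.09, 0.13, 0.11],
}


def hit_credits(bat_hand, pit_hand, shares=SHARES_1961):
    """Return {position number: partial hit credit} for a single hit."""
    row = shares[(bat_hand, pit_hand)]
    return {pos: share for pos, share in enumerate(row, start=1)}


# One hit by a left-handed batter off a right-handed pitcher.
credits = hit_credits("L", "R")
infield = {pos: credits[pos] for pos in (3, 4, 5, 6)}
print(infield)  # {3: 0.14, 4: 0.22, 5: 0.08, 6: 0.11}
```

Summing these per-hit credits over every hit allowed while a player was on the field would give his partial hit credits for the season.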

We can further adjust these values for a number of factors. For one, we can look at the groundball tendencies of the hitter and pitcher. Using the log5 method, we can figure the odds of a ball in play being a ground ball or an air ball, and adjust the division of out probability among infielders and outfielders accordingly.
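The exact log5 calculation isn’t spelled out here, so the following is only a sketch using the common odds-ratio form of log5. The function name and the sample rates are assumptions chosen for illustration.

```python
def log5(batter_rate, pitcher_rate, league_rate):
    """Estimate the chance of a ground ball for a batter/pitcher matchup.

    Uses the odds-ratio form of log5: each rate is compared against the
    league rate, so two exactly average players yield the league rate.
    """
    gb = batter_rate * pitcher_rate / league_rate
    air = (1 - batter_rate) * (1 - pitcher_rate) / (1 - league_rate)
    return gb / (gb + air)


# Hypothetical matchup: batter and pitcher both put 50 percent of their
# balls in play on the ground in a league that averages 45 percent.
matchup_gb = log5(0.50, 0.50, 0.45)
print(round(matchup_gb, 3))  # 0.55
```

Two groundball-heavy players facing each other come out more groundball-prone than either is alone, which is the behavior log5 is meant to capture.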

From here we can calculate a player’s zone rating as:

Plays made/(Plays made + errors fielding ground balls + partial hit credits)

Or more simply, plays made divided by chances. A player’s plus-minus rating is the difference between his zone rating and the league average zone rating, multiplied by his chances: (zone rating - league average zone rating) * chances.
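As a worked example, here is a minimal Python sketch of the zone rating and plus-minus calculations. The function names and the sample season are hypothetical; only the formulas come from the text above.

```python
def zone_rating(plays_made, gb_errors, partial_hit_credits):
    """Plays made divided by chances, as defined above."""
    chances = plays_made + gb_errors + partial_hit_credits
    return plays_made / chances


def plus_minus(plays_made, chances, league_zr):
    """Plays made above (or below) what an average fielder would make."""
    return (plays_made / chances - league_zr) * chances


# Hypothetical shortstop season: 450 plays made, 20 errors on ground
# balls and 130 partial hit credits, in a league with a .720 zone rating.
plays, errors, hit_credits_total = 450, 20, 130.0
chances = plays + errors + hit_credits_total          # 600.0
zr = zone_rating(plays, errors, hit_credits_total)    # 0.75
pm = plus_minus(plays, chances, 0.72)                 # about +18 plays
print(round(zr, 3), round(pm, 1))
```

Note that plus-minus is equivalent to plays made minus the plays an average fielder would have made on the same number of chances.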

Without play-by-play data

This is where it gets trickier. Let’s start with estimating plays made. I first looked at the average player’s plays made in relation to his putouts and assists, and used those figures to come up with these simple formulas to estimate plays made:

1B: .85*A + .08*PO
2B: .85*A
SS: .85*A
3B: .90*A + .06*PO
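In Python, these estimates might look like the sketch below. The weights come straight from the formulas above; the table layout, the function name and the sample fielding line are assumptions.

```python
# (assist weight, putout weight) for each infield position,
# taken from the formulas above.
PLAY_WEIGHTS = {
    "1B": (0.85, 0.08),
    "2B": (0.85, 0.00),
    "SS": (0.85, 0.00),
    "3B": (0.90, 0.06),
}


def estimated_plays_made(position, assists, putouts):
    """Estimate plays made on batted balls from official fielding totals."""
    assist_weight, putout_weight = PLAY_WEIGHTS[position]
    return assist_weight * assists + putout_weight * putouts


# Hypothetical third baseman with 300 assists and 100 putouts:
# .90 * 300 + .06 * 100 = about 276 plays made.
print(round(estimated_plays_made("3B", 300, 100), 1))
```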

That’s it. No cute claim points or any other complex-looking set of figures. (I’ve yet to determine if this is a plus or a minus for the system.)

So that’s plays made. We’ll treat all errors as chances, again for simplicity’s sake. What about hits?

We first have to assign responsibility for hits at the team level, and then work our way down to the individual player level. We start off in much the same way as we did for assigning hits with the play-by-play data: We figure out the percentage of plays made by each position at the league level, and then for every hit allowed by a team, we assign partial credit to each of the team’s fielding units. (In this case, all of the players who spend time at first base are part of that fielding unit, and so on.)

Then we divide those partial hit credits among the various members of that unit. Now, we use games played at that position to figure out how much playing time a player has had at that position, but we intuitively know that a starting player tends to have more innings per game than a bench player or reserve. So a player’s total plate appearances are used to determine how many innings he should be credited for per game.
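The two-step allocation can be sketched as follows. Be warned that the mapping from plate appearances to innings per game is not given here, so est_innings_per_game below is a pure placeholder of my own, as are the other names and the sample numbers. Only the overall flow (team hits to fielding unit, then unit to players by estimated playing time) follows the description above.

```python
def est_innings_per_game(plate_appearances, full_time_pa=600,
                         min_innings=3.0, max_innings=9.0):
    """Placeholder assumption: more plate appearances means the player
    is credited with more innings per game, capped at a full game."""
    frac = min(plate_appearances / full_time_pa, 1.0)
    return min_innings + (max_innings - min_innings) * frac


def split_unit_credits(unit_credits, players):
    """Divide a fielding unit's hit credits by estimated innings.

    players is a list of (name, games_at_position, plate_appearances).
    """
    weights = {name: games * est_innings_per_game(pa)
               for name, games, pa in players}
    total = sum(weights.values())
    return {name: unit_credits * w / total for name, w in weights.items()}


# Step 1: a hypothetical team allows 1,000 non-HR hits, and first
# basemen make 15 percent of the league's plays.
unit_credits = 1000 * 0.15  # 150 partial hit credits for the 1B unit

# Step 2: split those credits between a starter and a reserve.
first_base_unit = [("Starter", 140, 600), ("Reserve", 40, 150)]
shares = split_unit_credits(unit_credits, first_base_unit)
print({name: round(value, 2) for name, value in shares.items()})
```

With these made-up inputs the starter absorbs most of the unit's credits, which is the intent: a regular should be charged for more innings per game than a bench player.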

Now that we have estimates of both plays made and chances, we can figure zone rating and plus-minus as above. In both versions of the system, we can convert plays to runs by multiplying by the average run value of a hit (absent homers) relative to an out. That value is typically about .7.
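The runs conversion is a single multiplication. A tiny sketch, with a hypothetical function name and sample input, and the .7 figure taken from the text:

```python
# Approximate run value of a non-HR hit relative to an out (from the text).
RUN_VALUE_PER_PLAY = 0.7


def plays_to_runs(plus_minus_plays, run_value=RUN_VALUE_PER_PLAY):
    """Convert plays above or below average into runs saved or cost."""
    return plus_minus_plays * run_value


# A fielder who is 20 plays above average saves roughly 14 runs.
print(round(plays_to_runs(20.0), 1))
```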

What’s past is prologue

This is a first, rough pass at this. Notably absent are, well, outfield rankings. Less notably absent (but still absent, and needed) are groundball/flyball and left-handed/right-handed pitching adjustments for the non-play-by-play version, as well as various subtler adjustments to the play-by-play measures. All of this will come in the fullness of time.

But let’s get a snapshot of the system so far. Values below are a combination of the two systems, using the PBP version when it is available and otherwise using the simpler version. First, the infielders with the highest career totals in plays above average:

NAME	POS	YEARS	CH	PLUS_MINUS
Brooks Robinson	5	1955-1977	8933	382.8
Germany Smith	6	1884-1898	8644	289.2
Ozzie Smith	6	1978-1996	11090	286.6
Mark Belanger	6	1965-1982	7529	281.1
Jack Glasscock	6	1880-1895	7870	276.6
Bid McPhee	4	1882-1899	9585	251.3
Joe Tinker	6	1902-1916	7576	251.1
Travis Jackson	6	1922-1936	5980	237.2
Graig Nettles	5	1968-1988	7734	221.4
Billy Jurges	6	1931-1947	6364	218.8
Jimmy Collins	5	1895-1908	5416	218
Hughie Critz	4	1924-1935	6529	216.2
Art Fletcher	6	1909-1922	6592	203.6
Buddy Bell	5	1972-1989	7144	199
Dave Bancroft	6	1915-1930	8590	187.2
Marty Marion	6	1940-1952	6105	184.7
Terry Pendleton	5	1984-1998	5729	183.5
Bill Dahlen	6	1891-1911	10335	182.6
Lou Boudreau	6	1939-1952	6034	176.6
Joe Gordon	4	1938-1950	5865	172.6

And for players who made the most (fewest?) plays below average:

NAME	POS	YEARS	CH	PLUS_MINUS
Eddie Yost	5	1944-1962	5839	-171.7
Derek Jeter	6	1995-2008	7121	-157.8
Cub Stricker	4	1882-1893	5195	-148.2
Ed McKean	6	1887-1899	7470	-146.7
Larry Doyle	4	1907-1920	6307	-135.5
Heinie Sand	6	1923-1928	3543	-127.2
Dean Palmer	5	1989-2003	3144	-123.3
Jim Bottomley	3	1922-1937	3614	-122.9
Pinky Higgins	5	1930-1946	4985	-121.8
Bill Madlock	5	1973-1987	4132	-119.7
Red Kress	6	1927-1940	3428	-115.5
Jorge Orta	4	1972-1984	2357	-111.2
Tommy Dowd	4	1891-1898	1443	-108.5
Bob Aspromonte	5	1960-1971	3064	-105
Fred McGriff	3	1986-2004	4533	-103.8
Milt Stock	5	1914-1925	3795	-101.7
Mo Vaughn	3	1991-2003	2659	-100.4
Hal Chase	3	1905-1919	3821	-99.9
Fresco Thompson	4	1925-1931	2894	-98.8
Mickey Vernon	3	1939-1959	4413	-96

I don’t think that’s too crazy for a first pass. (And oh, look, every sabermetrician’s dream: to rank Derek Jeter’s defense and find it wanting! Truly I have arrived.)

Until next time… be seeing you.

References & Resources
Play-by-play information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.

Seasonal totals come from the Baseball Databank.

Major inspirations upon the design of the play-by-play system were TotalZone, Simple Fielding Runs and Ultimate Zone Rating.

The non-PBP metric has been inspired by a number of sources, such as Fielding Win Shares, DRA and Range.

Groundball rates for hitters and pitchers were regressed to the mean before use. I used the weighted average method from Tom Tango, figuring R from the method involving random and observed variance.
