Ever since the publication of Joseph Adler’s Baseball Hacks in 2006 made me aware of the existence of MLB Gameday’s hit location data, I have been excited about the possibility of using this data in baseball analysis. Retrosheet is a magnificent resource that remains the rock on which all amateur, but serious, baseball statistical research is based. But one of the few things that Retrosheet lacks is reliable hit location data.
Hit location information has been securely in the hands of the for-profit baseball data collectors, such as Baseball Info Solutions and STATS Inc. Getting access to it seemed beyond my means. On the other hand, MLB Gameday seems like a realistic, viable, no-cost alternative. But the process of getting good fielding data from Gameday has been much more difficult than I envisioned.
MLB Gameday uses hit location data primarily to create the hit charts shown in the statistical information for batters. The data is collected by contract employees of MLB who sit in the press box with a laptop computer and mark each hit location by pointing a cursor directly at an image of the park. The field image is just 250 by 250 pixels. The x and y pixel coordinates of the hit location are then stored in the Gameday XML files. Fractions of a pixel appear in the hit location files, apparently due to a translation from the input coordinate system to a slightly different output coordinate system.
The information is used by Gameday for images and entertainment purposes and was never intended for analysis. But the person inputting the data is instructed to make the information as accurate as possible, given the limitations of the system. The hit location information also includes the identity of the hitter and pitcher, a brief description of the hit ball outcome (Single, Fly out or Error, for example), the inning, and notations for hit or out, and home or away batting team.
To use the hit locations for fielding analysis, the raw data has to be downloaded and appended to an existing play-by-play database and the location in pixels converted to on-field X and Y coordinates in feet, and ultimately angle and distance format. Although appending the data to Retrosheet poses its own set of problems, they are not insurmountable. Today’s discussion is about converting the pixel information.
There are two steps to the process: finding the exact location of home plate, and establishing a multiplier to convert pixels to feet.
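As an illustration only, the two steps can be sketched in Python. The function name pixel_to_field is hypothetical, and two conventions are assumptions made for the example rather than anything documented by Gameday: that pixel y grows downward on the image (so the field lies at smaller y values than home plate), and that angle is measured from the line running from home plate toward center field, negative toward left field and positive toward right field.

```python
import math

def pixel_to_field(px, py, home_x, home_y, multiplier):
    """Convert a Gameday pixel location to feet, then to distance and angle.

    Assumption: image y grows downward, so the y axis is flipped to make
    "away from home plate" positive.
    Assumption: angle is in degrees from the home-to-center-field line,
    negative toward left field, positive toward right field.
    """
    # Step 1: shift the origin to home plate (still in pixels).
    dx_pixels = px - home_x
    dy_pixels = home_y - py
    # Step 2: scale pixels to feet with the park's distance multiplier.
    x_ft = dx_pixels * multiplier
    y_ft = dy_pixels * multiplier
    distance_ft = math.hypot(x_ft, y_ft)
    angle_deg = math.degrees(math.atan2(x_ft, y_ft))
    return x_ft, y_ft, distance_ft, angle_deg

# A point 100 pixels straight "up" the image from home plate, using the
# 2005-2007 Boston factors listed in this article (125.7, 196.0, 2.75).
x_ft, y_ft, distance_ft, angle_deg = pixel_to_field(125.7, 96.0, 125.7, 196.0, 2.75)
```

Under these conventions the foul lines fall at minus 45 and plus 45 degrees, which makes the later between-the-lines checks a simple comparison on the angle.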
When I first began this process for an article on observational data that I wrote last year for the Hardball Times, I had assumed that home plate location and distance multiplier would be the same for each park. I was wrong.
For Gameday’s purposes, the data only has to be keyed to their own park image. Since they scale and locate the image to maximize the field area within the 250-by-250 pixel box, the home plate location and distance multiplier are different for each field: markedly different on a few fields, and not exactly the same on any two of them.
One way to resolve these differences would be to have pixel maps of all the fields, and to actually pinpoint the exact pixel locations of the back corner of home plate and the foul poles. But when I explored this possibility with Cory Schwartz at MLB.com I discovered another difficulty. Some of the maps were changed during the winter between the 2007 and 2008 seasons to eliminate the largest inconsistencies between the fields. It was great that MLB was trying to improve the accuracy of its data gathering, but for anyone who wanted to use multiple-year hit location data, it meant that there were potentially 60 data collections that needed to be adjusted instead of 30.
Rather than pursuing the graphical method of adjusting the data, I opted for an alternative method. I began with the assumption that certain classes of hit balls would have similar distribution patterns in all the parks over the course of a season. I could also impose some physical constraints on the data. For instance, ground balls fielded for outs by the infield would have to be located between the foul lines. So would almost all of the home runs. Line drives and most ground balls fielded by the pitcher would have to be fielded closer than 60 feet to home plate.
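The physical constraints described here lend themselves to a simple automated check. The sketch below is a hypothetical illustration, not the procedure actually used for this article: the function names, the use of standard scorekeeping position numbers (1 for pitcher through 6 for shortstop), and the angle convention (degrees from the home-to-center-field line, so the foul lines sit at plus and minus 45) are all assumptions. Because the pitcher constraint applies to most balls rather than all of them, the useful number is the share of plays that break a constraint, not any single violation.

```python
INFIELD_POSITIONS = {1, 2, 3, 4, 5, 6}  # standard scorekeeping numbers (assumed input)
FOUL_LINE_DEG = 45.0                    # foul lines under the assumed angle convention
PITCHER_MAX_FT = 60.0                   # pitcher plays should be closer than this

def constraint_violations(position, angle_deg, distance_ft):
    """Return a list of the physical constraints this fielded ball breaks."""
    problems = []
    if position in INFIELD_POSITIONS and abs(angle_deg) > FOUL_LINE_DEG:
        problems.append("outside foul lines")
    if position == 1 and distance_ft >= PITCHER_MAX_FT:
        problems.append("pitcher too deep")
    return problems

def violation_rate(plays):
    """Share of plays breaking at least one constraint.

    plays: iterable of (position, angle_deg, distance_ft) tuples.
    """
    plays = list(plays)
    if not plays:
        return 0.0
    bad = sum(1 for play in plays if constraint_violations(*play))
    return bad / len(plays)
```

A park whose violation rate drops as the trial home plate location and multiplier are adjusted is moving in the right direction; a rate that stays high suggests the trial values are still off.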
I also used Greg Rybarczyk’s HitTracker estimates of home run distance and angle to act as a reality check. With this conceptual framework I set out to normalize the data between fields for the two sets of data: 2005-7 and 2008.
In the past I had used the solver function of Excel for similar problems, but the number of variables for this problem exceeded its capacity, so I proceeded by hand in my existing Access database. Because of this, there is no guarantee that the numbers I ended up with are the very best possible. And, of course, there is no good way of checking how accurate my initial assumption of uniform hit ball distribution was.
But the results met most of my physical constraints. I actually ended up normalizing using only the non-bunt ground ball outs to infielders. I excluded outfield information because the differing fence distances in different parks caused the average locations of balls caught in the outfield to vary between the parks. I did check the outfield caught-ball locations for each park, and they varied in a manner consistent with their outfield fence distances.
The data normalized nicely. The average angles of hit-ball outs for each park could be brought within 1.5 degrees of the league average at each position in almost all cases. The ground ball infield distances were almost always within 2 feet of the league average.
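The comparison against league average can be expressed compactly in code. This is a minimal sketch under stated assumptions, not the Access queries I worked with: the function name park_deviations and the tuple layout are hypothetical, and the league average is taken here as the pooled mean over all plays at a position, which is only one reasonable reading of "league average."

```python
from collections import defaultdict
from statistics import mean

def park_deviations(plays):
    """Compare each park's average out location with the league average.

    plays: iterable of (park, position, angle_deg, distance_ft) tuples,
    intended to hold only non-bunt ground ball outs to infielders.
    Returns {(park, position): (angle_deviation_deg, distance_deviation_ft)}.
    Assumption: the league average is the pooled mean over all plays.
    """
    by_position = defaultdict(list)
    by_park_position = defaultdict(list)
    for park, position, angle_deg, distance_ft in plays:
        by_position[position].append((angle_deg, distance_ft))
        by_park_position[(park, position)].append((angle_deg, distance_ft))

    league = {
        position: (mean(a for a, _ in rows), mean(d for _, d in rows))
        for position, rows in by_position.items()
    }

    deviations = {}
    for (park, position), rows in by_park_position.items():
        league_angle, league_distance = league[position]
        deviations[(park, position)] = (
            mean(a for a, _ in rows) - league_angle,
            mean(d for _, d in rows) - league_distance,
        )
    return deviations
```

With a table like this in hand, the targets quoted above translate into checking that each angle deviation is within 1.5 degrees and each distance deviation within 2 feet, then adjusting the park's trial home plate location and multiplier and recomputing.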
There were, however, two problems that emerged during the normalization process. I had assumed from my conversation with Cory Schwartz that the only change in data collection was the redrawing of the fields for 2008, so I normalized all the pre-2008 data together. But when I double-checked the individual yearly totals, it was apparent that something was wrong with the data for Coors Field for 2007. For some reason the home plate location was drastically off. Consequently I re-normalized the Coors 2007 data and individual numbers are given for Coors 2007 in the table.
The second problem involved the outfield distances, which were very different between pre-2008 and 2008. When the calculated distances for infield ground outs were normalized to within a foot or so, the outfield out distances for 2008 were consistently much longer than those for pre-2008. This was true for each outfield position in each park.
The only explanation that I could determine was that when the fields were redrawn they were not drawn with a consistent scale between outfield and infield. So a single distance factor multiplier that was correct for infield distances would be off when applied to the outfield. This wasn’t a problem for my use of the data in a fielding metric, but anyone attempting to use the outfield data for other purposes would have to establish separate multipliers for 2008 to normalize the outfield data to pre-2008.
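For readers who do need comparable outfield distances, one possible starting point is to rescale the 2008 multiplier by the ratio of average caught-ball distances. This is a rough sketch of an approach I have not applied or validated: the function name is hypothetical, it assumes the pre-2008 distances are the reference, and it assumes the true average depth of outfield outs in a park did not change between the two periods.

```python
def outfield_multiplier_2008(infield_multiplier, mean_dist_pre2008_ft, mean_dist_2008_ft):
    """Estimate a separate 2008 outfield multiplier for one park.

    infield_multiplier: the park's 2008 multiplier fitted on infield ground outs.
    mean_dist_pre2008_ft: mean outfield-out distance from pre-2008 data.
    mean_dist_2008_ft: mean outfield-out distance from 2008 data when the
        infield multiplier is applied to outfield pixels.
    Assumption: true average outfield-out depth is unchanged across periods,
    so any difference in the means is a scale error in the 2008 drawing.
    """
    if mean_dist_2008_ft <= 0:
        raise ValueError("mean 2008 distance must be positive")
    return infield_multiplier * mean_dist_pre2008_ft / mean_dist_2008_ft

# Made-up numbers purely to show the call; these are not measured values.
example = outfield_multiplier_2008(2.70, 300.0, 324.0)
```

Applying two multipliers also requires deciding where the infield scale stops and the outfield scale begins, which this sketch leaves open.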
Given the inherent inaccuracies of human observation of hit-ball locations, and the recently reported greater-than-expected differences between the hit-ball locations recorded by STATS and by BIS, I believe the normalized MLB data to be competitive with them for some purposes. The next article will discuss the limitations of the data and present a framework for using the data to construct a fielding metric. Below are the home plate locations and distance multipliers that I am using for each field for both pre-2008 and 2008 hit location data.
2005-2007 MLB HIT LOCATION FACTORS
TEAM        HOME-PLATE-X   HOME-PLATE-Y   DISTANCE-MULTIPLIER
ANA         125.5          196.4          2.70
ARI         125.5          196.5          2.55
ATL         125.5          196.5          2.55
BAL         125.7          211.0          2.52
BOS         125.7          196.0          2.75
CHA         125.5          197.3          2.70
CHN         126.0          196.0          2.73
CIN         126.0          196.1          2.74
CLE         125.2          196.0          2.75
COL2005-6   124.5          194.4          2.77
COL2007     119.0          195.5          2.62
DET         125.9          198.7          2.70
FLO         125.8          197.0          2.72
HOU         125.2          196.2          2.80
KCA         125.5          197.3          2.71
LAN         125.8          195.8          2.70
MIL         126.4          194.9          2.70
MIN         125.1          196.2          2.65
NYA         125.7          195.2          2.80
NYN         124.6          195.4          2.83
OAK         126.1          197.2          2.60
PHI         126.4          198.8          2.62
PIT         125.2          197.5          2.58
SDN         125.4          197.3          2.70
SEA         125.3          197.2          2.76
SFN         125.7          195.0          2.64
SLN         126.0          197.6          2.70
TBA         125.4          198.0          2.65
TEX         126.5          195.4          2.75
TOR         126.2          197.0          2.68
WAS         126.8          197.5          2.64
2008 MLB HIT LOCATION FACTORS
TEAM        HOME-PLATE-X   HOME-PLATE-Y   DISTANCE-MULTIPLIER
ANA         125.5          198.8          2.78
ARI         125.1          201.5          2.37
ATL         126.8          201.3          2.40
BAL         125.9          201.5          2.65
BOS         124.6          200.4          2.65
CHA         125.0          200.2          2.62
CHN         125.4          201.1          2.58
CIN         126.3          200.8          2.64
CLE         125.5          202.4          2.66
COL         124.1          199.7          2.71
DET         125.5          201.0          2.71
FLO         124.5          200.1          2.66
HOU         125.2          201.6          2.68
KCA         124.6          195.1          2.86
LAN         125.7          199.1          2.77
MIL         125.1          198.1          2.69
MIN         125.2          197.7          2.72
NYA         125.7          197.4          2.85
NYN         125.3          197.1          2.95
OAK         125.5          200.4          2.61
PHI         125.5          200.5          2.71
PIT         125.3          202.3          2.60
SDN         126.2          199.4          2.63
SEA         125.8          199.8          2.82
SFN         125.8          197.9          2.75
SLN         125.7          195.4          2.81
TBA         123.5          199.4          2.61
TEX         125.5          199.8          2.70
TOR         126.7          197.0          2.83
WAS         125.1          200.5          2.64
Those of you who have explored the Gameday XML files know that hit locations are also given for the minor leagues. Normalizing that data so that it could be incorporated into a play-by-play database for minor league players could potentially improve our projections of their future performances in the majors. A process such as the one I used to normalize the major league data by park, using only infield ground ball outs, could certainly be used for minor league data as well.