Over the past few years, Daren Willman has made some of the pitch-by-pitch data generated by Statcast available on his Baseball Savant site. We have gotten only a taste of the kind of data that Statcast can produce, but even that taste is interesting and useful to work with.
The easiest way to get the data is through the Statcast Search query tool. After running a query (say, for all pitches thrown on April 24, 2017), you have the option of exporting the data as a comma-separated values (CSV) file. As of April 24, the query output has changed in some not-so-subtle ways. Daren was kind enough to share the changes with me ahead of time, which allowed me to quickly update the scrape_statcast_savant series of functions in my baseballr package.
I thought it would be helpful to outline what those changes are for people who have been working with the original data exports and plan on working with the new ones.
The old export included 60 variables; the new file has 75. In some cases, the export includes brand-new variables. In other cases, existing variables are being renamed. Of those being renamed, some will continue to be reported, while others will be deprecated and will not show values going forward. Let’s break these out into two separate lists, shall we?
|Column Names going away||New Column Names|
Looking at these two lists I noticed a few things:
- start_speed appears to have become release_speed. (Note: start_speed from 2008-2016 was generated by PITCHf/x. release_speed is being generated by Statcast. For more on this, see the discussion at Tom Tango’s site here.)
- The break variables are going to be deprecated.
- hit_speed will become launch_speed and hit_angle will be launch_angle.
- The strike zone coordinates are changing from px, pz to plate_x, plate_z.
- The player IDs for each position will be added as separate variables.
- They appear to be including variables with values and/or estimates of things like wOBA, BABIP, etc., for batted balls given angle, speed, etc.
To make life easier, I put together a simple crosswalk between the old and new data exports to show which variables are being renamed:
|OLD COLUMN NAMES||NEW COLUMN NAMES|
You will notice that for every column name that was “going away” there is a replacement, except for pitch_id. That appears to no longer be available, even as a deprecated column.
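The renames called out in the list above can be applied with a small named lookup. What follows is a minimal base R sketch, not the code from baseballr: it covers only the five renames mentioned in the text (assuming the old vertical coordinate was named pz), the helper name rename_old_columns is made up for illustration, and the toy data frame contains invented values.

```r
# Crosswalk limited to the renames discussed in the text: old name -> new name.
# (pz is assumed to be the old vertical strike-zone coordinate.)
crosswalk <- c(
  start_speed = "release_speed",
  hit_speed   = "launch_speed",
  hit_angle   = "launch_angle",
  px          = "plate_x",
  pz          = "plate_z"
)

# Hypothetical helper: rename any old-style columns found in df and
# leave every other column untouched.
rename_old_columns <- function(df, lookup = crosswalk) {
  hits <- names(df) %in% names(lookup)
  names(df)[hits] <- unname(lookup[names(df)[hits]])
  df
}

# Toy data frame with made-up values, for illustration only.
old <- data.frame(start_speed = 93.1, hit_speed = 101.4, px = 0.2, pz = 2.5,
                  pitch_type = "FF")
renamed <- rename_old_columns(old)
names(renamed)
```

Because the lookup is just a named vector, extending it to the full crosswalk is a matter of adding more old = "new" pairs.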
New Variables and Values
In terms of the new columns that aren’t simply replacements for some of the old columns, we get some fun new data to play with.
First, we are getting the mlbamids of the position players on the field, along with the position each was playing when the pitch was thrown. We don’t get positioning data in the export (at least, not this year), but knowing who was playing where can be useful in many ways.
Second, the crew at MLBAM appears to be gearing up to release their own measures in terms of estimated Weighted On-base Average (wOBA) and Batting Average (AVE) based on exit velocity and launch angle. The variables that start with estimated_ appear to show the average wOBA or AVE based on batted balls with similar launch angles and exit velocity.
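To make the idea of "batted balls with similar launch angles and exit velocity" concrete, here is a rough base R sketch of one way such an average could be built: bucket batted balls by speed and angle, then average the outcome within each bucket. This is not MLBAM's actual method, which the export does not document. The bin widths, the is_hit outcome column, and the est_avg name are all assumptions made up for illustration, as are the data values.

```r
# Made-up batted balls; is_hit is a hypothetical 0/1 outcome column.
bb <- data.frame(
  launch_speed = c(101.2, 102.8, 100.6, 85.9, 85.3, 86.0),
  launch_angle = c(24, 26, 27, -6, -4, -5),
  is_hit       = c(1, 1, 0, 0, 1, 0)
)

# Bucket exit velocity into 5 mph bins and launch angle into 10-degree bins.
# (Bin widths are arbitrary choices for this sketch.)
bb$speed_bin <- floor(bb$launch_speed / 5) * 5
bb$angle_bin <- floor(bb$launch_angle / 10) * 10

# Average outcome of all batted balls that share a bucket.
bin_avg <- aggregate(is_hit ~ speed_bin + angle_bin, data = bb, FUN = mean)
names(bin_avg)[names(bin_avg) == "is_hit"] <- "est_avg"

# Attach the bucket average to every batted ball in that bucket.
bb <- merge(bb, bin_avg, by = c("speed_bin", "angle_bin"))
```

With real data you would want far more batted balls per bucket than this toy example has, or some smoothing across neighboring buckets.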
One item that is still not being released is horizontal spray angle on batted balls. Tango and the crew have said they will release that data at some point, but we don’t have it in this release.
You should also note that for some of the variables the types of values are a little different. For example, events and descriptions now have more machine-friendly values (i.e., lowercase codes without spaces). Take events: instead of “Grounded Into DP” we now have “grounded_into_double_play”. We also have null values in events where the pitch did not end the plate appearance. This is cleaner for analysis, but it might also break any old code you have, and lining up existing data files with the new ones for these columns will require a little more TLC.
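One way to line up old event labels with the new codes is an explicit lookup with a fallback, sketched below in base R. Only the “Grounded Into DP” pair comes from the example above. The fallback rule (lowercase, spaces to underscores) is a guess at the new convention rather than a documented rule, and the function name recode_events is made up for illustration.

```r
# Old-style to new-style event labels. Only this pair comes from the text;
# extend the lookup with whatever labels your own files contain.
event_lookup <- c("Grounded Into DP" = "grounded_into_double_play")

# Hypothetical helper: map known labels, guess at the rest, keep NA as NA.
recode_events <- function(events, lookup = event_lookup) {
  events <- as.character(events)
  known <- events %in% names(lookup)
  events[known] <- unname(lookup[events[known]])
  # Fallback for unmapped labels: lowercase and replace spaces with
  # underscores. This is an assumption, not a documented rule.
  unknown <- !known & !is.na(events)
  events[unknown] <- gsub(" +", "_", tolower(events[unknown]))
  events
}

recoded <- recode_events(c("Grounded Into DP", "Home Run", NA))
```

Checking the recoded values against a fresh export before stacking files is worthwhile, since any label the fallback gets wrong will silently fail to match.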
Here is some R code that you can use to calculate horizontal spray angle yourself, based on the spot where MLBAM stringers plot a batted ball as having been picked up by a fielder (the hc_x and hc_y variables in the export):
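A minimal sketch of that calculation follows. The home-plate offsets (125.42 and 198.27) and the 0.75 scaling factor are the values commonly used with this approach and should be treated as assumptions here, as should the function name. The coordinates in the example are made up.

```r
# Horizontal spray angle (degrees) from stringer hit coordinates.
# The offsets (125.42, 198.27) approximate home plate in hc_x/hc_y space and
# the 0.75 factor rescales the result; all three are assumptions of this
# sketch rather than documented constants.
spray_angle <- function(hc_x, hc_y) {
  round(atan((hc_x - 125.42) / (198.27 - hc_y)) * 180 / pi * 0.75, 1)
}

# Made-up coordinates: straightaway center, left side, right side.
df <- data.frame(hc_x = c(125.42, 60, 190), hc_y = c(100, 140, 140))
df$spray_angle <- spray_angle(df$hc_x, df$hc_y)
df
```

Negative values fall on the left-field side of second base and positive values on the right-field side.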
Is this perfect? No, but it is pretty good in the absence of official sensor-based spray angle data. Note that -45 degrees is the left field line and 45 degrees is the right field line. (Note also that this calculation was originally produced by Jeff and Darrell Zimmerman.)
Still another thing to note is that the umpire variable is currently not populating. This column normally contains the mlbamid of the umpire who was behind the plate for the pitch. Daren has mentioned that this should be fixed and retroactively populated soon.
Merging Your Old Files with the New Files
Finally, if you are looking for an easy way to merge data you’ve already downloaded from Baseball Savant with the new download format, here is a function in R that sets up your existing data for that. Basically, it takes your existing data, renames the variables whose names are changing, adds blank columns for the new variables, and then arranges the columns in the same order as the new download. Here’s the code (also available as a gist here):
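The three steps just described (rename, add blank columns, reorder) can be sketched in base R as follows. This is an illustrative stand-in rather than the gist itself: the function name align_old_export and its arguments are hypothetical, the crosswalk is passed in rather than hard-coded, and the toy column set is deliberately tiny, with estimated_woba as a made-up column name.

```r
# Reshape an old-format export so it can be stacked on a new-format one.
#   old_df    : data frame read from an old export
#   new_names : column names of a new export, in order (e.g. names(new_df))
#   lookup    : named character vector, old name -> new name
align_old_export <- function(old_df, new_names, lookup) {
  # 1. Rename columns whose names changed.
  hits <- names(old_df) %in% names(lookup)
  names(old_df)[hits] <- unname(lookup[names(old_df)[hits]])
  # 2. Add all-NA columns for variables that exist only in the new format.
  for (col in setdiff(new_names, names(old_df))) {
    old_df[[col]] <- NA
  }
  # 3. Keep only the new format's columns, in the new format's order.
  #    Old columns with no counterpart (such as pitch_id) are dropped.
  old_df[, new_names, drop = FALSE]
}

# Toy example with made-up values.
old_df <- data.frame(pitch_id = 1, start_speed = 93.1, hit_speed = 101.4)
new_df <- data.frame(launch_speed = 98.7, release_speed = 95.0,
                     estimated_woba = 0.412)  # hypothetical column name
lookup <- c(start_speed = "release_speed", hit_speed = "launch_speed")

combined <- rbind(align_old_export(old_df, names(new_df), lookup), new_df)
```

Taking the column order from an actual new export, rather than typing out all 75 names, means the helper keeps working if more columns are added to the end of the file later.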
It sounds like more changes are coming, but Daren has mentioned that, given the way they are setting things up, future changes will be easier to deal with: essentially, new variables will just be tacked onto the end of the file export. That should make lining up any existing data you may have even easier.
References & Resources
- Baseball Savant, Statcast Search
- Bill Petti, GitHub, baseballr Package
- Tom Tango, Tangotiger Blog, “Pitch velocity: new measurement process, new data points”
- Bill Petti, GitHub Gist, format_old_savant_output.R