Research Notebook: New Format for Statcast Data Export at Baseball Savant

The Statcast Search tool has undergone some recent changes. (via Baseball Savant)

Over the past few years, Daren Willman has made some of the pitch-by-pitch data generated by Statcast available on his Baseball Savant site. We have gotten only a taste of the kind of data that Statcast can produce, but even that taste is interesting and useful to work with.

The easiest way to get the data is through the Statcast Search query tool. After running a query (say, for all pitches thrown on April 24, 2017), you have the option of exporting the data as a comma-separated values (CSV) file. As of April 24, the query output has changed in some not-so-subtle ways. Daren was kind enough to share the changes he was making with me ahead of time, which allowed me to quickly update the scrape_statcast_savant series of functions in my baseballr package.

I thought it would be helpful to outline what those changes are for people who have been working with the original data exports and plan on working with the new ones.

Overall Changes

The old export included 60 variables. The new file, however, has 75 variables. In some cases, the export includes brand new variables. In other cases, some of the existing variables are being renamed. Of those being renamed, however, some will continue to be reported, but others will be deprecated and will not show values going forward. Let’s break these out into two separate lists, shall we?

Statcast Data Export Column Changes
Column Names Going Away    New Column Names
pitch_id                   release_speed
start_speed                release_pos_x
x0                         release_pos_z
z0                         spin_rate_deprecated
spin_rate                  break_angle_deprecated
break_angle                break_length_deprecated
break_length               plate_x
px                         plate_z
pz                         inning_topbot
inning_top_bottom          tfs_deprecated
tfs                        tfs_zulu_deprecated
tfs_zulu                   pos2_person_id
catcher                    launch_speed
hit_speed                  launch_angle
hit_angle                  pos1_person_id
                           pos2_person_id.1
                           pos3_person_id
                           pos4_person_id
                           pos5_person_id
                           pos6_person_id
                           pos7_person_id
                           pos8_person_id
                           pos9_person_id
                           release_pos_y
                           estimated_ba_using_speedangle
                           estimated_woba_using_speedangle
                           woba_value
                           woba_denom
                           babip_value
                           iso_value

Looking at these two lists I noticed a few things:

  1. start_speed looks to now be release_speed. (Note: start_speed from 2008-2016 was generated by PITCHf/x. release_speed is being generated by Statcast. For more on this, see the discussion at Tom Tango’s site here.)
  2. The break variables are going to be deprecated.
  3. hit_speed will become launch_speed and hit_angle will be launch_angle.
  4. The strike zone coordinates are changing from px, pz to plate_x, plate_z.
  5. The player IDs for each position will be added as separate variables.
  6. They appear to be including variables with values and/or estimates of things like wOBA, BABIP, etc., for batted balls given angle, speed, etc.

To make life easier, I put together a simple crosswalk between the old and new data exports to show which variables are being renamed:

STATCAST DATA EXPORT COLUMN CROSSWALK
OLD COLUMN NAMES           NEW COLUMN NAMES
start_speed                release_speed
x0                         release_pos_x
z0                         release_pos_z
spin_rate                  spin_rate_deprecated
break_angle                break_angle_deprecated
break_length               break_length_deprecated
inning_top_bottom          inning_topbot
tfs                        tfs_deprecated
tfs_zulu                   tfs_zulu_deprecated
catcher                    pos2_person_id
hit_speed                  launch_speed
hit_angle                  launch_angle
px                         plate_x
pz                         plate_z
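
The crosswalk can also be applied by column name rather than by column position. The following is a minimal sketch in base R and is not part of baseballr; the helper name rename_savant_columns and the savant_crosswalk vector are illustrative assumptions built from the table above.

```r
# Old-to-new column names, taken from the crosswalk table above.
savant_crosswalk <- c(
  start_speed       = "release_speed",
  x0                = "release_pos_x",
  z0                = "release_pos_z",
  spin_rate         = "spin_rate_deprecated",
  break_angle       = "break_angle_deprecated",
  break_length      = "break_length_deprecated",
  inning_top_bottom = "inning_topbot",
  tfs               = "tfs_deprecated",
  tfs_zulu          = "tfs_zulu_deprecated",
  catcher           = "pos2_person_id",
  hit_speed         = "launch_speed",
  hit_angle         = "launch_angle",
  px                = "plate_x",
  pz                = "plate_z"
)

# Hypothetical helper: rename any old-format columns found in df,
# leaving every other column untouched.
rename_savant_columns <- function(df, crosswalk = savant_crosswalk) {
  old <- names(df)
  hit <- old %in% names(crosswalk)
  names(df)[hit] <- unname(crosswalk[old[hit]])
  df
}

# Usage with a tiny made-up data frame
old_df <- data.frame(start_speed = 92.1, px = 0.3, pz = 2.4, zone = 5)
names(rename_savant_columns(old_df))
```

Matching by name means the rename still works if a file has its columns in a different order or is missing some of them, which positional renaming cannot guarantee.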

You will notice that for every column name that was “going away” there is a replacement, except for pitch_id. That appears to no longer be available, even as a deprecated column.

New Variables and Values

In terms of the new columns that aren’t simply replacements for some of the old columns, we get some fun new data to play with.

First, we are getting the mlbamids of the fielders at each position when the pitch was thrown. Now, we don’t get positioning data in the export (at least, not this year), but knowing who was playing where can be useful in many ways.

Second, the crew at MLBAM appears to be gearing up to release their own estimates of Weighted On-base Average (wOBA) and batting average (AVG) based on exit velocity and launch angle. The variables that start with estimated_ appear to show the average wOBA or AVG for batted balls with similar launch angles and exit velocities.

One item that is still not being released is horizontal spray angle on batted balls. Tango and the crew have said they will release that data at some point, but we don’t have it in this release.

You should also note that for some of the variables the types of values are a little different. For example, the events and description columns now have more machine-friendly values (i.e., lowercase codes without spaces). Take events: instead of “Grounded Into DP” we now have “grounded_into_double_play”. We also have null values in events where the pitch did not end the plate appearance. This is cleaner for analysis, but it might also break any old code you have, and lining up existing data files with the new ones for these columns will require a little more TLC.
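
One way to line up the events column across old and new files is a lookup table from old labels to new codes. This is a minimal sketch with a hypothetical helper, recode_events. Only the “Grounded Into DP” pair comes from the example here; the lowercase-with-underscores fallback is a guess that should be checked against the actual values in a new export.

```r
# Old label -> new code. Only this pair is confirmed by the article;
# extend the vector after comparing unique(df$events) in both formats.
event_lookup <- c("Grounded Into DP" = "grounded_into_double_play")

# Hypothetical helper: recode old-format event labels.
recode_events <- function(events, lookup = event_lookup) {
  events <- as.character(events)
  known <- events %in% names(lookup)
  # Fallback (an assumption): lowercase and replace spaces with underscores
  out <- gsub(" +", "_", tolower(trimws(events)))
  out[known] <- unname(lookup[events[known]])
  # Treat blank or missing labels as NA, matching the nulls in the new format
  out[is.na(events) | events == ""] <- NA_character_
  out
}

recode_events(c("Grounded Into DP", "Home Run", ""))
```

Keeping the confirmed mappings in one named vector makes it easy to add entries as mismatches turn up.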

Omissions

Here is some R code that you can use to calculate horizontal spray angle yourself, based on where the MLBAM stringers plotted the spot at which a batted ball was picked up by a fielder (the hc_x and hc_y variables in the export):

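# df is a data frame of exported Statcast data with hc_x and hc_y columns.
# The result is the approximate spray angle in degrees, rounded to one decimal.
spray_angle <- with(df, round(
  atan((hc_x - 125.42) / (198.27 - hc_y)) * 180 / pi * 0.75,
  1
))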

 

Is this perfect? No, but it is pretty good in the absence of official sensor-based spray angle data. Note that -45 degrees is the left field line and 45 degrees is the right field line. (Note also that this calculation was originally produced by Jeff and Darrell Zimmerman.)
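
Wrapping the calculation in a function makes it easier to apply to any data frame and to sanity-check the sign convention. This is only a sketch: the name calc_spray_angle and the coordinates below are made up for illustration.

```r
# Hypothetical wrapper around the spray angle calculation above.
# hc_x and hc_y are the hit coordinates from the export.
calc_spray_angle <- function(hc_x, hc_y) {
  round(atan((hc_x - 125.42) / (198.27 - hc_y)) * 180 / pi * 0.75, 1)
}

# Made-up coordinates: one ball toward left field, one straight up the
# middle, and one toward right field.
hits <- data.frame(hc_x = c(60, 125.42, 190), hc_y = c(120, 80, 120))
hits$spray_angle <- calc_spray_angle(hits$hc_x, hits$hc_y)
hits
```

With these inputs the first angle is negative, the second is zero, and the third is positive, matching the left-to-right convention described above.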

Still another thing to note is that, currently, the umpire variable is not populating. This column normally contains the mlbamid for the umpire that was behind the plate during the pitch. Daren has mentioned that this should be fixed and retroactively populated soon.

Merging Your Old Files with the New Files

Finally, if you are looking for a way to easily merge existing data you’ve downloaded from Baseball Savant with the new download format, here is a function in R that can set up your existing data to do that. Basically, it takes the current data, transforms the variables whose names are changing and adds in blank columns for the new variables, and then arranges the columns so that they are in the same order as the new download. Here’s the code (also available as a gist here):

library(dplyr)  # provides %>% and select()

format_old_savant_output <- function(df) {

  # New names for the 60 old-format columns, in the old export's column order
  updated_names <- c("pitch_type", "pitch_id", "game_date", "release_speed", "release_pos_x", "release_pos_z", "player_name", "batter", "pitcher", "events", "description", "spin_dir", "spin_rate_deprecated", "break_angle_deprecated", "break_length_deprecated", "zone", "des", "game_type", "stand", "p_throws", "home_team", "away_team", "type", "hit_location", "bb_type", "balls", "strikes", "game_year", "pfx_x", "pfx_z", "plate_x", "plate_z", "on_3b", "on_2b", "on_1b", "outs_when_up", "inning", "inning_topbot", "hc_x", "hc_y", "tfs_deprecated", "tfs_zulu_deprecated", "pos2_person_id", "umpire", "sv_id", "vx0", "vy0", "vz0", "ax", "ay", "az", "sz_top", "sz_bot", "hit_distance_sc", "launch_speed", "launch_angle", "effective_speed", "release_spin_rate", "release_extension", "game_pk")

  # Renaming is positional, so the input must have the old 60-column layout
  stopifnot(ncol(df) == length(updated_names))
  colnames(df) <- updated_names

  # Columns that exist only in the new export; filled with NA for old data.
  # plate_x and plate_z are not listed: they are the renamed px and pz and must keep their values.
  new_cols <- c("pos1_person_id", "pos2_person_id.1", "pos3_person_id", "pos4_person_id", "pos5_person_id", "pos6_person_id", "pos7_person_id", "pos8_person_id", "pos9_person_id", "release_pos_y", "estimated_ba_using_speedangle", "estimated_woba_using_speedangle", "woba_value", "woba_denom", "babip_value", "iso_value")

  df[, new_cols] <- NA

  # Drop pitch_id and arrange the columns in the order used by the new export
  df <- df %>%
    select(pitch_type, game_date, release_speed, release_pos_x, release_pos_z, player_name, batter, pitcher, events, description, spin_dir, spin_rate_deprecated, break_angle_deprecated, break_length_deprecated, zone, des, game_type, stand, p_throws, home_team, away_team, type, hit_location, bb_type, balls, strikes, game_year, pfx_x, pfx_z, plate_x, plate_z, on_3b, on_2b, on_1b, outs_when_up, inning, inning_topbot, hc_x, hc_y, tfs_deprecated, tfs_zulu_deprecated, pos2_person_id, umpire, sv_id, vx0, vy0, vz0, ax, ay, az, sz_top, sz_bot, hit_distance_sc, launch_speed, launch_angle, effective_speed, release_spin_rate, release_extension, game_pk, pos1_person_id, pos2_person_id.1, pos3_person_id, pos4_person_id, pos5_person_id, pos6_person_id, pos7_person_id, pos8_person_id, pos9_person_id, release_pos_y, estimated_ba_using_speedangle, estimated_woba_using_speedangle, woba_value, woba_denom, babip_value, iso_value)

  df
}

 

It sounds like there will be more changes coming, but Daren has mentioned that, the way they are setting things up, future changes will be easier to deal with: essentially, new variables will just be tacked onto the end of the file export. That should make lining up any existing data you may have even easier.
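
If future changes really do amount to new columns at the end of the file, older downloads can be padded generically instead of by hand. The following is a minimal base R sketch under that assumption; align_to_export and the column name some_new_column are hypothetical.

```r
# Hypothetical helper: pad an older data frame with NA columns for any
# names it lacks, then put the columns in the newer export's order.
align_to_export <- function(old_df, new_names) {
  extra <- setdiff(names(old_df), new_names)
  if (length(extra) > 0) {
    stop("Columns not in the new export: ", paste(extra, collapse = ", "))
  }
  for (col in setdiff(new_names, names(old_df))) {
    old_df[[col]] <- NA
  }
  old_df[, new_names, drop = FALSE]
}

# Made-up example: the newer file has one extra column at the end
older <- data.frame(game_pk = 1L, launch_speed = 101.3)
newer <- data.frame(game_pk = 2L, launch_speed = 95.0, some_new_column = 0.4)

combined <- rbind(align_to_export(older, names(newer)), newer)
combined
```

The helper stops with an error if the older file contains a column the newer one lacks, since that signals a rename or removal that needs a crosswalk rather than simple padding.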


Bill leads Predictive Modeling and Data Science consulting at Gallup. In his free time, he writes for The Hardball Times, speaks about baseball research and analytics, has consulted for a Major League Baseball team, and has appeared on MLB Network's Clubhouse Confidential as well as several MLB-produced documentaries. He is also the creator of the baseballr package for the R programming language. Along with Jeff Zimmerman, he won the 2013 SABR Analytics Research Award for Contemporary Analysis. Follow him on Twitter @BillPetti.
11 Comments
Dennis Bedard
6 years ago

Wow! I am living in a different universe. I remember circa 1967 when I could evaluate a player based almost solely on the way Topps depicted him with a bat or ball in his hands. Now I need a refresher course in advanced astrophysics to figure this out.

Jeff Zimmerman
6 years ago
Reply to  Dennis Bedard

I disagree. I have gone head deep into this data but every time it is to answer a question. Don’t go into the data looking for questions. First, start with a good question and maybe the answer lies in the astrophysical data. Probably not.

Concentrate on creating a good question and then go find the data. If you don’t know where to find the data, ask. Us writers get bored being alone in our mom’s basement.

Rally
6 years ago

Don’t mean to complain because this is great stuff, and I’m grateful that MLB is willing to share so much data. But one thing I noticed is the umpire field is null in the downloads. Last year the umpire ID was there.

Is that an oversight or an intentional removal?

James
6 years ago

I like to bring these files into a spreadsheet to play around with. with pitch_id and tfs removed I don’t see how one would get the pitch sequencing back into the correct order. sv_id looks similar to tfs but many events are missing a value in that field. Am I missing something?

James
6 years ago
Reply to  James

Any idea if either of these fields will be reinstated?

Don Hessey
6 years ago
Reply to  James

sv_id is the first record in time of the pitch recorded by pitchfx or statcast. That would be your unique id for the pitch, combined with the game_pk field you will have a unique pitch id. I don’t think you’ll want to include the records without a sv_id in launch angle or launch speed calculations as they look to be done by a stringer and not generated by the statcast software.

Jeff
6 years ago

Will the pitch values (vx0 vy0 vz0 ax ay az) be populated going forward?

Michael Liu
6 years ago

Great stuff. Just a quick question. Why do you multiply by 0.75 to find the spray angle? Thanks!

Bill
6 years ago

A possibly related question… anyone know what is being measured by hc_x and hc_y in the Baseball Savant data? What are the units?
