Wednesday, April 07, 2010
Gameday-PITCHf/x changes for 2010
Posted by Mike Fast
Every year brings changes and improvements to MLBAM's Gameday application, and many of them have some bearing on PITCHf/x and related analysis. Let me share with you the differences I've noticed so far in 2010.
First, Cory Schwartz of MLBAM notified some of us in February that some redundant information was going to be removed from the directory structure.
Folks, just wanted to give you a heads-up that we are deprecating the individual batter and pitcher .xml files published under these directories:
http://gd2.mlb.com/components/game/mlb/year_$YEAR/month_$MONTH/day_$DAY/gid_*/pbp/batters/
http://gd2.mlb.com/components/game/mlb/year_$YEAR/month_$MONTH/day_$DAY/gid_*/pbp/pitchers/
If you’re using any data in those files you should be able to get it from other files in the gd2 directories, but we no longer need or use these for any of our internal purposes or products. In addition, we are deleting the 2008 and 2009 files from our servers to free up the disc space for other content.
Fortunately, Dan Brooks has been able to adapt his site to the new structure for 2010.
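For anyone checking their own scripts against this change, here is a minimal sketch in Python of how the directory pattern in Cory's note expands for a given date. The function names are hypothetical, the zero-padding of month and day is my assumption about how $MONTH and $DAY are filled in, and the game id is left as a placeholder passed in by the caller.

```python
from datetime import date

GD2_ROOT = "http://gd2.mlb.com/components/game/mlb"


def gd2_day_url(day: date) -> str:
    """Build the gd2 directory URL for one day of games.

    Assumption: month and day are zero-padded to two digits.
    """
    return (
        f"{GD2_ROOT}/year_{day.year}"
        f"/month_{day.month:02d}/day_{day.day:02d}/"
    )


def deprecated_pbp_urls(day: date, gid: str) -> list:
    """Return the two per-player directories being removed for one game.

    `gid` is the game directory name (the `gid_*` part of the pattern);
    the caller supplies it, nothing here looks it up.
    """
    base = f"{gd2_day_url(day)}{gid}/pbp"
    return [f"{base}/batters/", f"{base}/pitchers/"]
```

A script that still requests either of these two directories is the one that needs to be pointed at other files in the gd2 tree.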
Ross Paul shared that he would be deploying pitcher-specific neural nets for MLBAM's pitch classification.
The Gameday PITCHf/x data also has a few new fields this year. In the at bat element, there is a new field called "start_tfs". This is a time stamp in the Eastern Time Zone. It matches up more closely with the actual time than does the sv_id time stamp, which can be a few minutes off. Cory tells me that this field wasn't intended for analysis and is used internally by MLBAM. Speculation is that this field may be used for syncing up the Gameday data with other data sources, such as video. Since it's there in the data, I wouldn't be surprised if someone finds an analytical use for it, too.
The pitch element has three new fields: "nasty", "zone", and "cc". The zone field appears to correspond to the location of the pitch based on the boxes into which the Gameday app divides the strike zone for its hot/cold zone graphics. The "cc" field is a comment field that appears to my highly-trained eye to be auto-generated, probably also based on the hot/cold zone information that MLBAM tracks. Here are some examples of the sparkling wit and insight produced by the auto-commenter:
A.J. Burnett didn't read the scouting report; Adrian Beltre loves four-seam fastball in that zone. (Apparently Ted Williams was right.)
A.J. Burnett didn't read the scouting report; Jacoby Ellsbury loves sinker in that zone.
A.J. Burnett didn't read the scouting report; Victor Martinez loves four-seam fastball in that zone.
Tim Lincecum has thrown 75 pitches; he holds opposing hitters to a .000 average in the first 75 pitches and .000 after that.
Vicente Padilla didn't read the scouting report; Jeff Clement loves curveball in that zone.
Vicente Padilla didn't read the scouting report; Lastings Milledge loves four-seam fastball in that zone.
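If you parse the Gameday XML yourself, a rough sketch of picking up the three new pitch fields might look like the following. The sample fragment is invented for illustration (real pitch elements carry many more attributes), the `atbat` element name and the integer typing of "nasty" and "zone" are my assumptions, and the function name is hypothetical. Reading the attributes with a default keeps the same code working on pre-2010 files, where they are absent.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the attribute values are made up, and the second
# pitch omits "nasty" and "cc" to mimic data without the new fields.
SAMPLE = """
<atbat num="1">
  <pitch des="Called Strike" pitch_type="KC" nasty="62" zone="8" cc="" />
  <pitch des="Ball" pitch_type="FF" zone="12" />
</atbat>
"""


def new_pitch_fields(atbat_xml: str) -> list:
    """Extract nasty, zone, and cc from every pitch in one at-bat fragment."""
    root = ET.fromstring(atbat_xml.strip())
    rows = []
    for pitch in root.iter("pitch"):
        nasty = pitch.get("nasty")  # None when the attribute is missing
        zone = pitch.get("zone")
        rows.append({
            "nasty": int(nasty) if nasty else None,
            "zone": int(zone) if zone else None,
            "cc": pitch.get("cc", ""),  # auto-generated comment, often empty
        })
    return rows
```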
The "nasty" field is presumably a crude attempt to calculate how hard a particular pitch was to hit, on a scale of 0-100. My initial cursory look at the data indicates that they are calculating the "nasty" factor mostly based on the location of the pitch, a linear calculation of how close it is to the edges and away from the heart of the zone. For the fastball, MLBAM does not appear to be factoring anything related to the movement or speed of the pitch into the "nasty" factor. For the curveball, they appear to be rating sweeping curveballs as significantly more nasty than 12-to-6 curveballs. Anyway, I'm not sure that any of this matters as more than a curiosity. As a sabermetric community we have much better approaches available for measuring the nastiness of a pitch.
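To make "a linear calculation of how close it is to the edges" concrete, here is a toy location-only score. To be clear, this is not MLBAM's formula, which I have not seen; it is only an illustration of the general shape I am describing. It assumes the usual PITCHf/x location fields px, pz, sz_top, and sz_bot, all in feet, and the function name and the choice to fall back to zero one zone half-width outside the edge are mine.

```python
PLATE_HALF_WIDTH_FT = 17 / 12 / 2  # home plate is 17 inches wide


def toy_location_score(px, pz, sz_top, sz_bot):
    """Toy 0-100 score based only on pitch location (NOT MLBAM's formula).

    Scores 0 at the heart of the zone, rises linearly to 100 on the edge,
    and falls linearly back to 0 one zone half-width beyond the edge.
    """
    mid = (sz_top + sz_bot) / 2
    half_height = (sz_top - sz_bot) / 2
    # Normalized distance from the center: 1.0 means exactly on an edge.
    x_norm = abs(px) / PLATE_HALF_WIDTH_FT
    z_norm = abs(pz - mid) / half_height
    d = max(x_norm, z_norm)
    return round(100 * max(0.0, 1.0 - abs(d - 1.0)))
```

Note that nothing about speed or movement enters a score like this, which is the limitation I see in the fastball values.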
A.J. Burnett throws a knuckle curve against the Angels in Game 5 of the 2009 ALCS. (Icon/SMI)
Finally, the MLBAM pitch classifications have introduced a new bucket this year: KC, the knuckle curve. I'm not sure why they did this. I suspect it has something to do with the scouting data they got for their training data, although I haven't asked Ross about it. For my own classifications, I do not classify the knuckle curve separately from other curveballs. I don't generally classify pitch types separately based on grip unless the grip differences actually produce substantial differences in spin movement (e.g., two-seam and four-seam fastballs). I don't classify palmballs, forkballs, circle change-ups, three-finger change-ups, and Vulcan change-ups separately. I do occasionally classify hard curves and slow curves separately when they are two distinct pitch types for the same pitcher, as they are for Roy Oswalt, for example. But the knuckle curve, also called the spike curve, moves just like other curveballs.
A.J. Burnett's curve is the only pitch that I've noticed so far that MLBAM is labeling a knuckle curve, which handily gives me an excuse to include an image of a pitcher's grip, one of my favorite topics.
Mike Fast is a Royals fan who enjoys investigating baseball questions using data of many sorts. He is a member of Complete Game Consulting. He welcomes comments via e-mail.








The pitch classifications have been pretty crude so far. Even more so than usual, it seems. For example, over the first two games, every David Robertson fastball (which sometimes cuts slightly) that had positive horizontal spin deflection has been misclassified as a curveball. Personally, I don’t mind going back and working on the classifications myself, but just out of curiosity, do you know if the algorithm was changed at all this year? In the early going, it’s been pretty rough.