Yes, we actually classified every pitch
by Dan Brooks
February 03, 2012
Long-term PITCHf/x data has always been difficult to find online. There are several existing sources available: Fangraphs has some of it, but not everything you might want (I’m sure they will tomorrow, just for that). Texas Leaguer has had a fantastic tool up for quite some time now, but there are still places where it lacks functionality. Josh Kalk used to have a wonderful website, but he moved on to the Rays. BrooksBaseball has never really had seasonal data per se, despite having data that spans each season.
We think that, generally, there are several reasons the data has been difficult to find in a long-term format. First, there is the technical limitation: the PITCHf/x dataset is large, with millions of pitches. This means that dynamic solutions (and most PITCHf/x systems are dynamic) must have very good caching systems, well-written databases, and powerful hosting solutions.
But beyond raw computing, which we can solve using some combination of duct tape and Moore’s law, there are really two critical issues that are unfortunately intertwined. The first of these issues is a data quality limitation. The PITCHf/x data is beautiful and we have nothing but exemplary things to say about the people who both collect it and make it free to access. I hope I do not overstep my bounds when I say that Cory and his team at MLBAM and Greg and his team at Sportvision have probably contributed more to baseball research simply through the availability of this data than all but the most accomplished Sabermetricians.
That said, there are simply issues with the data, most of which are due to park-specific camera quirks that make individual games more grokkable than complete months or seasons (see Chris Carpenter’s data in the World Series, both in Texas and in St. Louis, for an example).
The second of these issues relates to pitch classification. This has become progressively less and less of an issue as the brilliant minds at BAM have worked to improve their classification algorithms, which have gone from mediocre to damn good in a very short amount of time. Yet still, if you’re going to average across a set of data to say something about a set of labels, the quality of what you report depends heavily on your data labels.
These last two issues are related in the following way: Chris Carpenter’s data includes park-specific errors in both St. Louis and Texas, so the automatic algorithm classifies it differently in the two parks even though the data from each park is internally consistent. Quality issues therefore propagate through various parts of the system, and you really want a very qualified human to do the tagging. But good luck convincing THT's Harry Pavlidis or Lucas Apostoleris to tag three and a half million pitches, because that would be insane.
What’s that you say? They’ve actually done that?! By that, I mean individually tagged every pitch. This isn’t a very efficient solution, but it escapes the problems above by putting a human hand on the classification problem. When the cameras capture internally consistent data with park-specific quirks, Harry can find and adjust for those quirks and tag the pitches correctly. The raw numbers in the data aren’t changed, but the labels are, which solves at least part of the problem that exists in the dataset. It’s not a perfect solution, but it allows us to present you with an enormous database of properly tagged, seasonal PITCHf/x data.
Basically, this project started in the following way. Harry asked me to help write an automatic algorithm for detecting clusters of bad classifications, because humans make different kinds of mistakes than computers. Whereas a computer might misclassify Chris Carpenter because of Park Specific Camera Quirks, a human might misclassify CC Sabathia’s rare curveball as a slider because he didn’t think they were two pitches when looking at individual game datasets, though they become apparent when all of the data is shown at once.
I told him not to worry about writing an algorithm to find these problems, because the internet is beautiful. He asked if I was drunk. I said no.
And then, we arrived at the solution: put it online. Because no algorithm was going to be as good as human eyes, and people were going to want to see this.
This is a little script that displays seasonal PITCHf/x data using labels tagged by the THT writers. It includes over a dozen different ways to cut and slice the data, with some of the most common PITCHf/x plots (e.g., Horizontal Movement x Speed) and some uncommon ones (e.g., fancy trajectory plots with nice art).
For example, let’s take a pitcher: Clay Buchholz. Click on the name, or you can find him by navigating to BrooksBaseball.net and typing “Buchholz” in the search bar.
Let’s look a little at his changeup, which has always been a stellar pitch.
On the above chart that shows pitches separated by Horizontal Movement and Speed, the changeup (CH) is highlighted in blue, at about 80 mph with slight horizontal fade. A Trajectory and Movement table at the top of the page tells us 80.99 miles per hour on average, to be exact.
You might be interested to know how much it drops on its way to the plate. This graph shows that, with gravity included, the Buchholz changeup drops off the table by over 2 feet; the Trajectory and Movement table, again, tells us 26.01 inches!
One thing that’s incredible about Buchholz's changeup is how often it generates whiffs. The Pitch Outcomes table tells us that batters whiff 22.9 percent of the time when he throws it, a very high number. And the Sabermetric Outcomes table tells us that the Whiff/Swing on the changeup is 45.9 percent! That number is even higher for right-handed hitters (52.25 percent!) and, interestingly, higher still on 0-strike counts. When batters are expecting a fastball and Clay throws the change, there have been 91 recorded swings, and 59 of those (64.84 percent!!) have been whiffs.
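For anyone who wants to check that arithmetic, here is a minimal sketch of the Whiff/Swing calculation in Python. The function name `whiff_per_swing` is made up for illustration and is not part of the BrooksBaseball app. The implied swing rate at the end assumes that the 22.9 percent figure is whiffs per changeup thrown, which is my reading of the table rather than something stated outright.

```python
def whiff_per_swing(whiffs: int, swings: int) -> float:
    """Return whiffs as a percentage of swings (hypothetical helper)."""
    if swings <= 0:
        raise ValueError("swings must be positive")
    return 100.0 * whiffs / swings


# Counts quoted above for the changeup on 0-strike counts.
zero_strike_rate = whiff_per_swing(whiffs=59, swings=91)
print(f"0-strike Whiff/Swing: {zero_strike_rate:.2f} percent")

# Assumption: 22.9 percent is whiffs per pitch and 45.9 percent is whiffs
# per swing, so their ratio estimates how often batters swing at the pitch.
implied_swing_rate = 100.0 * 22.9 / 45.9
print(f"Implied swing rate: {implied_swing_rate:.1f} percent")
```

If that assumption holds, batters swing at roughly half of his changeups and miss on nearly half of those swings.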
What’s part of what makes it so good? The trajectories of the two pitches nearly overlap until the very last part of their flight, making it almost impossible to distinguish a change from a fastball.
And even when batters do put it in play, the pitch generates ground balls at a 44.7 percent clip.
Despite this dominance, it looks like he’s throwing fewer changeups as he moves to throwing more and more of his newfound cutter.
OK, so that wasn’t a very useful discussion—you all know Clay Buchholz has a great changeup. It was just meant to demonstrate some of the power of this little app.
Now that you’ve seen it in action, your mission, if you choose to accept it: Help us validate.
1) Run around like children in Wonka’s chocolate factory and consume as much data as possible (do not drink directly from the waterfall).
2) Using your knowledge of pitchers that you watch every day, let us know what looks wrong by starting threads on the BrooksBaseball.net Forums (we’ll have a thread for each pitcher, there are examples already there).
3) Use these graphs and tables however you want in any of your favorite blogs. Consult legal counsel first, and sign this waiver releasing us from liability.
4) Help us by sponsoring your favorite pitcher or two if you think what we’ve done is cool. This is an important step. Just like on Baseball Reference, you can add your own witty message to appear every time someone pulls up a card.
We really, really hope you enjoy this—we think it will be a great addition to the baseball resources on the Internet. Please don’t hesitate to contact us by leaving comments on this article, by leaving messages on the forums, or dropping us an email.
In addition to fantastic work by Harry and Lucas, a tip of the hat goes to Dustin Kikuchi.
Dan Brooks is a Neuroscientist at Brown University. He operates BrooksBaseball.net and eats Fried Chicken during every Red Sox game, especially in September. Come follow him @brooksbaseball.