Pitchers and swing area

by Josh Weinstock
November 18, 2011

Two weeks ago I introduced a method to calculate the area of a batter’s swing zone, which you can view here. I created the stat because O-swing—the percentage of swings on pitches outside the strike zone—is so… unsatisfying. O-swing gives us no information about how far a batter is chasing. Swing area, on the other hand, can tell us in easily understandable terms how large a batter’s swing zone is. A more detailed explanation is given in the article, but I calculate swing area by finding the area of the 22.4 percent swing contour.

While analysis of swing variance yielded intriguing results, swing area proved to be somewhat disappointing as a measure of plate discipline. For batters it measures the same skill as O-swing, and does not do a better job than O-swing of predicting walk rates.

Thanks again to Lucas Apostoleris for coming up with the idea to perform this kind of analysis.

But what if we apply this analysis to pitchers?

I re-ran the swing area analysis on all pitchers who threw at least 1,000 pitches in 2011, including playoffs, with the intention of getting as many pitchers as possible. The cutoff is arbitrary, but the calculations involve regression that can be pretty sensitive to outliers if the sample is too small. The 1,000-pitch cutoff also creates some sampling bias; the only pitchers who threw at least 1,000 pitches in 2011 were the ones who were either healthy or good enough to do so. To demonstrate the effects of this sampling bias, here is the distribution of strikeout rates among the pitchers in the data set:

See that long tail on the right side of the graph? That tells us that the distribution is skewed positively, which is something to keep in mind later when we look at the relationships between various plate discipline metrics and strikeout rate.

I should also note that, for the sake of transparency, I have slightly tweaked my calculation since I introduced it; I increased the number of bins that I was using, making the calculation a little more accurate.
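As a rough illustration of what "the area of the 22.4 percent swing contour" means, here is a minimal sketch; it is not the code used for this article. It assumes a fitted swing-probability surface is already available as a function of pitch location in feet. The surface `toy_swing_prob`, the helper `contour_area`, the grid bounds, and the reading of "bins" as grid cells are all assumptions made for the example. The sketch counts the grid cells whose swing probability is at or above the contour level and multiplies by the area of one cell, so a finer grid follows the edge of the contour more closely.

```python
import math

def contour_area(prob, x_min, x_max, z_min, z_max, n_bins, level=0.224):
    """Approximate the area (sq ft) of the region where prob(x, z) >= level,
    using an n_bins x n_bins grid over the given rectangle."""
    dx = (x_max - x_min) / n_bins
    dz = (z_max - z_min) / n_bins
    cells_inside = 0
    for i in range(n_bins):
        x = x_min + (i + 0.5) * dx          # horizontal centre of the cell
        for j in range(n_bins):
            z = z_min + (j + 0.5) * dz      # vertical centre of the cell
            if prob(x, z) >= level:
                cells_inside += 1
    return cells_inside * dx * dz

def toy_swing_prob(x, z):
    """Hypothetical swing-probability surface peaked at (0, 2.5) ft.
    Invented for illustration; a real surface would come from a fitted model."""
    return 0.9 * math.exp(-(x ** 2 + (z - 2.5) ** 2) / 2.0)

# Same surface, two grid resolutions.
coarse = contour_area(toy_swing_prob, -4, 4, -1.5, 6.5, n_bins=20)
fine = contour_area(toy_swing_prob, -4, 4, -1.5, 6.5, n_bins=400)

# For this toy surface the contour is a circle, so the exact area is known.
exact = math.pi * 2 * math.log(0.9 / 0.224)
```

Because the toy contour is a circle, `fine` can be checked against `exact`; the coarse grid misses by a larger margin, which is the kind of error that adding bins would reduce.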

Back to swing area. After running the calculations, here are the top five starters in swing area (square feet):

Carlos Carrasco 14.7
Josh Beckett    13.1
Kevin Correia   12.2
Douglas Fister  12.1
Jeff Niemann    11.5

A major surprise at the top of the list is Carlos Carrasco. His placing is surprising because he neither records a lot of strikeouts nor is rated well by O-swing. He may be a good example of some of the limitations of this calculation method: I found a decent number of his pitches in my database that were clearly PITCHf/x errors. There are not enough errors to make us very worried about the integrity of the data, but there are enough that we need to be mindful of the problem. Roy Halladay had the seventh-largest swing area.

The top five relievers who threw at least 1,000 pitches in 2011 are:

Cory Luebke     14.7
Drew Storen     14.2
Nick Masset     13.9
Heath Bell      13.6
Jonny Venters   12.9

The top five here are far less surprising. I wasn’t quite sure what group to put Luebke in as he started the year and finished as a starter, but most of his appearances came as a reliever so that’s where I kept him.

The top 10 overall pitchers are:

Cory Luebke     14.7
Carlos Carrasco 14.7
Drew Storen     14.2
Nick Masset     13.9
Heath Bell      13.6
Josh Beckett    13.1
Jonny Venters   12.9
John Axford     12.8
Jim Johnson     12.2
Kevin Correia   12.2

And the pitchers with the five smallest swing areas are:

Brad Penny      6.5
Fausto Carmona  6.5
Casey Coleman   6.5
Zachary Britton 6.4
Tyler Chatwood  6.2

The overall distribution looks like this:

The distribution suffers from the same skewness as the distribution of strikeout rates. This has the unfortunate effect of making it difficult to measure the center and spread of the distribution. In cases like this, the mean and standard deviation do not do a great job of describing the distribution, so we should use other measures. The median swing area is 8.3 square feet, the first quartile (the 25th percentile) is 7.6 square feet, and the third quartile (the 75th percentile) is 9.0 square feet. Much less skewness was present with hitters.
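To make the quartile summary concrete, here is a small sketch using only the three figures quoted above. The helper name `tukey_upper_fence` and the choice of the common 1.5 x IQR rule of thumb are assumptions made for illustration; the article does not say that any outlier cutoff was chosen this way.

```python
def tukey_upper_fence(q1, q3, k=1.5):
    """Upper outlier fence from the rule of thumb Q3 + k * IQR."""
    return q3 + k * (q3 - q1)

# Quartiles quoted in the article (swing area in square feet).
q1, median, q3 = 7.6, 8.3, 9.0

iqr = q3 - q1                      # spread of the middle half of pitchers
fence = tukey_upper_fence(q1, q3)  # values above this are flagged by the rule

# Quartile (Bowley) skewness: compares the upper and lower halves of the
# middle 50 percent. It is zero when the quartiles are symmetric about the median.
bowley = ((q3 - median) - (median - q1)) / iqr
```

Two things fall out of the arithmetic. The quartiles sit symmetrically around the median (0.7 on each side), so the skewness lives in the long right tail rather than in the middle of the distribution. And the rule-of-thumb fence comes out at about 11.1, which is close to the cutoff of 11 used in the comparisons that follow.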

Comparing to O-swing

With hitters, I found a very strong relationship between O-swing and swing area. With pitchers… not so much. The relationship between O-swing and swing area is weak, with a correlation coefficient of .42. To combat some of the effect of skewness, I looked at the relationship between the two variables only after excluding all swing areas greater than 11 square feet:

The relationship is much weaker than expected. This suggests that there may be a difference between the ability to get batters to chase outside the zone and the ability to get batters to chase far outside the zone. With hitters we found no evidence for two separate skills.

But which metric, O-swing or swing area, tells us more about a pitcher’s strikeout rate? After again removing outliers (swing areas > 11), we find a correlation of .376 between swing area and strikeout rate, with swing area significant at better than the 99 percent level. The coefficient of swing area is .019. This means that for every one-unit increase in swing area, we can expect a corresponding increase in strikeout rate of a little less than two percentage points. The R-squared is .14, meaning that swing area explains 14 percent of the variation in strikeout rate. O-swing explains just 2.4 percent of the variation in strikeout rate, which is lower than expected. O-swing is also statistically significant, with a p-value of .01 (for a two-sided test).
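For readers who want to see how the correlation, the coefficient, and the R-squared fit together, here is a minimal sketch of a one-variable least-squares fit. The function `simple_ols` and the toy numbers passed to it are invented for illustration and are not the article's data. The only real figures used are the reported .376 and .024, and the check relies on the fact that in a regression with a single predictor, R-squared equals the square of the correlation.

```python
import math

def simple_ols(xs, ys):
    """Least-squares line ys ~ intercept + slope * xs.
    Returns slope, intercept, Pearson r, and R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r, r * r

# Consistency check on the figures quoted in the article.
r_swing_area = 0.376
r2_swing_area = r_swing_area ** 2   # should round to the reported .14
r_o_swing = math.sqrt(0.024)        # size of the correlation implied by R-squared = .024

# Invented toy data, only to show the function being called.
toy_area = [7.0, 8.0, 9.0, 10.0]
toy_k_rate = [0.15, 0.18, 0.17, 0.22]
slope, intercept, r, r2 = simple_ols(toy_area, toy_k_rate)
```

The reported numbers are internally consistent: squaring .376 gives roughly .14, and an R-squared of .024 for O-swing corresponds to a correlation of only about .15 in absolute value.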

I also tested these two metrics against a measure of overall ability, in this case FIP. Both variables have a significant relationship with FIP, but swing area explains 15 percent of the variation while O-swing explains a little less than three percent. Both coefficients are negative, meaning that there is a negative relationship between both variables and FIP, which is to be expected: the more you can get a batter to chase, the better a pitcher you are. The coefficient of swing area is -.29. This means that for every one-square-foot increase in swing area, we can expect FIP to decrease by .29.
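One way to get a feel for the size of these coefficients is to apply them to the gap between a first-quartile and a third-quartile pitcher (7.6 versus 9.0 in swing area). The sketch below is plain arithmetic on the reported slopes; the helper `expected_change` and the constant names are made up for the example, and it assumes the fitted lines apply across that range, which lies below the outlier cutoff of 11.

```python
# Slopes reported in the article (change per unit of swing area).
K_RATE_SLOPE = 0.019   # strikeout rate, as a fraction of batters faced
FIP_SLOPE = -0.29      # FIP

def expected_change(slope, area_from, area_to):
    """Expected change in the outcome when swing area moves between two values."""
    return slope * (area_to - area_from)

# First quartile to third quartile of pitcher swing area.
delta_k_rate = expected_change(K_RATE_SLOPE, 7.6, 9.0)
delta_fip = expected_change(FIP_SLOPE, 7.6, 9.0)
```

That works out to roughly 2.7 percentage points of strikeout rate and about 0.4 of FIP. With R-squared values near .15, these are average tendencies across pitchers and not reliable predictions for any one of them.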

You can see both of these relationships below:

I also tested both O-swing and swing area against BABIP, and neither was close to significance. This is important because we often attribute a pitcher’s ability to induce weak contact in part to an ability to get batters to chase, but there is no evidence of this relationship.

Finishing thoughts

Swing area tells us much more than O-swing does about both a pitcher’s overall ability and the more specific ability to record strikeouts. It seems that how far the batter is chasing does contain valuable information, and that O-swing may not be so useful for evaluating pitchers: it tells us little about either strikeout rate or BABIP ability. But why is it that swing area is so much more important for pitchers than for hitters? I’ll expand on the explanation that I’m working on in a later post.

I have made the full results available here via Google Doc: spreadsheet

References & Resources
*PITCHf/x data from MLBAM via Darrel Zimmerman’s pbp2 database and scripts by Joseph Adler/Mike Fast/Darrel Zimmerman
*Strike zone definition from Mike Fast

12 Comments

Dave Studeman

12 years ago

Excellent stuff, Josh! This looks like a very useful tool for pitchers. Thanks, too, for the data.

Millsy

12 years ago

Really neat application. I’m curious about calculating the swing areas, as there may be an issue with comparability across individual swing zones (especially in smaller sample sizes). I’ve talked about this before, but I also know that Josh has read those caveats carefully.

How did you choose the smoothing parameters and how did you deal with outliers in the data in calculating the area (i.e. swinging at one pitch in the dirt, when there are only two of them thrown by a given pitcher)? This may be producing a large portion of the skew in your data, and if not properly accounted for, could result in some very large swing areas for the pitchers.

Millsy

12 years ago

PS – I did notice that you restricted the sample size, so maybe it’s less of an issue. 1,000 pitches can sometimes still be small, but probably sufficient for this purpose.

Are you cross validating your smoothing parameter for each individual pitcher?

Albert Lyu

12 years ago

Excellent work, Josh, as always.

Millsy, do you have Machine Learning experience? I’ve been learning a bit about how ML can help with automatically identifying the “just right” parameters for logistic regression models, and not sure how that would apply for Josh’s swing area models here, but deciding how to cross validate each set of data and identifying the right smoothing parameter can improve the results.

This is great stuff.

Josh Weinstock

12 years ago

I used the mgcv package with cross validated smoothing parameters (a method that I heard about from you, thanks!). I’m sure outliers are causing some of the areas to be larger than they should be, and I think that Carrasco is one of these.

Admittedly I’m not entirely sure how big of a problem this is, but I do not see any reason to believe that it’s a big problem with most pitchers. I messed around with the calculation methods a lot to see how it would change the swing areas, and none of the other methods I tried were able to create results that were very different from the numbers in the final spreadsheet.

Millsy

12 years ago

Albert,

I’ve worked with some machine learning algorithms in R (namely, Random Forests). However, mgcv (as Josh says above) is probably the best bet. Was just checking that out.

As for how big of a problem, I think restricting it to no less than 1000 pitches and using the Gaussian link (rather than binomial) makes the computation a bit safer. I think it’s a great application.

Out of curiosity, I’d be interested to see players’ rate of change in swing area across contours. Are these relatively constant across players?

Millsy

12 years ago

PS – Really cool, and given the results, it seems to be a totally reasonable use of the methods!

Josh Weinstock

12 years ago

Millsy,

I did actually use the logit link, and not the gaussian link. I am also not entirely sure what you mean by rate of change in swing area across contour.

Millsy

12 years ago

Hmm…maybe I’m not thinking clearly on the rate of change thing. Let me try to articulate it a little more later…

Bill Waite

12 years ago

It would be interesting to see the interaction between pitcher swing area and hitter swing area.

How do pitchers in the top quartile (highest swing area) do against hitters in the top quartile (lowest swing area)?

Are the great pitchers making all hitters chase a little bit more than they otherwise would, or are they just making the worst hitters chase farther away than other pitchers are willing to try?

MGL

12 years ago

Very nice work Josh! Very nice indeed!

Josh Weinstock

12 years ago

Thanks everyone.

And Bill, I agree, that’s definitely something worth looking into.
