Scouting the Minors Pitch by Pitch: Power

Chris Davis' power translated from the minors to the majors. (via Minda Haas Kuhlmann)

In the first installment of this series, we looked at swinging strike percentage to see whether we could predict major league talent using only pitch-by-pitch data aggregated by minor league stringers. Surprisingly, the data were pretty solid, at least at Double-A and above and after 2012. This research was largely influenced and inspired by Chris Mitchell’s KATOH.

Today, we will look for predictive signals within batted ball data and hope to uncover a couple of otherwise unknown potential power hitters. We’ll also be scrolling through myriad scatter plots, of which I am overly fond, as a tool to show correlations.

Housekeeping

As always, we begin with a look at the data at a macro level, to gauge its quality. All data herein are sourced from the MiLB pitch-by-pitch Gameday files and are subject to recording bias and human error; for this article, we will make only minimal attempts to smooth or correct the data or to filter out unreliable stadia.

I am still amazed at the sheer depth of talent that goes into the final product in major league baseball; there are seven levels of 30 teams, all staffed with extremely talented baseball players, of which only a select few ever amount to useful major league talent. In scouting parlance, it’s said that you can’t teach 80 power, thus we begin our research by looking at fly ball distance, to scout the stat line for 80 power.

HR + FB (OUT) DISTANCE
Class    2008    2009    2010    2011    2012    2013    2014    2015    2016    Grand Total
Majors   292.69  293.65  293.31  284.85  287.13  284.72  281.58  281.63  287.43  287.81
AAA      n/a     n/a     n/a     276.72  274.52  276.93  276.17  274.67  276.69  275.94
AA       n/a     n/a     n/a     270.48  267.65  267.11  265.91  265.62  269.24  267.67
A+       n/a     n/a     n/a     253.71  255.54  258.85  254.62  266.21  267.05  259.16
A        n/a     n/a     n/a     248.09  248.68  253.82  254.21  262.06  264.78  255.19
A-       n/a     n/a     n/a     242.52  241.06  243.63  242.90  262.30  264.23  249.34
R        n/a     n/a     n/a     244.24  242.01  243.65  239.40  265.52  262.30  249.70
SOURCE: MLB Advanced Media
Data are for trajectories flagged as fly balls, filtered to those caught or hit for a home run. Base hits are excluded since they are recorded where they are fielded rather than where they landed.

The first observation: measurement appears very consistent from Double-A to the majors, going all the way back to the first seasons for which data exist. For High-A and below, it looks like there was a change in recording methodology for the 2015 and 2016 seasons, so we will need to adjust the 2011-2014 seasons up to have consistent numbers to run correlations on. There are also a few instances where the play description indicates a fly ball but the result is recorded as a “ground out,” which gives me an excuse to throw up my least favorite type of viz, the pie chart:

 

pie-chart

By reading the text, you can make out that fly balls are classified as ground outs about 0.01 percent of the time, and by looking at the graphic, you can make out that the fly out is Pac-Man eating the other result types. In truth, I just wanted to throw a pie chart up to balance out all the scatter plots to come.
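
The upward adjustment of the 2011-2014 seasons at High-A and below is not spelled out in this article, so the following is only a minimal sketch of one way it could be done: scale the older seasons by the ratio of the 2015-2016 average to the 2011-2014 average. The High-A season averages are copied from the table above; the function names and the unweighted averaging of season means are assumptions for illustration.

```python
from statistics import mean

# High-A (A+) season averages copied from the distance table, 2011-2016.
HIGH_A = {2011: 253.71, 2012: 255.54, 2013: 258.85, 2014: 254.62,
          2015: 266.21, 2016: 267.05}


def era_scale_factor(season_means, old_years, new_years):
    """Ratio of the newer-era mean to the older-era mean (assumed adjustment)."""
    old = mean(season_means[y] for y in old_years)
    new = mean(season_means[y] for y in new_years)
    return new / old


def adjust_distance(distance, year, factor, cutoff_year=2015):
    """Scale distances recorded before cutoff_year; leave later ones untouched."""
    return distance * factor if year < cutoff_year else distance


# Hypothetical usage: bring a 2013 High-A fly ball onto the 2015-2016 scale.
factor = era_scale_factor(HIGH_A, range(2011, 2015), (2015, 2016))
adjusted = adjust_distance(255.68, 2013, factor)
```

A ratio keeps the relative spacing between hitters within a season, which is probably what matters for the correlations that follow; averaging season means without weighting by the number of batted balls is a simplification.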

Scatter Plots Galore

Let us begin with the forewarned scatter plots, with MLB on one axis, compared to every level below it.

MLB HR+FB Distance to AAA HR+FB Distance (Pitchers) | R2 = 0.01

mlbaaa-pitchers

We begin with pitchers, to demonstrate that we see correlations where we would expect to see them and none where we would expect none. In this instance, a pitcher’s ability to suppress fly ball distance in Triple-A has practically no predictive power for the majors, which is what we’d expect from the data.

MLB HR+FB Distance to AAA HR+FB Distance (Batters) | R2 = 0.50

mlbaaa-50p

We see a surprisingly strong correlation for power from Triple-A to the majors; the sample is filtered to hitters who have had at least 50 fly ball outs plus homers in both the majors and Triple-A. Joc Pederson and Chris Davis had far and away the most power of any hitters recorded in Triple-A since 2011, with guys like Munenori Kawasaki, Dee Gordon, Billy Hamilton and Jarrod Dyson filling the opposite end of the spectrum. I say surprising not because the correlation exists, but because the data are clean enough to produce such a strong signal. In fact, it produces a slightly better correlation than looking at HR/BIP%:

MLB HR/BIP% to AAA HR/BIP% (Batters) | R2 = 0.48

mlbaaa-hr

I would like to underscore this point. Minor league stringer data on where balls in play are caught or land for a home run correlate with batted ball distance in the majors to a greater degree than HR/BIP% does, all without any attempt to clean the data for anomalies.
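
For readers who want to reproduce this kind of comparison, here is a minimal sketch of the calculation behind each chart title: filter to hitters with enough fly ball outs plus homers at both levels, then square the Pearson correlation between the two average distances. The dictionary keys and the sample rows are made up for illustration and are not the article's data.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def level_r_squared(hitters, min_events=50):
    """R2 of MLB vs. minor league mean distance, keeping only hitters with
    at least min_events fly ball outs plus homers in both samples."""
    kept = [h for h in hitters
            if h["milb_n"] >= min_events and h["mlb_n"] >= min_events]
    return r_squared([h["milb_dist"] for h in kept],
                     [h["mlb_dist"] for h in kept])


# Made-up rows: mean distance and number of qualifying batted balls per level.
sample = [
    {"milb_dist": 280.0, "mlb_dist": 290.0, "milb_n": 60, "mlb_n": 70},
    {"milb_dist": 300.0, "mlb_dist": 304.0, "milb_n": 80, "mlb_n": 55},
    {"milb_dist": 260.0, "mlb_dist": 276.0, "milb_n": 90, "mlb_n": 100},
    {"milb_dist": 350.0, "mlb_dist": 200.0, "milb_n": 10, "mlb_n": 5},
]
strict = level_r_squared(sample, min_events=50)
loose = level_r_squared(sample, min_events=1)
```

Lowering `min_events` lets small, noisy samples into the calculation, while raising it shrinks the pool toward hitters who stuck around long enough to accumulate batted balls.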

Speaking of anomalies, let’s take a quick look at which Triple-A ballparks had the highest average distance:

aaa-ballparks

Clearly there are some material differences that we should probably be adjusting for, especially at Herschel Greer Stadium; just by filtering out Herschel Greer, the Triple-A/major league R2 improved from 0.50 to 0.52, which is a notable improvement given that it required filtering out only one stadium. The wide discrepancy across stadia suggests that there is potential to improve the model by adjusting for venue.

I did a quick and dirty adjustment controlling for park effect, and while it improved the model to 0.57, which is slightly better, this should really be done by mapping the minor league stadium images to actual physical dimensions, rather than using a one-size-fits-all approach. That would be quite an undertaking, so for now, we will live with the imperfect (unadjusted) data, which surprisingly still send quality signals.
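
The article does not say how the quick and dirty park adjustment was computed. As a rough sketch of what a one-size-fits-all version might look like, the function below shifts every recorded distance by the gap between its park's average and the overall average for the level. The input format and function name are hypothetical.

```python
from collections import defaultdict


def park_adjust(batted_balls):
    """Shift each distance so that every park ends up with the overall mean.

    batted_balls: list of (park, distance) tuples for one level.
    Returns a list of (park, adjusted_distance) in the same order.
    """
    overall = sum(dist for _, dist in batted_balls) / len(batted_balls)
    by_park = defaultdict(list)
    for park, dist in batted_balls:
        by_park[park].append(dist)
    park_mean = {park: sum(v) / len(v) for park, v in by_park.items()}
    return [(park, dist - (park_mean[park] - overall))
            for park, dist in batted_balls]


# Made-up example: park "A" records longer distances than park "B".
balls = [("A", 300.0), ("A", 320.0), ("B", 260.0), ("B", 280.0)]
adjusted = park_adjust(balls)
```

An additive shift like this treats all of a park's excess distance as recording bias. A park whose home team happens to be full of sluggers would be over-corrected, which is one reason mapping stadium images to physical dimensions would be the better approach.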

MLB HR+FB Distance to AA HR+FB Distance (Batters) | R2 = 0.22 to 0.29

mlbaa

I put a range of R2 values here since I had to dial back the pitch filter to show enough names. When it is set at 20 pitches per level, we get 0.22; when we crank it up to 50, it moves the needle to 0.29. If we set it too high, we introduce more survivor bias; too low, and we introduce too much noise, so the true signal is probably somewhere in between.

Either way, the names at the far right are very interesting: Soler, Baez, Harper, Springer, J.D. Martinez, Bryant, Santana, Schwarber and Goldschmidt, sprinkled in with Schebler, Parrino, Flaherty and Olt. Putting aside the objective R2 numbers and looking at this qualitatively, the success of the players at the top end of the distance spectrum is hard to ignore. Let’s take a closer look and change the color of the bubbles to black for batters who were 24.2 years old or younger while in Double-A and blue if they were older:

mlbaa-closeup

The hit rate on the top distance hitters being successful major league ball players is incredibly high, especially when you filter out the older players. For instance, I’d take a closer look at Scott Schebler, who may have more upside than many are ascribing to him. What happens if we take a similar approach and look at the Triple-A hitters again, but apply a filter that considers only Triple-A batters younger than 25?

mlbaaa-closeup

When we look at just the younger players, we get Flowers, Garcia, Pederson, Belt, Gyorko, Santana, Plouffe, Bogaerts, d’Arnaud, Bregman and Canha. Not quite as impressive as the Double-A list, but again, a very high success rate, given that we are looking at just one metric: distance on fly balls caught for outs or hit for home runs. Let’s keep an eye on the top end of the power spectrum as we move down the minor league ladder.

MLB HR+FB Distance to A+ HR+FB Distance (Batters) | R2 = 0.19 to 0.25

mlba

Let’s look at the top names again: Reed, Sano, Gattis, Piscotty, Correa, Lake, Vargas, White, Conforto (circle next to White) and Seager. Correlations be damned, being at the top end of the list puts you in a very select group of players. Trevor Story is a complete outlier in almost every chart he shows up in, which is more than just the Coors effect. Some young prospects (Pederson, Polanco, Soler) had mediocre power numbers at this level, which might suggest that success there is a very strong signal but that lack of success may not be. Let’s move on to Class A…

MLB HR+FB Distance to A HR+FB Distance (Batters) | R2 = 0.14

mlba

The correlation still holds, even as low as Class A ball. Here at the top end we see A.J. Reed, Baez, Vargas, Trea Turner, Willson Contreras and Rob Refsnyder, with Correa, Duvall, Bird and Kolten Wong following closely. However, there are a lot of significant names farther down, such as Sano, Yelich, Travis, Mazara and some dude named Bryce Harper, indicating that not being at the top end at this level doesn’t say as much as being at the top end does.

MLB HR+FB Distance to A- HR+FB Distance (Batters) | R2 = 0.04

mlba

R2 for Low-A is really low, but it does continue the trend of top-end guys turning out to be good major leaguers. Of the top guys, ignoring boom-or-bust Gallo, all have a wRC+ of at least 100 in the majors, including Richie Shaffer and Mac Williamson. There is a large element of survivor bias here, since we’re looking only at guys who have played in the majors. But if you make the majors, and you were among the distance leaders at your minor league level among those who did, you have a pretty good shot at being a very good hitter.

MLB HR+FB Distance to Rookie Ball HR+FB Distance (Batters) | R2 = 0.16 to 0.29

mlbr

I’m not buying the strong correlation, but I’m not totally discounting it either. Joey Gallo alone moved the R2 from 0.16 to 0.29, hence the range above. That is still a pretty impressive correlation, given that we’re talking about rookie ball and asking whether the data are (a) clean enough and (b) predictive enough to tell us something about what a player might look like in the major leagues, without knowing any of his outcome stats (such as home runs).

As we’ve gone down the various levels, we see a natural progression in the strength of the signal, from roughly 0.50 at Triple-A to 0.25, 0.20, 0.15, 0.04 and 0.20, which leads me to believe that we can potentially use all the data in the minors to predict a hitter’s power potential in the majors. It is also encouraging that we see the progression, since we’d expect the noise to increase and the signal to decrease the farther away from the majors we get. To build this simple model, I made a simple adjustment in which I made every level in the minors equal and threw that against the major league number to come up with this:

MLB HR+FB Distance to Minors HR+FB Distance (Batters) | R2 = 0.54

Distance (MLB) = 0.68 * Distance (MiLB) + 87

mlbminors
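
How the levels were made "equal" is not described, so the following is only a sketch under stated assumptions: each level's distances are rescaled to a common average (here, hypothetically, the Triple-A "Grand Total" from the first table), each hitter's levels are combined with a batted-ball-weighted mean, and an ordinary least squares line is fit against the major league number. The function names and row format are illustrative.

```python
def equalize_levels(rows, level_means, target_mean):
    """Put every level on one scale and return each hitter's weighted mean.

    rows: iterable of (hitter, level, mean_distance, n_batted_balls).
    Each distance is multiplied by target_mean / level_means[level]
    (an assumed normalization), then averaged per hitter by batted-ball count.
    """
    totals = {}
    for hitter, level, dist, n in rows:
        scaled = dist * target_mean / level_means[level]
        weighted_sum, count = totals.get(hitter, (0.0, 0))
        totals[hitter] = (weighted_sum + scaled * n, count + n)
    return {h: s / c for h, (s, c) in totals.items()}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Level averages from the "Grand Total" column of the first table.
LEVEL_MEANS = {"AAA": 275.94, "AA": 267.67, "A+": 259.16}
```

Feeding the equalized minor league distances and the matching major league distances into `fit_line` would yield a slope and intercept of the same form as the equation above.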

Projections!

Let’s take this model and project out all minor league hitters for whom we have fewer than 25 fly ball outs plus home runs recorded at the major league level.

 

TOP POWER PROSPECTS, AA & AAA, Born After Jan. 1, 1990
Batter              Birth Year  Distance in Minors  Projected MLB Distance
Dylan Cozens        1994        331.17              312.20
J.D. Davis          1993        323.75              307.15
Joey Gallo          1993        322.33              306.19
Matt Chapman        1993        320.03              304.62
Daniel Palka        1991        318.53              303.60
Peter O’Brien       1990        316.74              302.38
Adam Walker II      1991        314.95              301.17
Derek Fisher        1993        314.46              300.83
Rowdy Tellez        1995        313.24              300.00
Clint Frazier       1994        313.01              299.85
Tom Murphy          1991        311.80              299.02
Tyler O’Neill       1995        311.44              298.78
Telvin Nash         1991        311.34              298.71
Kevin Cron          1993        309.87              297.71
Chase McDonald      1992        309.84              297.69
Jon Kemmer          1990        309.07              297.16
Gabriel Quintana    1992        308.41              296.72
Juan Duran          1991        307.99              296.43
Ryan O’Hearn        1993        307.47              296.08
Jacob Schrader      1991        307.32              295.98
Mac Williamson      1990        306.84              295.65
Steven Moya         1991        306.50              295.42
Taylor Sparks       1993        306.40              295.36
Rhys Hoskins        1993        306.18              295.20
Cody Bellinger      1995        305.82              294.96
Bradley Zimmer      1992        305.30              294.61
SOURCE: MLB Advanced Media
Data are for trajectories flagged as fly balls, filtered to those caught or hit for a home run. Base hits are excluded since they are recorded where they are fielded rather than where they landed.
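
The projected column follows from the linear model stated earlier, Distance (MLB) = 0.68 * Distance (MiLB) + 87. Here is a small sketch that applies the published, rounded coefficients to a few rows of the table; the function name is illustrative.

```python
SLOPE = 0.68       # rounded coefficient as published in the article
INTERCEPT = 87.0   # rounded intercept as published in the article


def project_mlb_distance(milb_distance):
    """Apply the article's linear model to a minor league fly ball distance."""
    return SLOPE * milb_distance + INTERCEPT


# Minor league distances copied from the table above.
minors = {"Dylan Cozens": 331.17, "J.D. Davis": 323.75, "Joey Gallo": 322.33}
projections = {name: project_mlb_distance(d) for name, d in minors.items()}

for name, projected in projections.items():
    print(f"{name}: {projected:.2f}")
```

Because the published slope and intercept are rounded, the results land within a couple of hundredths of the table values rather than matching every row exactly.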

I linked some of the top guys to their FanGraphs page (and one guy to his NBA page) so you can read more about them. Rowdy Tellez jumped out at me from this list due to his combination of youth and success. Dylan Cozens not only hit 40 homers last year, but also tops this list and received a plus-plus power rating from Eric Longenhagen; safe to say the dude has legit power, the question being whether it will play in the majors. Based on the observations above, if he does make the majors, he stands to be a very, very good player.

If you’ve never heard of Kevin Cron, shame on you – FanGraphs has a sum total of zero articles written about him; these data may suggest there is some heretofore unknown upside to the 6-foot-5, 245-pound first baseman. Tyler O’Neill is intriguing for his power/youth combo, was written up by Paul Sporer and is beloved by KATOH. Daniel Palka wasn’t very highly thought of by Dan Farnsworth; however, KATOH had him at No. 93, which indicates he may be one of those guys who sneaks up on scouts.

Tom Murphy strikes me as an interesting fantasy pickup, since he’s a catcher in Colorado with minor league distance numbers that suggest he can mash. Taylor Sparks might seem like a good bet, but he posted a 61 wRC+ in Double-A with a 34% strikeout rate, so he’s a poor man’s Joey Gallo, which is not a good thing. Given the glowing scouting report by Eric Longenhagen, backed up by solid distance data here, I’d be inclined to believe in Cody Bellinger’s future a little bit more than I otherwise would have.

Concluding Thoughts

The data are extremely compelling in their depth and surprising quality; we continue to see signals in places where we would expect to see signals and none where we wouldn’t expect any. What the above data suggest is that top prospects who back up their numbers with good fly ball distance become very good major league ball players. Using distance cuts through the noise of ballparks to an extent, which may be why it produces slightly stronger signals than just using outcomes such as homers. As this develops to include park factor adjustments, pull percentage and quality of opposition adjustments, we may be able to paint a more robust prospect picture, simply by scouting the minor league pitch-by-pitch stringer data.

References & Resources

This research was largely influenced and inspired by Chris Mitchell’s KATOH. KATOH uses aggregate data which are far more reliable than granular, manual minor league pitch-by-pitch data, which may create discrepancies in our results. Additionally, the pitch-by-pitch data used herein are constrained to the 2008-2016 time frame at the maximum and shorter time frames for certain metrics.

 


Eli Ben-Porat is a Senior Manager of Reporting & Analytics for Rogers Communications. The views and opinions expressed herein are his own. He builds data visualizations in Tableau, and builds baseball data in Rust. Follow him on Twitter @EliBenPorat, however you may be subjected to (polite) Canadian politics.
Nick
7 years ago

“I’d be inclined to believe in Cody Bellinger’s future a little bit more than I otherwise would have.”

I see your linear model has made a great discovery in scouting. Send this to the Diamondbacks so that they will acquire Bellinger for Cron.

MGL
7 years ago

Eli, good work. What we really want to know is whether FB distance is (significantly) BETTER at predicting power than other rates (like HR or x-base hits). You mention that the r^2 is only slightly better for distance than for HR/BIP which suggests that maybe it isn’t. I’d like to see a multiple regression r^2 (i.e., does FB distance add anything once you know HR/BIP and vice versa) or any other statistical analysis which sheds light on that question.

I would also like to see MLB HR/BIP for 2 groups: One where the HR/BIP in MiLB is the same but the FB distances are low and high. Then do the same for the same FB distances but the HR/BIP differ. This is just a “poor-man’s multiple regression” (where we compare the coefficients) but I always like to do it to get a more transparent look at what’s going on.

IOW,

Group I:

HR/BIP is, say 6-8% (low HR) in the minors but FB distance is say, >260.

Group II:

HR/BIP is, also 6-8% (low HR) in the minors but FB distance is <260.

Now we look at MLB distance and HR rate for the two groups. If distance adds information we should see significantly different HR rates for the 2 groups.

Do the same thing for similar distances but different HR rates in the minors. If those also result in significantly different HR rates (and distance perhaps) in the majors, then that's evidence that HR rate adds important information.

I suspect that either one will predict power in the majors but that using both is not much additional help. IOW, r^2 using both is not going to be much higher than r^2 for each one individually (around .5 using the PA cutoffs you did).

MGL
7 years ago
Reply to  Eli Ben-Porat

Remember that just predicting FB distance is not an end point. The end point is predicting HR or other offensive rates. Presumably FB distance in the majors IS predictive of some offensive rates but we are not 100% certain. So ultimately if minor league FB distance is to be interesting, it has to be better (than minor league offensive rates like HR/BIP) at predicting some major league offensive rates and not just major league FB distance.

So if you find that a MARS regression raises the R^2 for predicting major league FB distance but not for predicting HR/BIP or any other offensive rate, then it really isn’t useful for anything.

MGL
7 years ago
Reply to  Eli Ben-Porat

Interesting. Good work. FWIW, I realize the importance of a “pure” measure like distance. However, as you say, it ultimately only has utility to the extent we use it to predict some aspect of offense. One must be a little careful in going from distance to HR (or x-base hit) rate. The relationship is likely not linear and might even be a hinge function as your MARS model suggests. The reason it “thinks” only distances above a certain amount are predictive is simply because distances below a certain amount cannot be HR in any stadium and in any conditions. There is no difference in terms of HR/BIP prediction btwn a 200 foot FB and a 250 or 150 foot one. There is however, a big difference btwn 370 and 400. And that these are means you are working with throws another wrench into the analysis. All HR are tail ends of a player’s distance distributions. I’m not versed nearly enough in stats to know how that manifests itself in terms of a regression function (of mean distance on HR rate) or even what the typical distance distribution looks like (I assume it is normal).

obsessivegiantscompulsive
7 years ago

Mac Williamson!!!

pelgudmir
7 years ago

The development of power has always been a strange thing, to me. I know that power is a tool that develops late for a lot of players, but it is still strange to me that we don’t see more guys putting up big power numbers in the minors. I suppose guys with big big power are selected out and never play in the minors long enough to hit their potential numbers. Another thing… It seems like when someone hits .290 with 15 home runs in the high minors in their early twenties, we’ll often see them post a monster year in the majors where they far outperform their previous homer numbers – is power suddenly developing, are pitchers just supplying that much more of the power, is the ball that much more lively in the majors, or what? Just curious on the general consensus.

MGL
7 years ago
Reply to  pelgudmir

Ball construction as well as parks do differ, but the main reason for the dynamics you mention is what you said. Power develops late for most players, based on the development of size and strength (and to some extent learning the strike zone, etc.). Most importantly, as you said, a player who had prodigious power in the minors would most likely be in the majors. There is a huge selective element. Put Bautista and Stanton back in the minors and they would hit 50 HR a year I think. Also, there is a limit to power regardless of the quality of the pitcher.

obsessivegiantscompulsive
7 years ago
Reply to  pelgudmir

This comment brings up a good point about age, and its effect on power and thus HRs. I don’t know how easy it is for you to implement this with this study (perhaps your next one? As this is a pretty interesting start to studying this effect), but I wonder how including age in your regression would affect your projections.

As excited as I was to see a Giants prospect in the list (given our desire for power), Mac was one of the oldest prospects on your list. Adjusting for age could move him from near the end of the list to below, possibly, as younger players bump up higher as he bumps down lower, due to age.

Then again, marginal gains due to age level off, so perhaps including age would more push up the youngest players on the list, like Tellez, O’Neill, Bellinger, than push down the older prospects.

I think this would be an interesting variable to include into your mix for this excellent study, I look forward to your follow ups, whenever they may come.