For someone who writes about fantasy baseball, ADP (Average Draft Position) is a fun statistic. For instance, doing something as simple as graphing ADP against itself can visualize some aspects of what occurs during a draft. This ADP data, by the way, are from Yahoo drafts for the 2008 season, meaning these drafts occurred before the season began.

The interesting part of this graph is not where the dots are located, but their distance from each other. Noticing how they are relatively bunched at the edges and less dense in the middle reinforces my sentiment in this article—that drafting in the middle rounds is the most difficult.

Fantasy baseballers cannot agree where to take players in these rounds and therefore few players end up with an average draft position in the 100s. Because it is more of a “who” to take rather than a “where” at the end of a draft, you end up with the clustering after the 200 ADP mark that you see.

Ostensibly the reason people drafted these players where they did is because of the stats these players accumulated in the previous year. Comparing a player’s 2007 numbers with his 2008 ADP can provide us with some insight into which of the fantasy stats we target the most in drafts. Before we get buried in numbers, though, let’s first look at some graphs starting with home runs, since I figure they will be an important determinant.

### Graphs

This graph shows us that it is not imperative to hit a ton of home runs to be taken early, as depicted by the dots toward the lower left of the graph. Also, hitting around 25 home runs seems to be the magic number to get a hitter out of the 200+ ADP cluster, and from there a nicely defined linear slope brings us to Alex Rodriguez’s 54 home runs in 2007 and his corresponding 1.2 ADP in 2008.

Next we will look at stolen bases, which might present a graph that looks radically different from the plateau-shaped home run graph.

This graph actually looks somewhat similar to the home run graph; it features the same basic shape except with more players on the left extreme and fewer to the right one. Simply looking at the graph, though, the dispersion appears more random, whereas on the home run graph there was a more visible downward slope.

Even more random than the stolen bases graph is the one comparing batting average to ADP.

Since batting average is a rate stat, I increased the at-bat threshold to 400 to eliminate possible fluky batting averages attained over a couple of hundred at-bats. Despite that, a player’s batting average appears to have a small effect on where he is drafted. Intuition tells me there must be some degree of correlation, but compared to home runs and stolen bases it appears to be small.

Last we will look at the graph of runs, which appear to correlate well with next year’s ADP, although later we will find out that may not be the case.

As you can see there is a well-defined, generally downward slope to the right, suggesting a correlation. Sometimes with graphs looks can be deceiving, as the next section will show.

### Regression

Looking at pretty graphs is nice, but let’s not get distracted from the purpose of the data. What the data can tell is which of the five main fantasy stats have the largest impact on where a player gets drafted in the following year. For this I used a multivariate regression, two multivariate regressions actually—one using the stats as counting stats with average converted to hits, and the second with them as rate stats, so for example home runs became home runs per at-bat. The results of the regressions are summarized in the following tables.

+-----------------------------------+
|           ~ COUNTING ~            |
+------+--------------+-------------+
| Stat | Coefficients | P-value     |
+------+--------------+-------------+
| Int. | 370.6356     | 1.6768 E-31 |
| R    | -0.3829      | 0.34664     |
| HR   | -2.2503      | 0.0093      |
| RBI  | -1.1258      | 0.0030      |
| SB   | -2.1020      | 8.6056 E-07 |
| Hits | -0.3875      | 0.1504      |
+------+--------------+-------------+

For the coefficients column, a lower (more negative) coefficient means the stat is more significant. So in counting form home runs edge out stolen bases as the most significant, with runs and hits the least important. The “P-value” column shows the significance of the coefficient, with anything under .05 considered statistically significant, meaning home runs, RBI, and especially stolen bases pass the significance test. As I hinted before, runs were extraordinarily insignificant compared to the other stats.
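To make the .05 cutoff concrete, here is a minimal Python sketch that applies it to the p-values in the counting table. The names `P_VALUES` and `ALPHA` are just labels chosen for this illustration. A value written as 8.6056 E-07 is scientific notation for 0.00000086056.

```python
# P-values copied from the counting-stats table in this article.
P_VALUES = {
    "R": 0.34664,
    "HR": 0.0093,
    "RBI": 0.0030,
    "SB": 8.6056e-07,   # "8.6056 E-07" in the table
    "Hits": 0.1504,
}

ALPHA = 0.05  # the conventional cutoff used in the text

# Split the stats by whether they fall under the cutoff.
significant = [stat for stat, p in P_VALUES.items() if p < ALPHA]
not_significant = [stat for stat, p in P_VALUES.items() if p >= ALPHA]

print("Pass the test:", significant)
print("Fail the test:", not_significant)
```

Running it puts home runs, RBI, and stolen bases on the passing side and leaves runs and hits out, matching the reading of the table.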

+-------------------------------------+
|              ~ RATE ~               |
+--------+--------------+-------------+
| Stat   | Coefficients | P-value     |
+--------+--------------+-------------+
| Int.   | 550.6223     | 4.2287 E-24 |
| R/AB   | -97.9406     | 0.6615      |
| HR/AB  | -1523.8494   | 0.0019      |
| RBI/AB | -608.0605    | 0.0045      |
| SB/AB  | -1578.2072   | 7.5421 E-10 |
| AVG    | -833.7461    | 2.9494 E-05 |
+--------+--------------+-------------+

Once again home runs and stolen bases jump out as the big players, with batting average, not surprisingly, rising in importance since this is its home court, so to speak. And once again runs display their general lack of relevance.
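For readers curious how coefficients like these are produced, the following is a rough sketch of a least squares fit in plain Python. It assumes the regressions were ordinary least squares, which the article does not state, and it is not the code that was actually used. The stat lines are synthetic and noise-free, the "true" coefficients are made-up round numbers, and `fit_ols` is a hypothetical helper name. The point is only to show a fit recovering the coefficients that generated the data.

```python
import random

def fit_ols(rows, targets):
    """Least squares fit via the normal equations (X'X) b = X'y.

    rows: one list of predictor values per player.
    targets: the value being predicted (here, ADP) for each player.
    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic stat lines in the order R, HR, RBI, SB, Hits. NOT the Yahoo data.
random.seed(0)
TRUE_COEFS = [300.0, -0.5, -2.0, -1.0, -1.5, -0.25]  # made-up values
stat_lines = [
    [random.randint(40, 120), random.randint(0, 50), random.randint(30, 130),
     random.randint(0, 60), random.randint(80, 210)]
    for _ in range(200)
]
adp = [TRUE_COEFS[0] + sum(c * s for c, s in zip(TRUE_COEFS[1:], line))
       for line in stat_lines]

fitted = fit_ols(stat_lines, adp)
print([round(b, 4) for b in fitted])
```

Real draft data is noisy, so a fit on it would not reproduce any underlying values this cleanly, and the p-values in the tables require a statistics package rather than this bare-bones solver. For the rate version, each counting stat would be divided by at-bats before fitting.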

The one part of these charts I have failed to mention yet is the coefficient of the intercept. The fun activity you can do with these is create a rough estimate of where a player will be drafted given his stat line for a season. Multiplying a player’s stats in each category by the size of that category’s coefficient, adding those numbers up and then subtracting the total from the intercept coefficient will generate a rough estimate of that player’s ADP. For example, if you took Todd Helton’s 2007 line of 86 runs, 17 homers, 91 RBI, no stolen bases, and 178 hits and plugged it in:

Estimated ADP = 370.6 – (86 * .3829) – (17 * 2.25) – (91 * 1.1258) – (0 * 2.1) – (178 * .3875) = 128.0

Helton’s estimated ADP of 128.0 is remarkably close to his actual ADP that year of 135.4 given the crudeness of the model (using only one year of data from one website) and the fact that it does not take into account any positional adjustment. This model worked well for this set of data with an R-Squared of .8, but that is not overly surprising considering the model was created off the 2007 season-2008 ADP data. At this point this ADP model probably will not work tremendously well for the 2009 season stats, but given a few more years of data added it could become an interesting tool for leagues that draft early in the offseason, or for some historical context on a player’s ADP.
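The estimate is easy to turn into a small function. This is a minimal Python sketch using the intercept and coefficients from the counting table. The names `INTERCEPT`, `COEFFICIENTS`, and `estimated_adp` are my own labels for illustration. Because the coefficients are negative, adding each product to the intercept is the same as the subtraction written out above.

```python
# Intercept and coefficients copied from the counting-stats table.
INTERCEPT = 370.6356
COEFFICIENTS = {
    "R": -0.3829,
    "HR": -2.2503,
    "RBI": -1.1258,
    "SB": -2.1020,
    "H": -0.3875,
}

def estimated_adp(stat_line):
    """Rough ADP estimate: intercept plus each stat times its coefficient."""
    return INTERCEPT + sum(
        COEFFICIENTS[stat] * stat_line[stat] for stat in COEFFICIENTS
    )

# Todd Helton's 2007 line as given in the text.
helton_2007 = {"R": 86, "HR": 17, "RBI": 91, "SB": 0, "H": 178}
print(round(estimated_adp(helton_2007), 1))
```

With the table's coefficients, Helton's line comes out at roughly 128. Nothing in the formula keeps the result at 1 or above, so a big enough stat line can produce an estimate below the first pick.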

### Concluding thoughts

I know this article mostly confirms what we might have already suspected, that home runs and steals are the most significant when it comes to determining ADP, rather than providing us with new information, but there are still lessons to be taken away.

First, the insignificance of runs in the regressions points to a possible inefficiency in the fantasy marketplace. People most likely assume runs are a byproduct of other skills and ignore them when ranking players. A system that would take into account position in batting order, team runs per game, and of course the player’s skill level could more accurately predict expected run totals and make rankings more accurate.

The xADP model I debuted is something that could become a powerful fantasy tool given a few more years of ADP data, and hopefully you saw a glimpse of that.

I’ll end with a confession and display of gratitude to colleague Nick Steiner, who ran the multivariate regressions that spewed out the coefficient values that were instrumental to this article. I am more statistically illiterate than you might assume and do not have the savvy to run such regressions. I owe a big thanks to him for his time and effort.

Kyle said...

Great blog… Why do you say the 2009 stats will not work in this formula? How did you come up with the coefficients? I have 10 batting and 10 pitching categories and would like to determine what I should be going off of. I would assume it would look similar, but would like to know how some of the other categories affect our league. Thanks

Millsy said...

I think this is interesting, but do you have any qualms about the fact that ADP is EXTREMELY non-independent (independence being a crucial assumption in running a regression)? ADP is a rank-based measure, and I’m not convinced a simple multivariate regression is sufficient. For every one place someone moves up, another moves down.

In addition, the multi-collinearity could be a problem for the ‘Runs’ measure, resulting in your strange p-value for that coefficient, despite the obvious relationship in the scatterplot. I’m sure Runs are a secondary component of ‘skill’, but if that’s the case, then I’m not sure it’s all that useful in the regression (but I leave that up to the person putting it in, and the correlation between Runs, HR, and AVG).

It can still be kept in there for predictive purposes; however, making inferences about the coefficient would be troublesome (and it may become erratically changing when something in the model is changed slightly). Multicollinearity is an over-hyped problem in my opinion, but it almost definitely poses one here with your Runs coefficient.

Millsy said...

Sorry, one more qualm. Since ADP is obviously truncated at Pick 1 (and arguably capped depending on the number of players needed for your league), did you run any sort of Tobit model, or just a simple regression? If it’s a simple multivariate regression, then the coefficients are going to have problems at the extremes of ADP (and likely consider Albert Pujols a negative ADP).

Jeff Z said...

One point on the wide gaps in the middle game could be teams filling up positions later and reaching. What might be nice is the ADP when each team has drafted a position.

Nick Steiner said...

Millsy,

I was the one who helped Paul with the regression part of this, so I can probably answer your questions.

1) Yes, I understand that’s a pretty big problem with the fact that ADP is a rank based system instead of a value. However, I wasn’t really sure how to get around that. Do you have any suggestions?

2) I think runs are obviously going to have a ton of multicollinearity. Runs are basically the byproduct of home runs, stolen bases and OBP (which is generally going to mirror around batting average). BTW, if I run the regression using rate stats without Runs/AB, I get the same R-Squared and similar coefficients, so it appears that it really doesn’t have much of an effect.

Millsy said...

Thanks for the response, Nick.

I’ve been looking into some methods, as we had some plans to mess with this idea over at Fantasy Ball Junkie. I think it’s a really neat tool to use in general, and there’s lots of room for improving it. Right now, I’m not sure exactly how to get around the dependency issue in the ranks (though I’m looking into it this next week or so). However, I don’t think it completely damns the use of regression here.

As for the runs, our editor at FBJ and I have been discussing the best way to deal with them. I’m curious if there would be a difference (in the R coefficient itself) if you used Runs – HR, or ‘HR independent Runs’. Or maybe another variant of that with Hits (since the correlation there is really high). I don’t know the answer, honestly, and the issue with multicollinearity may not be a big one, as I said it often gets overemphasized.

It’s very interesting that it doesn’t give us much information itself, and I know some people have found this same thing about runs in the past. So, in the end, I could be 100% wrong and Runs truly are undervalued in fantasy. I’m curious if that is a psychological thing going into the draft, as people for some reason don’t emphasize small run differences between players enough.

Just curious about the censoring problem with ADP, too. That’s a pretty easy improvement I think. If we take the Pujols lines from 2007 to 2009 we get ADP’s of:

2008 xADP: 73.37

2009 xADP: 49.33

2010 xADP: -27.91

(I was simply plugging in different seasons, and I understand the model isn’t built for later years)

Or, as another example, Ryan Howard looks like this:

2008 xADP: 18.64

2009 xADP: -3.34

2010 xADP: -13.04

The 2008 looks pretty solid, and maybe things change when you use the 2008 and 2009 data instead of 2007 for the entire regression estimate, but I think a truncation is necessary.

Overall, I really like the idea!

John K said...

Good article – I like the idea of this piece a lot.

It’s not correct to simply look at the magnitude of the coefficient and make a claim about a regressor’s “importance.” For instance, I could apply a monotonic transformation to your RBI number, get exactly the same significance and R^2, but a larger coefficient. The fact that the mean RBI/AB is higher than that of HR/AB will influence the magnitude of your coefficient. That is not sufficient for a conclusion that HRs are more “important” in determining ADP.

Perhaps it would be more helpful if you multiplied the coefficient by a one standard deviation change in each of the (significant) regressors. I might also suggest a series of F-tests.

John K said...

I’m saying your interpretation is off, not just the terminology.

Separately, skimming over the other comments I think the issue of the negative ADP forecasts is a false concern. All you need from a forecast of ADP is an ordinal ranking. Why you would want to use a regression to forecast ADP is another question!

Millsy said...

John K,

It’s not a false concern, IF you’re going to use a standard regression (assuming that’s a wise choice), given the obvious censoring of the data. If the idea is just to take the negative predictions and then ordinally rank them 1 to whatever, then fine, but you’re still going to get screwy coefficient estimates that likely aren’t correct. If our interest is in the coefficients as well as the ADP, it’s absolutely a legitimate concern.

There’s the issue there as well that the coefficients should NOT be the same across the entire sample. Jumping from 20 to 5 is going to be different than jumping from 100 to 85.

Paul Singman said...

Kyle: This model will not work well for 2009 stats because it was created using the 2007 season stats and 2008 ADP data. Ideally a model with more predictive value would be based off of multiple years of data. The coefficients were derived by applying a multivariate regression to season stats and ADP data. This is not something just anyone can do (myself included); I suppose it is something you would learn in a stats class.

Millsy: Unfortunately I cannot address your concerns regarding some of the finer points of the regression and model. However, your wording when you say “ADP is a rank-based measure, and I’m not convinced a simple multivariate regression is sufficient. For every one place someone moves up, another moves down.” makes me feel you might be misunderstanding how ADP is calculated. To the best of my knowledge sites like Mock Draft Central, ESPN, and Yahoo (where I got this data from) calculate the numbers by actually taking the average of when a player is drafted, not by ranking each player based off when he is drafted. What this means is that one player can have his ADP increase without necessarily having a negative effect on another player’s ADP. Of course a player or group of players has to absorb the increase of another; however, if one player moves up one spot, another player does not necessarily move down one spot in response.

John K: I understand my choice of words when interpreting the results of the regression might not have been the most accurate. When I take a statistics class next year I’ll get back to you.

Derek Ambrosino said...

I’d like to offer a simpler hypothesis regarding the importance of SBs and HRs, and lack of importance of runs. (Although, I recognize the multicollinearity issue as well.)

I think scale/volume of the stats has something to do with it as well. Teams, as a whole, score a lot of runs and drive in a lot of runs; they certainly accrue way more of each than homers (it’d be impossible not to, I know) and, even moreso, steals. So, when drafting/bidding, owners are less likely to pay a lot of mind to a small difference in runs or RBI. People don’t usually say, wow my team is runs-deficient; I need a runs guy.

Is that shortsighted and naive? Perhaps. But, at the same time, it is hard to guarantee yourself a high finish in runs or RBI. Mostly, winning these categories is the result of an all-around high quality offense without any holes. On the other hand, if you made it a point to focus on dominating steals and homers (or saves) in the draft, it would be much easier to do.

Now that I actually type this out, it strikes me that this reason is really just a much more implicit and organic understanding of multicollinearity that fantasy owners at large seem to possess even if they don’t know the underlying, ten-dollar term for the dynamic they are essentially reacting to.

B N said...

I think the predictability of runs is definitely a big factor in why people don’t consider them very highly. Your ability to score runs is going to be a function primarily of two things: your OBP and who is batting behind you. One of those factors is something that can be taken into account, as it has to do with player skills. The lineup position is a crapshoot though. With rookies, and even with veterans, you can always end up with a case where they get dropped into the bottom of the lineup. In most cases, batting 5th or 6th could be the kiss of death for any nice run totals.

Moreover, since both runs and RBI are lineup position dependent and effectively inverse, you end up with an even bigger mess. If you target runs, you’re generally taking hits on RBI. This makes targeting R or RBI independently a strange thing. It seems to me that you’re mainly paying a premium for a lineup position on a particular team. Not to say there’s anything wrong with that, but lineup position and midseason trades are both factors that are not easily projected. They seem more like things you want to do as modifiers on stuff like the value of OBP and slugging.

And besides, since runs are generated by top of the lineup guys, I’ve usually found they’re pretty easy to make up for in any shallow or even medium league. It’s kind of like closers: there are guys rotating through the leadoff spot getting runs all year, due to injuries, etc.