Capacity building 101: so, tell me more about this database

by Derek Ambrosino
April 28, 2010

In last week’s column, I mentioned the idea of major fantasy sports providers instituting a census function and keeping a database of league performances. Most of those who reacted to the idea agreed that this would be a useful tool, so I thought I’d expand on it a bit and flesh out how it might work and what I think it should provide. I’ve had this idea for a while (I assume others have too), so I’ve given it some thought.

First of all, I think this should be an optional feature. A league’s commissioner would choose whether he wants to enroll the league in the program. In fact, I’m inclined to think that only private leagues should be able to opt in. My reason for this is that I want to keep the quality of the data as high as possible. The nature of this data dictates that it is most likely to be used by serious fantasy sports participants, so I’d like to have at least some filtration of the data. I want to minimize the amount of data coming from leagues where somebody drafts Matt Kemp in the fourth round and a third of the teams aren’t even rotated on a regular basis. So, my broad sweeping assumption is that private leagues are generally of a higher quality than public leagues.

By opting in, your league’s settings are recorded and the system begins banking data about rosters, drafts and scoring. When using the database, the user would just input the preferred format in a series of drop-down menus: player universe, roto vs. head-to-head, number of teams, roster size/starting positions (there’s probably more variance here in pitcher starting roster set-up than batting, so I’m inclined to just have the system not distinguish between SP and RP and simply ask for the number of active pitching slots), etc.

On a side note, I really don’t understand why there’s an option to differentiate SPs and RPs in the first place and I encourage everybody who will listen to me to set up their leagues such that all the pitching slots are simply Ps. I mean, this is a totally artificial distinction; “starting pitcher” and “relief pitcher” are not real positions. Nothing is to stop a team from having nine guys pitch one inning each, so I don’t see why a fantasy league would issue mandates that owners own a minimum number of different types of pitchers. It’s no more sensible than having slots designated for righties and southpaws. OK, guys, rant over.

Anyway, here are a few things I think the system should track and why those things might be useful for the fantasy universe to know.

Scoring. I think this is the most important vein of information to be mined from this hypothetical tool. Here are a few important questions that we could gain insight into:

How many points/what record does it take to win the average league of your size/structure?
How is the scoring distributed? Are some categories more clustered relative to others (and the overall supply of those stats)?
Are there patterns about the relative strengths and weaknesses of good-performing and poor-performing teams?
If I’m aiming for the 10 out of 12 across the board strategy, what benchmarks should I be shooting for per category?
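
To make the benchmark question a little more concrete, here is a minimal Python sketch of how a provider might answer it. Everything in it is hypothetical: the function name, the data layout (each league stored as a dictionary mapping a category to its teams' season totals), and the sample numbers are invented for illustration. It also assumes that higher is better in every category (ERA and WHIP would need the sort reversed) and it ignores ties.

```python
from statistics import mean


def category_benchmarks(leagues, target_points):
    """Average the stat total that earned `target_points` roto points per category.

    `leagues` is a list of dicts mapping category name -> list of team totals.
    Assumption: higher is better in every category, and ties are ignored.
    """
    per_category = {}
    for league in leagues:
        for category, totals in league.items():
            # Ascending sort: the worst total earns 1 point, the best earns len(totals).
            ranked = sorted(totals)
            # The team that earned `target_points` sits at index target_points - 1.
            per_category.setdefault(category, []).append(ranked[target_points - 1])
    return {category: mean(values) for category, values in per_category.items()}


# Invented sample data: two four-team leagues, so 3 points means second place.
sample_leagues = [
    {"HR": [200, 250, 220, 180], "SB": [100, 120, 90, 150]},
    {"HR": [210, 240, 260, 190], "SB": [110, 130, 80, 140]},
]
print(category_benchmarks(sample_leagues, target_points=3))
```

For the 10 out of 12 strategy in a 12-team league, the same call would use `target_points=10` against leagues that all share your format.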

Player ownership. Perhaps there isn’t much to be gained from learning things like which players were most often found on championship teams and which were found on losing teams, but paired with some draft information it could be worthwhile to know these things.

Draft info. Some owners’ player acquisition strategies are driven very heavily by positional scarcity. Is, for example, forgoing first basemen earlier in the draft in favor of middle infielders a strategy common to winning teams? Of course, establishing this database would also allow the providers to publish their own ADP data.
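
ADP (average draft position) is simple enough to sketch. The Python below is only an illustration under stated assumptions: the function name and the input layout (each draft as an ordered list of player names) are hypothetical, the player names are placeholders, and a player is averaged only over the drafts in which he was actually selected, which is one of several reasonable conventions.

```python
from collections import defaultdict


def average_draft_position(drafts):
    """Return (player, ADP) pairs sorted from earliest to latest average pick.

    `drafts` is a list of drafts; each draft is a list of player names in pick order.
    Assumption: undrafted players are skipped rather than assigned a penalty pick.
    """
    picks = defaultdict(list)
    for draft in drafts:
        for pick_number, player in enumerate(draft, start=1):
            picks[player].append(pick_number)
    adp = {player: sum(numbers) / len(numbers) for player, numbers in picks.items()}
    return sorted(adp.items(), key=lambda item: item[1])


# Placeholder drafts from three hypothetical leagues.
sample_drafts = [
    ["Player A", "Player B", "Player C"],
    ["Player B", "Player A", "Player C"],
    ["Player A", "Player C", "Player B"],
]
for player, position in average_draft_position(sample_drafts):
    print(f"{player}: {position:.2f}")
```

Restricting `drafts` to leagues of one format, or to the drafts of teams that went on to win, would be the natural next step for the positional scarcity question.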

There are some problems with this proposal, I’m aware. One of the main questions is whether there are too many junk leagues that will muck up the data. This is a question I’m not really sure about. I do feel like the majority of people who I meet randomly and start talking about fantasy baseball with seem to have no idea what the hell they are talking about. (If I had an agent, I assume he’d advise me against sharing this opinion with the public, as a fantasy baseball columnist.) A minimal step to try to improve the quality of the data would be to keep public leagues out of the process, but beyond that, I’m not sure how to “gatekeep.”

Another potential problem might be the myriad subtle variations of league and roster structures that could slice the data really thin. Again, I’m not certain this is a substantial problem, just that it has the potential to be. I think negating the distinction between SPs and RPs is a good first step. Perhaps you could ignore distinctions in bench spots and only focus on active roster spots, if the data started getting cut thin. But these would be kinks to work out once the evaluators get to see what they are working with. At the very least, it would be interesting to see how popular different league formats really are.
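
One way to picture the coarsening described above is as a grouping key built from each league's settings. This is a rough Python sketch, not a real provider's schema: the field names (`scoring`, `teams`, `hitters`, `SP`, `RP`, `P`, `bench`) and the sample leagues are assumptions made up for the example.

```python
from collections import Counter


def format_key(settings, ignore_bench=False):
    """Collapse a league's settings into a coarse key for grouping similar leagues.

    All SP, RP and generic P slots are merged into one active-pitcher count.
    Bench size is dropped from the key when `ignore_bench` is True.
    """
    pitchers = settings.get("SP", 0) + settings.get("RP", 0) + settings.get("P", 0)
    key = (settings["scoring"], settings["teams"], settings["hitters"], pitchers)
    if not ignore_bench:
        key += (settings.get("bench", 0),)
    return key


# Invented leagues: same active slots, different pitcher labels and bench sizes.
sample_settings = [
    {"scoring": "roto", "teams": 12, "hitters": 13, "SP": 5, "RP": 2, "P": 2, "bench": 3},
    {"scoring": "roto", "teams": 12, "hitters": 13, "P": 9, "bench": 5},
    {"scoring": "h2h", "teams": 10, "hitters": 9, "SP": 2, "RP": 2, "P": 3, "bench": 4},
]

# Counting keys shows how popular each (coarsened) format is.
popularity = Counter(format_key(s, ignore_bench=True) for s in sample_settings)
for key, count in popularity.most_common():
    print(key, count)
```

With bench spots ignored, the first two leagues land in the same bucket; with them included, they split apart, which is exactly the thin-slicing trade-off in question.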

Got what you think is a fairly easily implemented idea for a valuable tool to advance the analysis of fantasy baseball, or suggested additions or criticisms of mine? Let’s hear ‘em.

4 Comments

Matt Levy

13 years ago

Derek,
Your column is great, but I’m pretty sure you don’t have to worry about offending the people who seem like they have no idea what the hell they are talking about. More than likely, they don’t come to THT. If they did, maybe they’d know a little more.

I think that splitting the data between private and public is a good idea. No harm in getting both sets of info. And maybe there should even be a combined one to show the variance.

sean

1. I’d love to see the data divided by weekly and daily transaction leagues. Streaming seems to be a really rampant strategy in my daily leagues and really affects drafts and the free agent market.

2. I would think that there should be consideration paid to public cash leagues (CBS, Fanball, etc). These participants are incredibly serious although randomly matched in a public league.

3. I hate to be a pessimist, but I have a strange sinking feeling that whatever advantage may be gained by way of preseason player evaluation for draft-day valuation purposes is going to be seriously mitigated, if not completely overcome, by unforeseen injury and luck-of-the-draw-type in-season pickups.

Jeremy

There is a hard question here, and an easy question. The hard question is, “Which data will be useful?”

The hard question is irrelevant.

The easy question is, “What kind of data should be included?” The correct answer is, “Anything you can think of.” This is not the 1980s—a few extra kB of data will not smash the database. It is practically certain that you will end up including many things that you find useless. It is also practically certain that you will find things that you thought you would never use that turn out to be gold mines. (Imagine trying to convince a fan in the 1970s that pitch counts are important.)

Then, when it comes time for each individual analysis, you can pick and choose which data to keep. For some things, a large sample size is needed. For others, noise reduction is worth it even if it means a smaller sample size. If you keep the data, you can make this decision later, in the best way for each analysis. Any information you throw away now, you can never get back.

It works (mostly) for the particle physicists.

Andrew

Pretty cool to hear you on BaseballHQ Radio, Derek.

BaseballHQ meets The Hardball Times – a killer duo.
