February 10, 2010

Fangraphs Player Search:

Order Now


Get "The world champ of baseball annuals." The Hardball Times Baseball Annual 2010 features articles by THT's staff as well as Bill James, Tom Tango and Craig Wright and contains much, much more. Please support THT and use this link to purchase the Annual.


Get the fantasy book that everyone's raving about! Edited by THT Fantasy's Rob McQuown and Michael Street, and featuring our own Matt Hagen on prospects. Shipping now from ACTA!

Most Recent Comments


For daily results, visit the THT Daily Archive.


And here's the full roster.



Or you can search by:

Sports Tickets

Gear up for baseball season with Chicago White Sox tickets and New York Yankees tickets. LA Angels tickets, Houston Astros tickets, and Atlanta Braves tickets are hot sellers! You can get Boston Red Sox tickets, San Diego Padres tickets or Chicago Cubs tickets for your favorite baseball fan. Coast to Coast Tickets has the best MLB tickets like Minnesota Twins tickets, LA Dodgers tickets, Milwaukee Brewers tickets, New York Met tickets and St. Louis Cardinals tickets.
Find premium Chicago Cubs tickets and other Chicago tickets at JustGreatTickets.com.
Chicago Cubs Tickets
Chicago Tickets
Championship Tickets



Creative Commons License
All content on this site (including text, graphs, and any other original works), unless otherwise noted, is licensed under a Creative Commons License.
Roll mouse over date for entries
THT Live Calendar
February 2010
S M T W T F S

1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28





Saturday, November 21, 2009

Building a Retrosheet database, the short form

Posted by Colin Wyers
Today marks the latest release of Retrosheet's event files, including 2009, the 1952 AL and some corrections to other years. This is a guide written for someone who already has an SQL database (MySQL, that is) set up and is comfortable with it. I am in the process of putting together a set of tutorials for those with no SQL experience whatsoever, but I've made a lot of changes to how I set up a database and I know some people would appreciate having those. And mind, this is a tutorial for Windows users. And make sure you have a lot of space free on your hard drive.

You need to install wget and 7-zip. I have made the presumption that you are installing them to C:\Program Files\. If you are not, you will have to edit some batch files.

By the same token, I am assuming that everything else goes in the folder C:\Retrosheet. If you put it somewhere else, you will have to edit some batchfiles. I have created a zip file that contains everything you need. Simply place that file in the C:\Retrosheet folder (you may have to make one first) and use 7-zip to unzip the file there. That file contains EVERYTHING you need except the event files themselves. That includes a copy of Chadwick version 0.5.2, which is used to parse Retrosheet's event files.

WARNING: The database schemas in here will only work with the version of Chadwick provided. You will get errors (or corrupted data) if you try to use an older version of Chadwick to parse these files and then load them into the database.

Go to C:\Retrosheet\data\zipped and run the get_zip_files.bat file. (If you have installed 7-zip or wget somewhere other than the default, you'll need to edit this file.) This will download all the zip files from Retrosheet's website and extract the contents. Then go to the folder C:\Retrosheet\data\zipped, which should contain the files you just extracted. There are three batch files in there:


  • $cwevent.bat
  • $cwgame.bat
  • $cwsub.bat


Run those to parse the event files through Chadwick.

Okay, time to load that database. A warning - a lot of these queries will take a long time. The first thing to do is run retrosheet table schema.sql. That will wipe out anything you have in any database named "retrosheet," so if that's important to you, make a backup before starting. Then, you will find three files in the loaders directory:


  • events.sql
  • games.sql
  • subs.sql


Run those. (In SQLyog, that option is called "Restore From SQL Dump," under the Tools menu.) They will populate the database. (Note - if you are using a different directory structure, you need to edit these files first to reflect that.) Then, run the partition.sql file - that will "partition" the database to make it easier to load into memory. This will dramatically speed up queries. (After you've run partition.sql, you can feel free to delete the tables events_bck and games_bck.) Finally, run lookup_codes.sql to load the lookup tables (the ones that will tell you what the different codes mean).

That's it!

Thanks to Tangotiger and Mat Kovach, whose work this is based on, Ted Turocy for providing the Chadwick tool set, and Dave Smith and all the volunteers over at Retrosheet for their hard work.

I hope this is useful to you. Let me know if you need help or if something doesn't work right. And again, if all of this is a bit more advanced than you're prepared for right now, I am working on a set of more indepth tutorials that should get you started - look for them sometime this week, hopefully.

UPDATE: Last time I posted a Retrosheet tutorial, I included some run expectancy code. Quite frankly, it was pretty poor. This is much better. It includes code for figuring linear weights as well. It is very compute-intensive, though, and requires roughly four times the code of the old method. I think it's worth it, but then again I'm kind of an obsessive on the topics of run expectancy and linear weights.



Colin Wyers knows exactly how much of a nerd he is. He is very interested in hearing about any other concerns you may have; you can reach him by e-mail, and he will try his best to respond in a timely fashion. He also blogs at Statistically Speaking.


Comments

Nick Steiner said...

Hey Colin -

I followed your original series on StatSpeak, however, I can’t use ChadWick or any of the programs you need cause I’m on a mac. 

Do you know how I would do this on a mac?

Posted 11/22  at  12:18 PM
Colin Wyers said...

I don’t have a Mac, so everything I say should be taken with a grain of salt.

wget is readily available for OS X:

http://www.statusq.org/archives/2008/07/30/1954/

Or you could just use curl, which I think comes with OS X. The command syntax for curl is a little different than wget, so you’ll want to check on that.

A command-line version of 7zip is available for Mac OS as well:

http://web.me.com/krmathis/

There are other programs available for OS X that can handle unpacking ZIP files are. I can’t recommend any of them, since I don’t know anything about their relative merits.

If you go to the Chadwick site, you can download source code with instructions on how to compile from the bash shell, which is bundled with OS X. (That’s how I compiled Chadwick, using mingw and MSYS to provide the shell and compilers.) My understanding is that OS X comes with the autoconf/automake toolchain preinstalled, but again I cannot verify that.

At that point the only issue is the batch files, which are all simply a list of commands to execute. I know nothing about shell scripting on OS X, but I imagine it would be trivial to write counterparts for OS X.

Posted 11/22  at  01:57 PM
RZ said...

Awesome Colin!

Posted 11/22  at  05:40 PM
Tim said...

This is really great - thank you so much!!

Posted 11/23  at  01:00 AM
Page 1 of 1

Leave a comment:

Commenting is not available in this weblog entry.