Building a Retrosheet database, the short form

Today marks the latest release of Retrosheet‘s event files, including 2009, the 1952 AL and some corrections to other years. This is a guide written for someone who already has an SQL database (MySQL, that is) set up and is comfortable with it. I am in the process of putting together a set of tutorials for those with no SQL experience whatsoever, but I’ve made a lot of changes to how I set up a database and I know some people would appreciate having those. And mind, this is a tutorial for Windows users. And make sure you have a lot of space free on your hard drive.

You need to install wget and 7-zip. I have made the presumption that you are installing them to C:\Program Files\. If you are not, you will have to edit some batch files.

By the same token, I am assuming that everything else goes in the folder C:\Retrosheet. If you put it somewhere else, you will have to edit some batchfiles. I have created a zip file that contains everything you need. Simply place that file in the C:\Retrosheet folder (you may have to make one first) and use 7-zip to unzip the file there. That file contains EVERYTHING you need except the event files themselves. That includes a copy of Chadwick version 0.5.2, which is used to parse Retrosheet’s event files.

WARNING: The database schemas in here will only work with the version of Chadwick provided. You will get errors (or corrupted data) if you try to use an older version of Chadwick to parse these files and then load them into the database.

Go to C:\Retrosheet\data\zipped and run the get_zip_files.bat file. (If you have installed 7-zip or wget somewhere other than the default, you’ll need to edit this file.) This will download all the zip files from Retrosheet’s website and extract the contents. Then go to the folder C:\Retrosheet\data\zipped, which should contain the files you just extracted. There are three batch files in there:

  • $cwevent.bat
  • $cwgame.bat
  • $cwsub.bat

Run those to parse the event files through Chadwick.

Okay, time to load that database. A warning – a lot of these queries will take a long time. The first thing to do is run retrosheet table schema.sql. That will wipe out anything you have in any database named “retrosheet,” so if that’s important to you, make a backup before starting. Then, you will find three files in the loaders directory:

  • events.sql
  • games.sql
  • subs.sql

Run those. (In SQLyog, that option is called “Restore From SQL Dump,” under the Tools menu.) They will populate the database. (Note – if you are using a different directory structure, you need to edit these files first to reflect that.) Then, run the partition.sql file – that will “partition” the database to make it easier to load into memory. This will dramatically speed up queries. (After you’ve run partition.sql, you can feel free to delete the tables events_bck and games_bck.) Finally, run lookup_codes.sql to load the lookup tables (the ones that will tell you what the different codes mean).

That’s it!

Thanks to Tangotiger and Mat Kovach, whose work this is based on, Ted Turocy for providing the Chadwick tool set, and Dave Smith and all the volunteers over at Retrosheet for their hard work.

I hope this is useful to you. Let me know if you need help or if something doesn’t work right. And again, if all of this is a bit more advanced than you’re prepared for right now, I am working on a set of more indepth tutorials that should get you started – look for them sometime this week, hopefully.

UPDATE: Last time I posted a Retrosheet tutorial, I included some run expectancy code. Quite frankly, it was pretty poor. This is much better. It includes code for figuring linear weights as well. It is very compute-intensive, though, and requires roughly four times the code of the old method. I think it’s worth it, but then again I’m kind of an obsessive on the topics of run expectancy and linear weights.

Print Friendly
 Share on Facebook0Tweet about this on Twitter0Share on Google+0Share on Reddit0Email this to someone
« Previous: HR/FB Park Factors
Next: Introducing Visual Baseball »

Comments

  1. Nick Steiner said...

    Hey Colin –

    I followed your original series on StatSpeak, however, I can’t use ChadWick or any of the programs you need cause I’m on a mac. 

    Do you know how I would do this on a mac?

  2. Colin Wyers said...

    I don’t have a Mac, so everything I say should be taken with a grain of salt.

    wget is readily available for OS X:

    http://www.statusq.org/archives/2008/07/30/1954/

    Or you could just use curl, which I think comes with OS X. The command syntax for curl is a little different than wget, so you’ll want to check on that.

    A command-line version of 7zip is available for Mac OS as well:

    http://web.me.com/krmathis/

    There are other programs available for OS X that can handle unpacking ZIP files are. I can’t recommend any of them, since I don’t know anything about their relative merits.

    If you go to the Chadwick site, you can download source code with instructions on how to compile from the bash shell, which is bundled with OS X. (That’s how I compiled Chadwick, using mingw and MSYS to provide the shell and compilers.) My understanding is that OS X comes with the autoconf/automake toolchain preinstalled, but again I cannot verify that.

    At that point the only issue is the batch files, which are all simply a list of commands to execute. I know nothing about shell scripting on OS X, but I imagine it would be trivial to write counterparts for OS X.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Current day month ye@r *