Saturday, November 21, 2009
Building a Retrosheet database, the short form
Posted by Colin Wyers
Today marks the latest release of Retrosheet's event files, including 2009, the 1952 AL, and corrections to other years. This guide is written for someone who already has a MySQL database set up and is comfortable with it. I am in the process of putting together a set of tutorials for those with no SQL experience whatsoever, but I've made a lot of changes to how I set up a database, and I know some people would appreciate having those now. Note that this is a tutorial for Windows users, and make sure you have plenty of free space on your hard drive.
You need to install wget and 7-zip. I have assumed that you are installing them to C:\Program Files\. If you are not, you will have to edit some batch files.
By the same token, I am assuming that everything else goes in the folder C:\Retrosheet. If you put it somewhere else, you will have to edit some batch files. I have created a zip file that contains everything you need. Simply place that file in the C:\Retrosheet folder (you may have to create it first) and use 7-zip to unzip the file there. That archive contains EVERYTHING you need except the event files themselves, including a copy of Chadwick version 0.5.2, which is used to parse Retrosheet's event files.
WARNING: The database schemas in here will only work with the version of Chadwick provided. You will get errors (or corrupted data) if you try to use an older version of Chadwick to parse these files and then load them into the database.
Go to C:\Retrosheet\data\zipped and run the get_zip_files.bat file. (If you have installed 7-zip or wget somewhere other than the default, you'll need to edit this file.) This will download all the zip files from Retrosheet's website and extract their contents. The same folder should now contain the files you just extracted, along with three batch files. Run those to parse the event files through Chadwick.
Okay, time to load the database. A warning: many of these queries will take a long time. The first thing to do is run retrosheet table schema.sql. That will wipe out anything you have in a database named "retrosheet," so if that data is important to you, make a backup before starting. Then you will find three files in the loaders directory:
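To give a sense of what the schema script does, here is a minimal sketch. The column names below are illustrative only; the real file defines one column per Chadwick output field, and your copy is the authoritative version.

```sql
-- Hypothetical excerpt; the real schema defines many more columns,
-- one per field that Chadwick's cwevent emits.
DROP DATABASE IF EXISTS retrosheet;   -- this is the destructive step
CREATE DATABASE retrosheet;
USE retrosheet;

CREATE TABLE events (
  game_id   VARCHAR(12),   -- e.g. 'BOS200904060'
  event_id  SMALLINT,      -- sequence number within the game
  inn_ct    TINYINT,       -- inning
  bat_id    VARCHAR(8),    -- Retrosheet batter ID
  pit_id    VARCHAR(8),    -- Retrosheet pitcher ID
  event_cd  TINYINT        -- event type code (see the lookup tables)
  -- ...dozens more columns in the actual schema
);
```

The DROP DATABASE line is why you want that backup first.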
Run those. (In SQLyog, that option is called "Restore From SQL Dump," under the Tools menu.) They will populate the database. (Note: if you are using a different directory structure, you need to edit these files first to reflect that.) Then run the partition.sql file, which will "partition" the database to make it easier to load into memory. This will dramatically speed up queries. (After you've run partition.sql, feel free to delete the tables events_bck and games_bck.) Finally, run lookup_codes.sql to load the lookup tables (the ones that tell you what the different codes mean).
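If you're curious what the loader and partition scripts are doing under the hood, here is a rough sketch. The file path, column name (year_id), and partition boundaries are assumptions for illustration; the scripts in the zip file are the ones to actually run.

```sql
-- Hypothetical loader statement: bulk-load one of Chadwick's CSV
-- output files into the events table. The path is illustrative.
LOAD DATA LOCAL INFILE 'C:/Retrosheet/data/zipped/events2009.csv'
INTO TABLE events
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

-- Partitioning sketch: splitting events by season means a query that
-- filters on year only has to read the matching partition(s).
-- Assumes a year_id column exists in your events table.
ALTER TABLE events
PARTITION BY RANGE (year_id) (
  PARTITION p1950s VALUES LESS THAN (1960),
  PARTITION p1960s VALUES LESS THAN (1970),
  PARTITION p1970s VALUES LESS THAN (1980),
  PARTITION p1980s VALUES LESS THAN (1990),
  PARTITION p1990s VALUES LESS THAN (2000),
  PARTITION p2000s VALUES LESS THAN (2010),
  PARTITION pmax   VALUES LESS THAN MAXVALUE
);
```

LOAD DATA INFILE is far faster than inserting row by row, which is why the loaders use it despite the long run times.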
Thanks to Tangotiger and Mat Kovach, whose work this is based on, Ted Turocy for providing the Chadwick tool set, and Dave Smith and all the volunteers over at Retrosheet for their hard work.
I hope this is useful to you. Let me know if you need help or if something doesn't work right. And again, if all of this is a bit more advanced than you're prepared for right now, I am working on a set of more in-depth tutorials that should get you started; look for them sometime this week, hopefully.
UPDATE: Last time I posted a Retrosheet tutorial, I included some run expectancy code. Quite frankly, it was pretty poor. This version is much better, and it includes code for figuring linear weights as well. It is very compute-intensive, though, and requires roughly four times the code of the old method. I think it's worth it, but then again I'm kind of obsessive about run expectancy and linear weights.
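For readers who want the general idea before opening the file: a run expectancy table is just the average number of runs that score from each base-out state to the end of the half-inning. The sketch below is NOT the code from the update; it's a minimal illustration, and the column names (outs_ct, start_bases_cd as a 0-7 base-state code, event_runs_ct, fate_runs_ct for runs scoring in the rest of the half-inning, bat_event_fl) are assumptions that may not match your schema exactly.

```sql
-- Minimal run expectancy sketch (not the full linear-weights code).
-- Assumes fate_runs_ct = runs scored after this event through the end
-- of the half-inning, so event_runs_ct + fate_runs_ct = runs from the
-- start of this plate appearance onward.
SELECT outs_ct,
       start_bases_cd,                               -- 0-7 base-state code
       AVG(event_runs_ct + fate_runs_ct) AS run_exp, -- expected runs to inning end
       COUNT(*)                          AS n        -- sample size
FROM   events
WHERE  bat_event_fl = 'T'                            -- plate appearances only
GROUP  BY outs_ct, start_bases_cd
ORDER  BY outs_ct, start_bases_cd;
```

A serious version also filters out truncated innings (walk-offs, rain-shortened games) and restricts to a sensible run environment, which is part of why the real code is so much longer.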
Colin Wyers knows exactly how much of a nerd he is. He is very interested in hearing about any other concerns you may have; you can reach him by e-mail, and he will try his best to respond in a timely fashion. He also blogs at Statistically Speaking.