Since Joe Adler’s Baseball Hacks was published and people started scraping data from the MLB gameday site, the language of choice has been Perl. Before I got that book I had never used Perl before. I’m not a programmer by trade, and am most familiar with Visual Basic, which I use to do most of my baseball stuff. Using the Perl program to fetch the xml files and save them to my hard drive was easy enough, but I could never figure out the program to parse that into a database. When it didn’t work, I really had no idea what to fix. So I came up with my own solution.
The data are stored on my hard drive in this format: Year/Month/Day/GameID/Inning/. There are several files in the GameID folder, but the ones I’m interested in are the innings files in the inning subfolder. They are named inning_1.xml, inning_2.xml, etc. What I’m going to do takes two steps: First I want to bring every inning file into an excel spreadsheet, add a column to add the gameID identifier, and then save all the inning files from each day into a daily spreadsheet. Once I’ve done this, I want to grab several of the day files and consolidate them so I can import into an Access database. You probably don’t want to try getting all the data into one sheet at the same time, because we are talking about hundreds of thousands of rows here. If you don’t have excel 2007, you will face an upper limit of around 65,000 rows anyway. Excel 2007 goes up to a million rows.
The first program I use is called get_xml_innings.xlsm. If you aren’t familiar with the new excel, .xlsx replaces the .xls extension, and .xlsm is a workbook that allows macros. The first thing you need is an xml map. Excel can create one for you, follow these steps:
- Go to the developer tab, xml section.
- click import, and navigate to a directory with the innings xml files. You will get this message: “The specified XML source does not refer to a schema. Excel will create a schema based on the XML source data.” Click OK, and OK to the next message, and your inning files will be placed in sheet1.
- Click on Map Properties (In the developer tab). Under “When refreshing or importing data” change it to “Append new data to existing XML tables”. This way you can bring in 9 innings (or more) per game, and multiple games per day.
- Delete from row 2 to the end of your data, all you need to leave in place are the headings in row1. You could always continue to import inning by inning, but macros do this job much better and faster.
- The macro runs in a simple form, so create a userform1 with two objects in it, a label called label1 and a button called commandbutton1. Change the names to less generic terms if you want, but if you do you’ll have to modify the code. I’ve pasted the code into a text file, you can get it here.
- I generally do one month at a time, and set the days loop to the number of days with games that month, so for April 2009 days would be from 6 to 30.
The program looks in C:\perlprograms\games\year_2009\ for your files, modify this if your games are elsewhere. Once it finishes a day’s work it will save your new files in this folder: c:\perlprograms\pbp_innings\year2009\month_04\.
The program does two things to make the files easier to work with: The xml files when opened into excel have the visiting team events in columns 5 to 51 and the home team in 59 to 105, I move all the data into columns 5 to 51 and put an “a” or “h” in column 59 to identify which team was up. And I put the game ID in column 4. This originally has a heading of “next”, and is yes when there is a next inning, and no when there is not (like bottom of 9th). I didn’t think that column was needed for anything, so it seemed as good a place as any to put the game ID.
The next macro to use is months.xlsm. This program takes the daily files you created above and brings them in, in sheet1, so that you can import into Access. Even a month of data will have over 100,000 rows, so if you don’t have excel2007 you may need to do partial months. The way I did it is one month at a time, then save the file, import into an Access table, then delete all the rows in months.xlsm (except the first – your column headings) and start on the next month. With the minor leagues there is less data (no pitchfx) so I was able to bring all the days into my file in one shot. You will need column headings for columns A to BG, make sure these are exactly the same as the fields in your Access table and you should have no trouble appending data. Here is the code I use for the months file.
Good luck, and I offer no guarantees this will work for you, and my customer support line will be closed at all times (you get what you pay for). But it worked for me, and if you are more familiar with Microsoft tools than Perl, it might work for you.