|
September 2, 2010
THT Essentials:
Most Recent Comments
One day left to win an Annual! (4)
The White Sox owe their fans an apology (21) Twenty years ago today (9/2/10) (6) Welcome to Chapmania (2) Relievers compared to Strasburg (2) ![]() ![]()
Rich Barbieri
John Barten Brian Borawski Kevin Dame Joshua Fisher David Gassko Brandon Isleib Chris Jaffe Brad Johnson Myron Logan Max Marchi Bruce Markusen Anna McDonald Dan Novick Harry Pavlidis Alex Pedicini Jeff Sackmann Nick Steiner Dave Studeman Steve Treder Tuck! David Wade Geoff Young And here's the full roster.
Or you can search by:
Dish Satellite TV
Quality sports programming from a satellite industry leader. Gear up for baseball season with Chicago White Sox tickets and New York Yankees tickets. LA Angels tickets, Houston Astros tickets, and Atlanta Braves tickets are hot sellers! You can get Boston Red Sox tickets, San Diego Padres tickets or Chicago Cubs tickets for your favorite baseball fan. Coast to Coast Tickets has the best MLB tickets like Minnesota Twins tickets, LA Dodgers tickets, Milwaukee Brewers tickets, New York Met tickets and St. Louis Cardinals tickets. ![]() All content on this site (including text, graphs, and any other original works), unless otherwise noted, is licensed under a Creative Commons License. |
Monday, March 08, 2010Getting Pitch FX using Microsoft ToolsPosted by Sean SmithSince Joe Adler's Baseball Hacks was published and people started scraping data from the MLB gameday site, the language of choice has been Perl. Before I got that book I had never used Perl before. I'm not a programmer by trade, and am most familiar with Visual Basic, which I use to do most of my baseball stuff. Using the Perl program to fetch the xml files and save them to my hard drive was easy enough, but I could never figure out the program to parse that into a database. When it didn't work, I really had no idea what to fix. So I came up with my own solution. The data are stored on my hard drive in this format: Year/Month/Day/GameID/Inning/. There are several files in the GameID folder, but the ones I'm interested in are the innings files in the inning subfolder. They are named inning_1.xml, inning_2.xml, etc. What I'm going to do takes two steps: First I want to bring every inning file into an excel spreadsheet, add a column to add the gameID identifier, and then save all the inning files from each day into a daily spreadsheet. Once I've done this, I want to grab several of the day files and consolidate them so I can import into an Access database. You probably don't want to try getting all the data into one sheet at the same time, because we are talking about hundreds of thousands of rows here. If you don't have excel 2007, you will face an upper limit of around 65,000 rows anyway. Excel 2007 goes up to a million rows. The first program I use is called get_xml_innings.xlsm. If you aren't familiar with the new excel, .xlsx replaces the .xls extension, and .xlsm is a workbook that allows macros. The first thing you need is an xml map. Excel can create one for you, follow these steps:
The program looks in C:\perlprograms\games\year_2009\ for your files, modify this if your games are elsewhere. Once it finishes a day's work it will save your new files in this folder: c:\perlprograms\pbp_innings\year2009\month_04\. The program does two things to make the files easier to work with: The xml files when opened into excel have the visiting team events in columns 5 to 51 and the home team in 59 to 105, I move all the data into columns 5 to 51 and put an "a" or "h" in column 59 to identify which team was up. And I put the game ID in column 4. This originally has a heading of "next", and is yes when there is a next inning, and no when there is not (like bottom of 9th). I didn't think that column was needed for anything, so it seemed as good a place as any to put the game ID. The next macro to use is months.xlsm. This program takes the daily files you created above and brings them in, in sheet1, so that you can import into Access. Even a month of data will have over 100,000 rows, so if you don't have excel2007 you may need to do partial months. The way I did it is one month at a time, then save the file, import into an Access table, then delete all the rows in months.xlsm (except the first - your column headings) and start on the next month. With the minor leagues there is less data (no pitchfx) so I was able to bring all the days into my file in one shot. You will need column headings for columns A to BG, make sure these are exactly the same as the fields in your Access table and you should have no trouble appending data. Here is the code I use for the months file. Good luck, and I offer no guarantees this will work for you, and my customer support line will be closed at all times (you get what you pay for). But it worked for me, and if you are more familiar with Microsoft tools than Perl, it might work for you. Sean Smith is a lifelong Angels fan despite never visiting the west coast until April 2006. His work can also be found at baseballprojection.com and Anaheim Angels All the Way and he can be contacted by email. CommentsLeave a comment:Commenting is not available in this weblog entry. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||