Databases for sabermetricians, Part One

Stop me if this sounds familiar. You have a question about baseball, the sort of thing that should be answerable with stats, you think. So you go looking for the stat, but you can’t find anyone who computes it the way you want. So you find a site that (maybe) has the raw data you need, copy and paste it into Excel, and bang out a few formulas until you get a result. And all along the way, you’ve thought: There has to be a better way.

I’m here to tell you that in fact there is—the relational database. And I’m here to help.

First, a warning: I am not a professional database administrator. I am a hobbyist. I may not follow “best practices.” I will not be explaining a lot of relational database theory or lecturing you on the importance of the normal forms. Frankly, I don’t know too much about that myself. And if you have your own database setup and workflow already, your setup may differ from mine. I hope you can still derive some value from this.

Also, I run Windows, so this tutorial will be written for Windows users. If you use a Mac, the people at Beyond the Boxscore have some helpful hints.

Most modern databases use something called the Structured Query Language, or SQL. It is, yes, a programming language. But please, don’t let that scare you. It’s simply a way for you to "talk" to a database and ask it questions. We’ll get into that more in a minute. First, we need to set up a database.

Installing the software

There are many databases to choose from. I use MySQL. It’s free, which is nice. And there’s a large user base, which means there are plenty of tools and tutorials for you to look at.

First, you need to download and run the installer for MySQL 5.4 CE Essentials. Choose a “typical” install. You’ll typically have to wait a few minutes for files to copy. Then there will be some ads. Skip over them. Make sure that “Configure the MySQL server now” is checked. (I uncheck the box about registering the software with MySQL, but that’s up to you.) Chose “Detailed configuration.”

Then you’ll go through a set of pages asking you questions. Here are the answers:

  • Developer Machine
  • Non-Transactional Database Only
  • Decision Support (DSS)/OLAP

Continue on, using the defaults. When prompted, either add a password or uncheck the box to alter security settings. (This will leave your root password blank.) Once you’re finished, you shouldn’t have to start MySQL at all—it should be ever vigilant, waiting for you to command it to do your sabermetric bidding.

(If you get an error message that says "This installation package could not be started. Contact the application vendor to verify that this is a valid Windows Installer package," you need to download and install Windows Installer.)

Again sticking with the theme of “absolutely free,” I use SQLyog. You want the “Windows-Binary” from the Community Edition. Installation is straightforward—just download it and click “Next” a few times. Then run SQLyog.

From there, you’ll be presented with a screen asking you for the settings to connect to the database. Everything should be correct. If you selected a password, type it into the password blank. Then hit “Connect.”

And… voila! You’ve been connected to, well, to an empty database. I suspect you want some actual data in your database.

Putting the “data” in database

There are obviously a lot of different things you could want to have in your database, depending on your area of interest. Let’s start off modestly, if not exactly small. The Baseball Databank is a record of pretty much every official statistic for every baseball player in the history of the majors. (Right now, it’s current through 2008; the latest update is expected soon after the awards voting is
finished.)

First, download the ZIP file. Extract the .SQL file from the ZIP, and remember when you put it. Now go into SQLyog.

In SQLyog there will be three panes. The left-hand side should list your databases, the top is where you write queries, and the bottom shows the results. Let’s add a database. Right-click in the pane, and select “Create Database.” Call it “BDB,” leaving off the punctuation. Leave the rest as the defaults, and click okay.

A Hardball Times Update
Goodbye for now.

Now, let’s populate the database. Right-click on the database, and select “Restore from SQL Dump.” There should be an button with an elipsis (…) in it. Click that, and select the file we extracted from the ZIP. Select Execute. It’ll ask you if you want to execute the queries in database BDB. You do. This should take a little while.

Once that’s done, go into your database and click on the folder that says Tables. If nothing is there, try hitting “F5” on the top row of your keyboard to refresh the browser.

Go ahead and click on batting, and then slide on over to the right and click on “Table Data.” Scroll around for a bit. If you’ve ever used a spreadsheet or looked at a table before, the layout should feel familiar. Now it’s time to play around with the database a little.

Asking the database questions

The fundamental unit of the database is the table, which is further broken down into the row and the column. Each row is a “record” in the database; you can have as many records as you like. Every row has several fields which contain the data you’re interested in. A column is a group containing all of the fields of a certain type in the table.

You won’t be entering formulas into cells, like you would a spreadsheet. You interact with the data in the spreadsheet by writing queries. Let’s start off by looking up the hitting stats of the 1990 Royals.

Slide up into the query editor and type in this:

    SELECT *
    
FROM batting
    
WHERE yearID = "1990"
        
AND teamID = "KCA"

There is a green play button in the upper left-hand corner; press it. (Or just hit F9.) Most of the column headers should be pretty self-explanatory: H is hits, SF sacrifice flies, etc. And playerID is each player’s unique identifier, taken from the first five letters of the last name and the first two letters of the first name, plus a number. Some are easier to figure out than others.

To explain:

  • SELECT is your bread-and-butter SQL command; it tells the database to fetch records out of the table.
  • * is a wildcard; it tells the database to fetch all columns.
  • FROM tells the database what table to fetch from.
  • WHERE tells the database to filter the information.
  • AND means you want the database to match multiple columns.
  • The semicolon tells the database that our instruction is complete.

The equal sign is one way to compare two values in SQL. The quotation marks are important; if you don’t use quotation marks, MySQL will assume you’re referring to a column of that name.

In some results you’ll see a value of (NULL); that means the field is empty. In the batting table, NULLs are generally used to indicate:

  • Players who appeared in a game but had no official plate appearance (generally relief pitchers, pinch-runners and defensive replacements).
  • Statistics that were not tracked that year (generally sacrifice flies, double plays, etc.).

In this case, we really only want to see players who went to the plate. And let’s say we also want to know the “sabermetric triple crown” stats of batting average, on-base percentage and slugging percentage. (Tango’s sabermetric wiki is a great source of formulas to try out, incidentally.) We can add to our query a bit:

    SELECT *, H/AB AS AVG
        
, (H+BB+HBP)/(AB+BB+HBP+SF) AS OBP
        
, (H+2B+2*3B+3*HR)/AB AS SLG
    
FROM batting
    
WHERE yearID = "1990"AND teamID = "KCA"AND AB IS NOT NULL;

You don’t have to break up the different lines; I do that for legibility. You do want to separate each thing you want to select by commas. I put the commas at the start of each individual line, but that’s a personal style preference. SQL simply ignores carriage returns (that is to say, line breaks), tabs and extra spaces, so feel free to “pretty up” your code in a way that makes it easy for you to read.

SQL will do any simply arithmetic with the records you provide. (You can also do a lot more complex math with it if you really want to.) The AS command is very helpful—otherwise the column will end up being named “(H+BB+HBP)/(AB+BB+HBP+SF),” which simply isn’t very convenient.

Let’s say we don’t want every column, we just want a few columns:

    SELECT playerID, AB, H, 2B, 3B, HR, BB
            
, SO, IBB, HBP, SH, SF
            
, H/AB AS AVG
            
, (H+BB+HBP)/(AB+BB+HBP+SF) AS OBP
            
, (H+2B+2*3B+3*HR)/AB AS SLG
    
FROM batting
    
WHERE yearID = "1990"
        
AND teamID = "KCA"
        
AND AB IS NOT NULL;

Bored with the 1990 Royals yet? Let’s try something different.

    SELECT playerID, yearID, teamID, HR
    
FROM batting
    
ORDER BY HR DESC
            
LIMIT 50;

ORDER BY tells the database to sort the list; Desc tells it to go from large to small. (Asc is the corresponding command to sort from small to large.) LIMIT tells SQL to return only the first 50 records.

So that’s the top 50 home run seasons. (Supply your own asterisks where you feel it necessary.) But what if we want career totals?

    SELECT playerID, SUM(HR) AS HR
    
FROM batting
    
GROUP BY playerID
    
ORDER BY HR DESC
            
LIMIT 50;

The SUM command will sum everything in the column unless you stop it. GROUP BY tells the database to combine everything based upon player ID.

Now what if want to look at the top 50 career batting averages? We obviously don’t want to see some pitcher who lucked his way into going 2-for-3 and never saw the majors again. So let’s say we want a minimum 3,000 career plate appearances.

    SELECT playerID
            
, SUM(H)/SUM(AB) AS AVG
            
, SUM(AB+BB+HBP+COALESCE(SF,0)) AS PA
    
FROM batting
    
GROUP BY playerID
    
HAVING PA >= 3000
    
ORDER BY AVG DESC
            
LIMIT 50;

Because we’re using the SUM function, we have to use the HAVING function instead of the GROUP BY function. Note the >= – that’s greater than or equal to.

So, what’s that COALESCE doing there? Remember, sacrifice flies weren’t officially recorded until 1954. For years prior, those values are NULL in the database. Trying to sum a NULL value chucks everything out. Using the COALESCE function in this way tells the database to substitute 0 in the place of a NULL. Otherwise, players like Ty Cobb would be excluded from the list (precisely the sorts of players we don’t want excluded!)

Now let’s try writing a query to list the 50 lowest career earned run averages, minimum 1,000 innings pitched. The Databank doesn’t list IP for pitchers, simply outs. Remember to divide by three:

    SELECT playerID
            
, SUM(ER)/SUM(IPOuts/3)*9 AS ERA
            
, SUM(IPOuts/3) AS IP
    
FROM pitching
    
GROUP BY playerID
    
HAVING IP >= 1000
    
ORDER BY ERA ASC
            
LIMIT 50;

If you run this query, you’ll have a lot harder time recognizing the top leaders than the top batting average leaders. A lot of these guys were deadball-era pitchers. So let’s find these guys’ names.

Working with joins, a first glance

Often, all the data you want to use won’t be stored in a single table—you’ll have some data in one table, and some data in another. That’s where joins come into play. A join lets you hook two (or more) tables together for querying.

There are two ways to do a join: explicitly and implicitly. When I say an explicit join, I mean a join where you use the JOIN command. An implicit join is one where MySQL is just supposed to figure out that you want it to join. For now, we’ll start off with the implicit join.

In this case, what we want are the first and last names out of the master table. In this case, what we’re really after is a player’s full name. So let’s put that together:

    SELECT CONCAT(nameFirst," ",nameLast) AS Name, playerID
    
FROM MASTER;

CONCAT is a fun command for string manipulation. It lets you take multiple strings and combine them. In this case, we took the first name, a space (surrounded by quotes) and a last name, separating them with commas. CONCAT does the rest.

Instead of joining to the entire master table, we’re going to join to the results of that query using an alias.

In SQL, you use the period separator to indicate what table you want to take data from, like so:

table.column

where table is the name of the table and column is the name of the column. You can also assign an alias to a query when you use it in the FROM command, like so:

    SELECT p.playerID
            
, m.Name
            
, SUM(p.ER)/SUM(p.IPOuts/3)*9 AS ERA
            
, SUM(p.IPOuts/3) AS IP
    
FROM pitching p
            
, (SELECT CONCAT(nameFirst," ",nameLast) AS Name
              
, playerID
              
FROM MASTER) m
    
WHERE p.playerID = m.playerID
    
GROUP BY playerID
    
HAVING IP >= 1000
    
ORDER BY ERA ASC
            
LIMIT 50;

So now, instead of seeing that walshed01 is the career ERA leader, we can see that it’s Ed Walsh. It’s an improvement, right?

The key is in the WHERE clause, which matches up a column from one table with the other. The equivalent explicit join is:

    SELECT p.playerID, Name, ERA, IP
    
FROM (SELECT playerID
                
, SUM(ER)/SUM(IPOuts/3)*9 AS ERA
                
, SUM(IPOuts/3) AS IP
        
FROM pitching
        
GROUP BY playerID
        
HAVING IP >= 1000) p

    JOIN (SELECT CONCAT(nameFirst," ",nameLast) AS Name
                
, playerID
                
FROM MASTER) m
            
ON p.playerID = m.playerID
    
ORDER BY ERA ASC
            
LIMIT 50;

You want to be careful when writing the WHERE or ON clause of a join—if you aren’t careful, you’ll end up doing what’s called a Cartesian join, or essentially joining every row in one table to every row in another. For those of you keeping score at home, that would mean joining 39,016 rows from the pitching table to 17,264 rows from the master table, giving you… 673,572,224 rows as a result.

You do not want that.

We’re far from done with joins—they’re quite possibly the most powerful (and most dangerous!) tool in your SQL arsenal. But for right now let’s move along to some other things. We’ll be back.

Dealing with data types

I want you to see this at least once before it creeps up on you by accident. I do not, on the other hand, want to be the sort of tutorial writer who hands you a code sample and then, when it doesn’t work, shouts “Gotcha!”; So I’ll be clear with you: This doesn’t work right. Let’s say you want the top 50 player seasons with more runs scored than runs batted in.

Try running this:

    SELECT playerID
            , yearID
            
, teamID
            ,
R-RBI AS R_RBI
    
FROM batting
    
ORDER BY R_RBI DESC
            
LIMIT 50;

We are being lied to here. Nobody, but nobody, has 18,446,744,073,709,551,615 more runs than RBI in a single season. I don’t even know how to say that number.

The problem is with data types. SQL has an awful lot of data types, but we’re only going to look at a few of the more common ones.

Text is generally stored as “varchar,” or “Variable Character Field.” What that means is simply that you can store any length of text in there up to the size of the field. There are some other text formats, but nothing to really be worried about right now.

There’s a much greater diversity of number formats, which can be broken down into three types:

  1. Integers, or whole numbers.
  2. Decimals, for precise fractional values.
  3. Floats, for approximate fractional values.

These can be further subdivided: you’ll see values like “tinyint”; or “smallint”; thrown around, for example, as ways of describing the size of the number you can store in that column.

There’s also the issue of signs, which becomes important here. For SQL, what we mean by signs is the plus/minus sign—whether a number is positive or negative.

You can store numbers in MySQL either as “signed”; or “unsigned”;. Unsigned values cannot be negative. The Baseball Databank stores all numbers as unsigned.

(Why? Because it saves storage space and time.)

So in order for you to deal with unsigned numbers where values may be negative, you may have to change the datatype before doing any calculations. We do that with a CAST function:

    SELECT b.playerID
            , b.yearID
            
, b.teamID
            
, CAST(R-RBI AS SIGNED) AS R_RBI
    
FROM batting b
    ORDER BY R_RBI DESC
            
LIMIT 50;

The heart of the CASE

There are times when you want your query to be able to make some decisions on its own. Let’s say , for instance, that we want to look at pitchers between the ages of 25 and 29, post WWII. Remember, a player’s age is generally listed as his age come July 1. So, a simple little query here:

    SELECT p.playerID
            
, m.Name
            
, p.yearID
            
, (CASE WHEN m.birthMonth < 7
                    
THEN ( p.yearID - m.BirthYear )
        
ELSE ( p.yearID - m.BirthYear - 1 ) END) AS Age
            
, p.ER/p.IPOuts/3*9 AS ERA
            
, p.IPOuts/3 AS IP
    
FROM pitching p
            
, (SELECT CONCAT(nameFirst," ",nameLast) AS Name
                
, playerID
        
        , birthMonth
                
, birthYear
        
    FROM MASTER) m
    
WHERE p.yearID > 1943
        
AND p.playerID = m.playerID
    
GROUP BY playerID
    
HAVING Age BETWEEN 25
        
AND 29
    
ORDER BY ERA ASC
            
LIMIT 50;

The CASE statement allows you to have different outputs based upon the underlying data. This could have easily been written as an IF statement as well:

    , IF(m.birthMonth < 7
                
,( p.yearID - m.BirthYear )
                ,(
p.yearID - m.BirthYear - 1 )) AS Age

I like CASE because it feels cleaner, personally. You can also use multiple WHEN statements inside a CASE.

Saving your work

Let’s say you’ve written an amazing, bad-ass query. You know you’re going to want to look at this one later. You have two options:

    CREATE VIEW view_name AS
    CREATE TABLE
table_name AS

Simply put either one of those at the front of your query.

So what’s the difference? A view:

  1. Updates whenever the underlying data does.
  2. Preserves the query used to generate it.

Creating a table:

  1. Does not update to incorporate changes in data.
  2. Does not run the query every time it’s called.

A view is more flexible; a table is quicker.

Also, I recommend creating a new database to store tables in—multiple databases if you have multiple projects. So let’s say we wanted to start a new database for a project that does Marcels projections. Right-click in the pane on the left and select “Create Database.” Call it Marcels and click on it to make it the active database. To run queries on tables in the Baseball Databank, simply prefix the names of the tables with bdb., like so:

bdb.batting

bdb.pitching

Well after that little slip, it would hardly be fair to send you on your way without some Marcels projections, wouldn’t it?

First off, I just want to thank Tangotiger, who came up with the Marcels, and Sal Baxamusa, whose spreadsheet versions of the Marcels were invaluable as learning aids for me.

I tried my best to reimplement the Marcels as faithfully as possible, but there are (likely) differences between Tango’s work and mine. All credit goes to him and Sal, all blame for mistakes and errors goes to me.

Here’s the code for hitter projections. Have fun with it, play around with it. Try to make it better, try to break it, try to take it apart and see how it works.

That should be enough to get you started. If you have questions, don’t be afraid to ask! And most of all, have fun.

References & Resources

If all of this seems a little overwhelming, can I recommend the Baseball Reference Play Index?

Thanks to the Simple-Talk Code Prettifier, which I used to format the code samples in the article.

Some helpful resources:

Ready to move up to a full play-by-play database? Try this on for size.


34 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Tom M. Tango
14 years ago

Very, very nice Colin.

***

CONCAT(nameFirst,” “nameLast)

Not: CONCAT(nameFirst,” “, nameLast) ??

Nick Steiner
14 years ago

Great job Colin.

Adam W
14 years ago

I could be wrong, and I sometimes am, but I think you’re slugging calc is off.

You use:
(H+2B+2*3B+3*HR)/AB AS SLG

I think this should actually be:
(H+2B*2+3B*3+HR*4)/AB AS SLG

Unless I am misunderstanding something.

Colin Wyers
14 years ago

Tom, you are correct. We’ll get that fixed here shortly. Sorry about that.

Adam – no, the formula is correct as written. Remember that H is:

1B+2B+3B+HR

So when you expand it out:

(1B+2B+3B+HR+2B+2*3B+3*HR)

or

(1B+2*2B+3*3B+4*HR)

Otherwise you double-count all extra base hits.

Adam W
14 years ago

I knew I was missing something.  I see where my misunderstanding is now.  Thanks.

obsessivegiantscompulsive
14 years ago

Thank you so much for this article!!! 
Much appreciated!!!

I have been wanting to do something like this but the technical details of setting up the MySQL was daunting even though I was once a programmer and system manager.

Happy Turkey-Day!

RZ
14 years ago

This is so much helpful for a learning SQL and applying it to my pitch f/x database.

Janson
14 years ago

Amazing. I feel like the sun just cracked the horizon and a new day is starting.

Just so you know, I will use SAS to ingest that database you pointed at and get lots of off the shelf statistical tools along with it. In any case, this is an amazing instruction set.

Lee Panas
14 years ago

Very good job Colin.  I use old school SAS methods rather than SQL but SQL is generally the way to go and this will give people a good place to start.

Lee

john
14 years ago

Good stuff.

I got a question.  If I already have data from baseball databank for a previous season is there a way to update it or do I just walk through the steps each year and write over the previous years database?

I liked to see more on SQL querries….more complicated ones.  I’m at a point where I have the baseball databank database, game log database, retrosheet database and pitchfx database, however whenever I want to run a querry I have problems extracting exactly what I’m wanting.

For example, we were talking on Metsgeek a few weeks ago, we wanted to know how many players in baseball history had OBP less than their AVG for a season (why we were wanting to know this im not so sure lol) but I have the databases just not the experience writing querries to access the info I want.

Boozer
14 years ago

Cool stuff! This is the first time I’ve ever tried to put a DB together. Get this error on the last one though:

Error Code : 1054
Unknown column ‘m.birthMonth’ in ‘field list’

I can’t figure out what’s wrong:

, (CASE WHEN m.birthMonth < 7
      THEN ( p.yearID – m.BirthYear ) 
      ELSE ( p.yearID – m.BirthYear – 1 ) END) AS Age

Tim
14 years ago

I am planning on quitting my job and playing with this full time.  My wife is not happy with you or me right now.

Sal Paradise
14 years ago

This is the sort of stuff we need more of.

The quicker we can get people access to the tools, the more people we will have creating content to sift through, think about, and improve on.

Michael
14 years ago

Thanks for this Colin. I was at a loss when I couldn’t find the old article on StatSpeak due to MVN’s format changes. This should be a good refresher for me.

John Burnson
14 years ago

Boozer- The text has been updated.

J. Cross
14 years ago

Thanks! This is fantastic.  I just got MySGL and the BDB on my mac as well as the code for Marcels and I’m having a good time playing around with all of it.

John Burnson
14 years ago

john- The key is your WHERE clause. You need a clause like:

WHERE teamID = “NYN” AND (H+BB+HBP)/(AB+BB+HBP+SF)< H/AB

The first bit filters by team, the second bit filters by OBP < BA, and the “AND” says to filter by both. (You can also use “OR”, and you can include many “AND” and/or “OR” clauses.)

When working with ratios, one thing to watch out for is “Divide by zero” errors. Here, this could happen if your query fetches records where AB=0. So you probably need to add an “AND”:

WHERE teamID = “NYN” AND (H+BB+HBP)/(AB+BB+HBP+SF)< H/AB AND AB>0

Now, for something with the breadth of Baseball Databank, finding the right team ID can be a trick. To see all available ID’s, you can do a SELECT DISTINCT:

SELECT DISTINCT teamID FROM batting;

You can even sort the results alphabetically:

SELECT DISTINCT teamID FROM batting ORDER BY teamID;

But the right ID can still be non-obvious. The safest route is to pick a player from your team, search for him in MASTER, and then look for his seasons in the hitting or pitching table (because the MASTER table doesn’t have teams). To find the code for the Mets, I first looked up Mike Pelfrey:

SELECT playerID FROM master WHERE nameLast = “Pelfrey” and nameFirst = “Mike”;

And then I got the team from the pitching table:

SELECT teamID FROM pitching WHERE playerid = “pelfrmi01”;

(Actually, this result returns “NYN” three times, since Pelfrey has three team-seasons in the pitching table. If you wanted only one row returned, you could SELECT DISTINCT teamID.)

You could even run the two statements in one step:

SELECT teamID FROM pitching WHERE playerid = (SELECT playerID FROM master WHERE nameLast = “Pelfrey” and nameFirst = “Mike”);

Here, we select for all records from the pitching table where the playerID is the result of the parenthetical select. SQL can manage both these selects at the same time and in the right order.

As for your question… Baseball Databank finds only two player-seasons where OBP<BA, and none in more than 2 AB. Neither was a Met (though one was a Yankee).

The player ID’s are ‘clarkro02’ and ‘thronma01’. To identify these guys, query the MASTER table:

SELECT nameLast, nameFirst from MASTER WHERE playerID IN (“clarkro02”,“thronma01”);

The “IN” filters based on a list of acceptable values that we provide.

Here are your culprits:

Clark, Ron
Throneberry, Marv

John s
14 years ago

Wow thanks

the queries seem so obvious once you lay them out lol

HCRumb
14 years ago

One of the strangest ideas in the quantitative analysis of baseball is the idea that knowing how to query an sql database is tantamount to being proficient with statistical methods.

Over and over you see this.  Statistical methods being implicitly or explicitly reduced to database queries or proficiencies with a technology.

The technology is not the methodology. It’s not even particularly important.

Brian Cartwright
14 years ago

Step One is having access to the data, which Colin has explained here. Step Two is knowing what to with it.

Ian
14 years ago

RCrumb: The technology is particularly important.  How much of this sabermetric work could have been done (in a reasonable amount of time) without computers?  Regressions take forever by hand, and even simple questions (what happens to a player the year after a leap in performance) require databases to answer. 

Colin, this is great, wish I’d had it a couple years ago.  Aspiring sabermetricians everywhere thank you.

Peter
14 years ago

Colin,
I think this article does more to advance sabermetric research than anything published in a long time.  Excellent work.

Brian Cartwright
14 years ago

I use MySQL now after starting on Access.

Question 1: I haven’t used views yet, which you say are slower than tables. If I go thru several sequential steps, creating a table at each to get to my final output, I have to run queries in the correct order to create each supporting table. If all the intermediate steps are done as views instead, is this slower to get to the end?

Question 2: I already run MySQL 5.0. Your link above is for 5.4. Is there any problem installing this over an existing 5.0?

Comment 1: IFNULL does the same as COALESCE, with the same arguments. The name is self explanatory and easier to spell.

Comment 2: Good work! Despite being an experienced MySQL user in sabermetrics, I know I don’t know everything and learned several tips from this article.

Frank Pereiro
14 years ago

Nice job.

I have a question on the Marcel projections. Are those projections for the 2009 season only?

Thanks, keep up the great work

Jim Furtado
14 years ago

Nice write up. SQL isn’t the easiest thing for database novices. For people with Microsoft Access on your computer, you can use Microsoft Access as a front-end to the data. Here’s the Mysql instructions on how to do it:
http://dev.mysql.com/doc/refman/5.0/en/connector-odbc-examples-tools-with-access.html#connector-odbc-examples-tools-with-access-linked-tables

tom [NOT my username on 'Net]
14 years ago

Colin:

I only found your site and this page after perusing past posts on baseball-fever late last night.

Specifically, I read the entire thread on sabermetric evaluation of catchers [started with defense runs, Oiazza vs. I-Rod], then ran into a sensational set of tangents having to do with player position challenges and values.

In sum [and without mentioning other names] thank you for sanity, clarity, organization and extreme patience in that series of discussions … both enlightening and perceptive, without a smidgen of self promotion.

Thanks for that thread, which opened the door to this data – priceless.

Atticus
14 years ago

I would also be very curious to see the code we can use to create to generate Marcel pitching projections in addition to the code already available (linked above) for the hitting projections.

I have searched Google, BaSQL, and the Inside The Book blog and cannot find the SQL for pitching Marcels.

Anyone know if this is available anywhere online?

Colin Wyers
14 years ago

I’ve done you one better and searched through several hard drives of mine, and even I can’t find my SQL code for pitcher Marcels. I know a lot of people want it, and so one of these days I may just sit down and hash it out again.

That said – everything in those two StatSpeak articles you linked is contained in this article here. I did have a copy of the originals, and the bulk of the text presented here is ripped straight from that. (A lot of trims were made where I felt I was being too wordy, and some of the examples were changed where I felt I could make improvements either for clarity or to correct mistakes.)

And the Retrosheet tutorial was rewritten as well for THT:

http://www.hardballtimes.com/main/article/building-a-retrosheet-database-the-short-form/

In almost every respect I think that version is superior to the original. It has been pointed out to me that some tables (rosters and teams, I think) are missing from the updated version. I don’t know that I use those tables – I typically can find whatever I need in the Baseball Databank teams and master tables instead. If there’s a great need for them, I can try to get that figured out and put on BaSQL at some point.

Atticus
14 years ago

Thanks very much for the response Colin. It is good to know the two statspeak.net articles have actually been captured in your article above.

It is even better to know you might be able to work on the pitching Marcels code and make it available on BaSQL.

I trust you know such articles and efforts, by you and others, is very much appreciated. Not just by me, either. I think I can safely say by all consumers of your efforts and those of others. Right now many of us are sabermetric enthusiasts, but seeing what others have done may just plant the seed for future contributors to the community.

Atticus
14 years ago

Does anyone know where the following two articles Mr. Wyers wrote at Statistically Speaking went? Or, for that matter, where Statistically Speaking went? I am using the article above and the BtB Saberizing A Mac community project to set up my own databases and have seen more than one mention or link to the statspeak.net Sabermetricians Workbench article but have not been able to access them, including at the Internet Archive.

http://statspeak.net/2008/11/building-a-sabermetricians-workbench-part-i.html

http://statspeak.net/2008/11/building-a-sabermetricians-workbench-part-ii.html

Ben
12 years ago

Thank you so much! This just made my week!

Burberry bags outlet
12 years ago
Zac
9 years ago

Hi sorry to bother you with a bit of a beginner question, but, whenever I copy the batting projections code and execute the query it comes up with this error:

1 queries executed, 0 success, 1 errors, 0 warnings

Query: CREATE TABLE batting_pos AS SELECT b.playerID , CAST(b.yearID AS SIGNED) AS yearID , POW(0.9994,(2009-CAST(b.yearID AS SIGNED))*…

Error Code: 1146
Table ‘bdb.batting’ doesn’t exist

Execution Time : 0 sec
Transfer Time : 0 sec
Total Time : 0.004 sec
I’m only young so please don’t attack me for being a massive noob. Just interested in sabermetrics and wanted to give this a go. Thanks!

Shay
9 years ago

Jeez dude, this is superb. Thanks for getting me started in a clear, easy to understand way. Mad props.