If you follow me at all you’ll know that I love R, the statistical programming language. There is a bit of a learning curve, but it’s pretty minimal compared to some other languages and software programs. Best of all, it’s free, and there is a massive network of contributors who are constantly building new packages that make it extremely easy to apply all sorts of techniques and functions to your data.
Our fearless editor, Paul Swydan, asked if I would write up what R packages I regularly use.
There are some great resources out there for learning R and for learning how to analyze baseball data with it. In fact, a few pretty smart people wrote a fantastic book on the subject, coincidentally titled Analyzing Baseball Data with R. I can’t say enough about this book as a reference, both for baseball analysis and for R. Go and buy it. What follows is in no way a substitute for that book; instead, think of this as a quick reference based on some of the tools that I regularly use (or in some cases, should probably use more).
I would also highly recommend the free, on-line edX course Sabermetrics 101. The course is run by Andy Andres and features not only an introduction to sabermetric analysis, but also SQL and R. I walked through the first version and have heard that the latest version is even better. There’s also the three-part series (parts one, two and three) Brice Russ did at TechGraphs on using R for sports stats.
Note that what follows is really meant for those just getting started with R, or who haven’t yet used R for baseball research, rather than those who are more experienced.
Before we actually analyze anything, we need to make sure we have R set up. This is pretty simple to do, and there are about 440,000 search results for installing R and RStudio, so I’ll just provide a very high-level view of how to do this.
First, get yourself over to CRAN (the Comprehensive R Archive Network) and, on the first page, you will see links to download and install R for Linux, Mac or Windows.
Second, after you’ve installed the latest version of R, I highly recommend grabbing an IDE (Integrated Development Environment), specifically RStudio. An IDE, in case you’re not familiar with the concept, is a programming environment that has been packaged as an application program, typically consisting of a code editor, a compiler, a debugger, and a graphical user interface (GUI) builder. Pretty snazzy. RStudio is free to install and makes working with R even easier than it already is. Here’s a screen shot from my setup:
In the upper left is the console where you can input commands and view some output. You can view data sets and source code in the bottom left window. The right side has windows for viewing the objects available in your current environment–like data sets–as well as an area to view and install packages, plots that you’ve created, and search for help.
Also, you will need to load the various packages into R from CRAN (and from beyond the CRAN). An easy way to do this is by using the pacman package. First, install the package:
Once you have pacman installed you can use the p_load function to install and load multiple packages at once, simply by typing in the name of the package. For example:
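As a rough sketch, the install step and the p_load call together might look like the following. The requireNamespace check is my own addition so that the install only runs when pacman is missing, and the first run needs an internet connection.

```r
# Install pacman only if it is not already available (needs internet access)
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# p_load installs any packages that are missing, then attaches them all
pacman::p_load(Lahman, dplyr)
```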
If either of the two packages (Lahman or dplyr) is not already installed on your system, p_load will install it before loading it into your R session, which is pretty convenient.
You can’t analyze baseball data without the data. Thankfully, the Lahman package makes it easy to get started.
As the name suggests, the Lahman package allows you to access the incredible Lahman database without having to actually download and install the database itself. The package is essentially just a collection of all the tables from the Lahman database in a set of data frames. Let’s load the package:
When you load the package the global environment won’t show you anything, but, trust me, the data are there. The documentation linked to above has a full accounting of the data frames included, but basically they mirror the separate tables available in the regular Lahman database. The first few rows of the Master table can be viewed using the head function:
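A minimal sketch of that call follows. One assumption to flag: recent releases of the Lahman package name the player table People rather than Master, so the snippet checks which one is present and works with either.

```r
library(Lahman)

# Newer Lahman releases call the player table People; older ones call it Master
players <- if ("People" %in% ls("package:Lahman")) People else Master

# head() returns the first six rows by default
first_rows <- head(players)
first_rows
```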
The key to the Lahman package is that to get the most out of it you will need to perform SQL-like queries on the tables in R. There are multiple ways to do this, a few of which I will explore in the next section on data-manipulation packages like dplyr and sqldf.
Of course, Lahman doesn’t include play-by-play or PITCHf/x data. Thankfully, there are a few other packages you can use to grab this information.
In terms of PITCHf/x data, the best package I’ve seen is Carson Sievert’s pitchRx. It’s just phenomenal. There is no way to cover all of its features here, so I’ll just introduce it for the moment.
The package allows you to scrape specific data or build and store your own PITCHf/x database. Let’s say you want to view data from Jacob deGrom’s May 21 start against the Cardinals. The scrape function makes this extremely easy:
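Here is a hedged sketch of the join logic only, not the original listing. With pitchRx itself, as I understand its interface, the two source tables would be the pitch and atbat elements of the list returned by scrape(start = "2015-05-21", end = "2015-05-21"). Because that call needs a live Gameday server, the snippet swaps in two tiny made-up tables with the same column names so it runs offline. Every value in them is invented for illustration.

```r
library(dplyr)

# Made-up stand-ins for the `pitch` and `atbat` tables (NOT real game data)
pitch <- data.frame(
  gameday_link = "gid_example",
  num          = c(1L, 1L, 2L, 2L, 3L),          # at-bat number within the game
  pitch_type   = c("FF", "CU", "FF", "SL", "FF"),
  start_speed  = c(95.1, 80.2, 94.8, 88.0, 92.3),
  px           = c(-0.3, 0.5, 0.1, 0.8, -0.6),   # horizontal location (ft)
  pz           = c(2.9, 1.4, 3.2, 1.8, 2.5),     # height (ft)
  des          = c("Called Strike", "Swinging Strike", "Ball",
                   "Swinging Strike", "In play, out(s)"),
  stringsAsFactors = FALSE
)
atbat <- data.frame(
  gameday_link = "gid_example",
  num          = 1:3,
  pitcher_name = c("Jacob deGrom", "Jacob deGrom", "Some Reliever"),
  batter_name  = c("Batter A", "Batter B", "Batter C"),
  stringsAsFactors = FALSE
)

# Keep just the columns of interest from each table
locations <- select(pitch, pitch_type, start_speed, px, pz, des, num, gameday_link)
names     <- select(atbat, pitcher_name, batter_name, num, gameday_link)

# Keep only deGrom's at-bats, then attach every pitch thrown in them
degrom <- names %>%
  filter(pitcher_name == "Jacob deGrom") %>%
  inner_join(locations, by = c("num", "gameday_link"))

head(degrom)
```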
Essentially, you create two tables — locations, which pulls from the pitch table, and names, which pulls from the at-bat table — and then join them together, filtering on the pitcher’s name. Here are the first six rows of data:
DeGrom threw a gem that day, striking out 11 and walking none over eight innings, and we now have pitch type, speed, location and result data for each of the 104 pitches he threw in that game. Analyzing the data, however, requires the use of some other packages–like dplyr–which we will get into below.
The problem with PITCHf/x data is that the system came online only in 2008, and the data took some time to become both comprehensive and reliable. Long before we had PITCHf/x, however, we had the amazing Retrosheet. The fact that these data are freely available is just tremendous, but what if you don’t want to deal with a bunch of csv files, or build your own database? Well, there is a new package out — creatively titled retrosheet — which looks promising.
I have not used this package much, but I think it’s worth exploring more. For example, you can pull roster data for a given year and look at specific teams with a single line of code. Here’s how to pull the 1969 Mets roster, with the first 10 players shown:
With just a few more lines of code, you could also pull their schedule for that season:
I encourage you to play around with it, as you can also pull event and game log data as well.
The last baseball-specific package is the ambitious openWAR project by Ben Baumer, Shane Jensen and Gregory Matthews. I say it’s ambitious because it isn’t just a package that is useful for gathering data, but it aims to implement a more transparent and reproducible version of Wins Above Replacement, as well as provide transparency into the uncertainty of our estimates of individual player WAR.
The package relies on parsing data from MLB’s Gameday server, the same source the pitchRx package above uses, except that it pulls the results of at-bats instead of every pitch. I have not used openWAR much, but it is on my list to explore in greater detail. That being said, I highly recommend diving into it, as it appears to be a great way not only to grab data but also to analyze player performance in a rigorous way.
As cool as these packages are, one can’t live by baseball-themed packages alone. You need some help manipulating the data, and that’s where we will focus next.
You probably noticed in some of the code above some additional packages and functions that were not part of the baseball-specific packages. Those I am characterizing as data-manipulation packages and they are every bit as important to conducting any kind of analysis in R, baseball or otherwise.
Now, there are tons of packages one could use to manipulate data in R. Here, I’ll outline a few I find most useful on a day-to-day basis.
Connecting to Databases
Let’s say you have a database, either on your hard disk or one you connect to remotely, that you want to interact with from within the R environment. There are a few packages you could leverage, but the one I currently use is RMySQL. RMySQL allows you to establish a connection to your database and then perform regular SQL queries on the data to your heart’s content.
As an example, assume you have the Lahman database installed on your computer already. Rather than fire up your favorite SQL tool, run a query, export the data, and then import it into R for analysis, you can simply do all of this from within R.
First, you need to provide your connection information and save that as an object–we’ll call it con:
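What follows is a sketch of the connect-then-query pattern, not the original listing. RMySQL plugs into the generic DBI interface, so the query step is the same dbGetQuery call for any DBI connection. Since I can’t assume you have a MySQL server handy, the runnable part swaps in an in-memory SQLite database (via the RSQLite package) loaded with a few invented rows. The commented-out dbConnect call shows roughly what the RMySQL version might look like, with placeholder credentials.

```r
library(DBI)

# With a real MySQL server the connection might look like this
# (every value below is a placeholder):
# con <- dbConnect(RMySQL::MySQL(), dbname = "lahman", host = "localhost",
#                  username = "your_user", password = "your_password")

# Stand-in so the sketch runs anywhere: an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "Master", data.frame(
  playerID  = c("playera01", "playerb01"),
  nameFirst = c("Alex", "Ben"),
  nameLast  = c("Able", "Baker"),
  stringsAsFactors = FALSE))
dbWriteTable(con, "Batting", data.frame(   # invented seasons, not real stats
  playerID = c("playera01", "playera01", "playerb01"),
  yearID   = c(1996L, 1997L, 1996L),
  teamID   = c("AAA", "AAA", "BBB"),
  HR       = c(42L, 25L, 31L),
  SB       = c(40L, 35L, 12L),
  stringsAsFactors = FALSE))

# Join names to batting lines and keep the 30 HR / 30 SB seasons
query <- "
  SELECT m.nameFirst, m.nameLast, b.yearID, b.teamID, b.HR, b.SB
  FROM Batting b
  JOIN Master m ON m.playerID = b.playerID
  WHERE b.HR >= 30 AND b.SB >= 30
  ORDER BY b.yearID"

club_30_30 <- dbGetQuery(con, query)
dbDisconnect(con)
club_30_30
```

Run against a full copy of the Lahman database, the same query would return the complete list rather than the single invented row here.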
We now have a data set with 58 cases, which align to all instances in baseball history where a player has amassed 30 home runs and 30 stolen bases in a single season. Here are the first six records:
The vast majority of any analysis consists of data acquisition and, more importantly, data munging–essentially, cleaning and manipulating the data into the right form for whatever particular analysis you want to conduct.
Sometimes you already have a data set, or multiple data sets, loaded into R that are not accessible in some sort of database and you need to merge them together. For example, let’s say you had a table of player names along with some type of player IDs, and another table with player statistics but no names. This is a very simplified example, but one we run into all the time. Just look at our RMySQL example above. We needed to join the player name from the Master table to the player’s performance data from the Batting table. If you aren’t working in a database you might just pull open both tables in Excel and use the VLOOKUP function to merge the two. But if you have them in R you can just use the SQL syntax you are used to by leveraging the sqldf package.
Sqldf is an easy-to-use package that allows you to manipulate separate data objects in R as if they were tables in a database. For simplicity’s sake, assume you have the Lahman Master and Batting tables downloaded as csv files and you’ve uploaded both into R. Recreating the 30-30 club dataset above is incredibly easy with sqldf:
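A small sketch of the idea, using two invented data frames in place of the csv files (with real files you would create master and batting with read.csv instead). The drv = "SQLite" argument is my addition: as I understand it, sqldf may otherwise try to use MySQL when RMySQL is already loaded in the session.

```r
library(sqldf)

# Invented stand-ins for the Master and Batting csv files
master <- data.frame(
  playerID  = c("playera01", "playerb01"),
  nameFirst = c("Alex", "Ben"),
  nameLast  = c("Able", "Baker"),
  stringsAsFactors = FALSE
)
batting <- data.frame(
  playerID = c("playera01", "playera01", "playerb01"),
  yearID   = c(1996L, 1997L, 1996L),
  teamID   = c("AAA", "AAA", "BBB"),
  HR       = c(42L, 25L, 31L),
  SB       = c(40L, 35L, 12L),
  stringsAsFactors = FALSE
)

# Data frames are referenced by name inside the SQL, as if they were tables
club_30_30 <- sqldf("
  SELECT m.nameFirst, m.nameLast, b.yearID, b.teamID, b.HR, b.SB
  FROM batting b
  JOIN master m ON m.playerID = b.playerID
  WHERE b.HR >= 30 AND b.SB >= 30
  ORDER BY b.yearID",
  drv = "SQLite")

club_30_30
```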
Presto! Can’t get much easier than that.
The granddaddy of them all, however, is arguably Hadley Wickham’s dplyr package. As much as I loved using sqldf, someone told me that eventually dplyr would become my go-to package for almost any analysis and they were right. With dplyr you can filter and slice data, select and reorder columns and variables, group and summarize data, and join data sets in much the same way you would using SQL-style queries. It is the Swiss army knife of R for data junkies.
The most important thing to know about dplyr is how to use the pipe operator, %>%. The pipe takes whatever value is on its left and passes it as the first argument of the function on its right, or to wherever you place a period.
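A tiny illustration with plain numbers, just to show the mechanics (the values are arbitrary):

```r
library(dplyr)  # dplyr re-exports %>% from the magrittr package

# The left-hand value becomes the first argument of the function on the right,
# so the nested call and the piped call below do the same thing
nested <- sum(sqrt(c(1, 4, 9)))
total  <- c(1, 4, 9) %>% sqrt() %>% sum()
total
# 6

# A period sends the value somewhere other than the first argument:
# here 2 is used as `digits`, not as the number being rounded
rounded <- 2 %>% round(3.14159, digits = .)
rounded
# 3.14
```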
Let’s return to our 30-30 club data set. First, we can create that data set from scratch with dplyr:
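One way this might look, as a sketch rather than the original code. It assumes the Lahman package is installed, and it again checks whether the player table is called People (newer releases) or Master (older ones). Note that Batting has one row per player, season and stint, so a player traded mid-season is judged on each stint separately.

```r
library(Lahman)
library(dplyr)

# Player table is People in newer Lahman releases, Master in older ones
players <- if ("People" %in% ls("package:Lahman")) People else Master

club_30_30 <- Batting %>%
  filter(HR >= 30, SB >= 30) %>%                 # the 30-30 condition
  inner_join(players, by = "playerID") %>%       # attach player names
  select(playerID, nameFirst, nameLast, yearID, teamID, HR, SB) %>%
  arrange(yearID)

head(club_30_30)
```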
We can then look at which players appeared the most times on the list:
Or we can look at which seasons produced the most 30-30 players:
Or the teams with the most 30-30 seasons:
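All three of those summaries are the same move: count rows by a column and sort. Here is a sketch using dplyr’s count function on a small invented table with the same columns as the 30-30 data set, so it runs on its own. The rows are made up and are not real seasons.

```r
library(dplyr)

# Hypothetical 30-30 table (invented rows, same columns as the real one)
club_30_30 <- data.frame(
  playerID = c("ablea01", "ablea01", "bakerb01", "colec01", "ablea01"),
  nameLast = c("Able", "Able", "Baker", "Cole", "Able"),
  yearID   = c(1996, 1997, 1997, 1997, 2001),
  teamID   = c("AAA", "AAA", "BBB", "AAA", "CCC"),
  stringsAsFactors = FALSE
)

# sort = TRUE puts the largest counts first; the count column is named n
by_player <- count(club_30_30, playerID, nameLast, sort = TRUE)
by_season <- count(club_30_30, yearID, sort = TRUE)
by_team   <- count(club_30_30, teamID, sort = TRUE)

by_player
```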
One of my favorite uses for dplyr is for creating year-to-year data sets when I want to compare player performance or create aging curves.
As a quick example, let’s say we want to see which players saw the greatest increase in their home run rate (home runs per 600 balls in play) between 2000 and 2010. We can use Lahman and dplyr to pull this together pretty easily. We will limit the data set to those that had at least 400 at-bats in both the first and second season:
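Here is a sketch of the year-to-year pattern under a few stated assumptions, since the exact formula isn’t spelled out above: balls in play are approximated as AB - SO + SF, and each season is compared with the season immediately after it. The function name yoy_hr_rate and its arguments are hypothetical. It accepts any data frame shaped like Lahman’s Batting table, and the demo uses invented rows so it runs without Lahman installed.

```r
library(dplyr)

# Hypothetical helper: change in HR per 600 balls in play, season to season
yoy_hr_rate <- function(batting, first_year = 2000, last_year = 2010,
                        min_ab = 400) {
  seasons <- batting %>%
    filter(yearID >= first_year, yearID <= last_year) %>%
    group_by(playerID, yearID) %>%               # combine multiple stints
    summarise(AB = sum(AB), HR = sum(HR), SO = sum(SO), SF = sum(SF),
              .groups = "drop") %>%
    filter(AB >= min_ab) %>%
    mutate(hr_600 = 600 * HR / (AB - SO + SF))   # assumed definition

  # Shift each season back a year so it lines up with the season before it
  next_season <- seasons %>%
    transmute(playerID, yearID = yearID - 1, hr_600_next = hr_600)

  seasons %>%
    inner_join(next_season, by = c("playerID", "yearID")) %>%
    mutate(change = hr_600_next - hr_600) %>%
    select(playerID, yearID, hr_600, hr_600_next, change) %>%
    arrange(desc(change))
}

# Invented batting lines (not real players)
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  yearID   = c(2000, 2001, 2002, 2000, 2001),
  AB       = c(500, 500, 300, 450, 450),
  HR       = c(20, 40, 10, 30, 15),
  SO       = c(100, 100, 50, 90, 90),
  SF       = c(0, 0, 0, 0, 0)
)

result <- yoy_hr_rate(toy)
result
```

With Lahman loaded, passing Batting in place of toy should give the real comparison, subject to the same assumptions about the rate.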
Man, there’s that Barry Bonds character again.
You get the picture. Bottom line, there are a million ways to leverage dplyr, and once you get up to speed on its functions you’ll be amazed how much easier it makes your life.
No matter how robust your own database, there are usually more data you’d like to have access to. Take team records on every date in a given season. The best source I have seen for this is Baseball-Reference’s “Standings on Any Date” feature. But what if you want every day from Opening Day until the playoffs? That’s a lot of manual work. Here’s where R and a package like XML can come in very handy.
First, take a look at the url for any date, say opening day this year:
All we need to do is change the year, month and day entries in the url to jump to another date. XML will allow you to scrape all data tables present at a given url, and then select the table you want. Here’s what this looks like in R if we want to pull the National League East standings for Aug. 31, 2015:
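A hedged sketch of the two pieces involved: building the address for a date, and reading the tables from the page. The base address and the query parameter names (year, month, day) below are placeholders, not the site’s actual URL, so swap in the real one. The function names standings_url and fetch_tables are hypothetical. The fetch step downloads the page with readLines before handing it to the XML package, because readHTMLTable does not fetch https addresses directly.

```r
# Placeholder address: replace with the real "Standings on Any Date" URL
base_url <- "https://example.com/standings"

# Build the address for one date by filling in year, month and day
standings_url <- function(date, base = base_url) {
  date <- as.Date(date)
  sprintf("%s?year=%s&month=%d&day=%d", base,
          format(date, "%Y"),
          as.integer(format(date, "%m")),
          as.integer(format(date, "%d")))
}

# Download a page and return every HTML table on it as a list of data frames.
# Defined but not called here: it needs a live connection and the XML package.
fetch_tables <- function(date) {
  html <- paste(readLines(standings_url(date), warn = FALSE), collapse = "\n")
  XML::readHTMLTable(XML::htmlParse(html, asText = TRUE),
                     stringsAsFactors = FALSE)
}

standings_url("2015-08-31")
# "https://example.com/standings?year=2015&month=8&day=31"
```

From the list that fetch_tables returns, you would then pick out the division you want by position or name, which depends on how the page happens to be laid out.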
Now, what really saves time is creating a list of dates and then letting R do the work of pulling all the records for each date for you. Behold:
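The loop itself can be sketched like this. The date range is my guess: April 5 through Aug. 31, 2015 is 149 days, which for a five-team division works out to 745 rows. To keep the example runnable offline, fake_fetch is a hypothetical stand-in that returns an empty five-team table instead of scraping anything; a real version would download and parse the page for the date it is given.

```r
# Every calendar day from 2015-04-05 through 2015-08-31 (149 days)
dates <- seq(as.Date("2015-04-05"), as.Date("2015-08-31"), by = "day")

# Hypothetical stand-in for a function that scrapes one date's standings
fake_fetch <- function(d) {
  data.frame(Tm = c("ATL", "MIA", "NYM", "PHI", "WSN"),
             W = NA_integer_, L = NA_integer_,
             stringsAsFactors = FALSE)
}

# Fetch each date in turn, tag the rows with that date, and stack the results
collect_standings <- function(dates, fetch_one) {
  daily <- lapply(seq_along(dates), function(i) {
    standings <- fetch_one(dates[i])
    standings$date <- dates[i]
    # a real scraper should pause here (e.g. Sys.sleep) between requests
    standings
  })
  do.call(rbind, daily)
}

all_days <- collect_standings(dates, fake_fetch)
nrow(all_days)
# 745
```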
You now have a data set with 745 rows, one for each team’s record on each date in your sequence.
Visualizing the Data
There are several books dedicated to using R for creating visualizations. Here I’ll just touch on my go-to package, which not surprisingly is ggplot2 — it is widely hailed as the best visualization package for R. The base of R does include various plotting tools, but ggplot2 gives you a ton of power over just about every aspect of the visual you want to create. The code does take some getting used to, but once you get the hang of it you can do some amazing stuff.
For now, let’s say we want to take and visualize all the National League East standings data. Here’s how you might approach it using our existing data, dplyr and ggplot2:
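A minimal sketch of one way to do this: compute winning percentage with dplyr, then draw one line per team with ggplot2. The choice of winning percentage over time is my assumption about what to plot, and the standings rows are invented so the snippet runs on its own.

```r
library(dplyr)
library(ggplot2)

# Invented standings: three teams, cumulative wins and losses on three dates
standings <- data.frame(
  Tm   = rep(c("Team A", "Team B", "Team C"), each = 3),
  date = rep(as.Date(c("2015-04-10", "2015-04-20", "2015-04-30")), times = 3),
  W    = c(2, 10, 15,  1, 6, 10,  3, 7, 10),
  L    = c(2,  3,  8,  3, 8, 12,  1, 7, 12),
  stringsAsFactors = FALSE
)

p <- standings %>%
  mutate(win_pct = W / (W + L)) %>%              # derive the plotted measure
  ggplot(aes(x = date, y = win_pct, colour = Tm)) +
  geom_line() +                                  # one line per team
  labs(x = "Date", y = "Winning percentage", colour = "Team")

print(p)
```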
And here’s the result:
We could also plot our PITCHf/x data from earlier. The pitchRx package does have some native graphic options, but we can create our own just for practice. Let’s plot the location of each of deGrom’s swinging strikes from that May 21 start, and color code each pitch by velocity:
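A sketch of that kind of plot, using a handful of invented pitches in place of the scraped data. The column names (px, pz, start_speed, des) follow the PITCHf/x fields used earlier; the values are made up.

```r
library(dplyr)
library(ggplot2)

# Invented pitches (NOT real game data)
pitches <- data.frame(
  px          = c(-0.4, 0.6, 0.2, -0.1, 0.9),    # horizontal location (ft)
  pz          = c(3.3, 1.2, 3.0, 2.4, 1.5),      # height (ft)
  start_speed = c(96.0, 81.5, 95.2, 94.1, 80.3), # velocity (mph)
  pitch_type  = c("FF", "CU", "FF", "FF", "CU"),
  des         = c("Swinging Strike", "Swinging Strike", "Ball",
                  "Called Strike", "Swinging Strike"),
  stringsAsFactors = FALSE
)

# Keep only the swinging strikes
swinging <- filter(pitches, des == "Swinging Strike")

# Location on the axes, velocity mapped to colour
p <- ggplot(swinging, aes(x = px, y = pz, colour = start_speed)) +
  geom_point(size = 3) +
  coord_equal() +
  labs(x = "Horizontal location (ft)", y = "Height (ft)",
       colour = "Velocity (mph)")

print(p)
```

Swapping colour = start_speed for colour = pitch_type gives the pitch-type version of the same picture.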
And the result:
We can see that the velocity of the pitches that generated swinging strikes is directly related to how high in the zone they were. This makes sense when we see what pitch types were thrown to which locations:
Those low swinging strikes were generated off of curveballs, and the higher strikes were four-seam fastballs.
I hope this is helpful, especially to those who are new to using R and thinking about how to effectively conduct baseball research using the language. You can find all the code, images, and the openWAR 2015 data file at my GitHub repository for this post. I also have a number of public repositories that include R code for other baseball-related projects, so feel free to have a look around.
There is a lot more I could have covered, specifically inferential statistics, modeling and machine learning. If it’s useful, I might cover those packages and techniques in a follow-up post. Let me know in the comments. And feel free to suggest other packages I may have missed or should consider diving into further, as well as any code improvements.
References & Resources
- Bill Petti’s GitHub repository
- Max Marchi and Jim Albert, Analyzing Baseball Data With R
- The Comprehensive R Archive Network (CRAN)
- Carson Sievert, “pitchRx” data package
- Richard Scriven, “retrosheet” data package
- Ben Baumer, Shane Jensen and Gregory Matthews, “openWAR” data package
- Hadley Wickham, “ggplot2” data package
- CRAN, “Introduction to dplyr”
- CRAN, Lahman data package PDF
- CRAN, RMySQL data package PDF
- CRAN, sqldf data package PDF
- Dan Kopf, Priceonomics, “Hadley Wickham, the Man Who Revolutionized R”
- Atmajitsinh Gohil, R Data Visualization Cookbook
- Winston Chang, R Graphics Cookbook
- Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (Use R!)
- Nathan Yau, Visualize This: The FlowingData Guide to Design, Visualization, and Statistics