A Short(-ish) Introduction to Using R Packages for Baseball Research

by Bill Petti
September 21, 2015

This visual from Jacob deGrom’s May 21 start is just one example of the outputs you can create in R.

Introduction

If you follow me at all you’ll know that I love R, the statistical programming language. There is a bit of a learning curve, but it’s pretty minimal compared to some other languages and software programs. Best of all, it’s free, and there is a massive network of contributors who are constantly building new packages that make it extremely easy to apply all sorts of techniques and functions to your data.

Our fearless editor, Paul Swydan, asked if I would write up what R packages I regularly use.

There are some great resources out there for learning R and for learning how to analyze baseball data with it. In fact, a few pretty smart people wrote a fantastic book on the subject, coincidentally titled Analyzing Baseball Data with R. I can’t say enough about this book as a reference, both for baseball analysis and for R. Go and buy it. What follows is in no way a substitute for that book; instead, think of this as a quick reference based on some of the tools that I regularly use (or in some cases, should probably use more).

I would also highly recommend the free, on-line edX course Sabermetrics 101. The course is run by Andy Andres and features not only an introduction to sabermetric analysis, but also SQL and R. I walked through the first version and have heard that the latest version is even better. There’s also the three-part series (parts one, two and three) Brice Russ did at TechGraphs on using R for sports stats.

Note that what follows is really meant for those just getting started with R, or who haven’t yet used R for baseball research, rather than those who are more experienced.

Getting Started

Before we actually analyze anything, we need to make sure we have R set up. This is pretty simple to do, and there are about 440,000 search results for installing R and RStudio, so I’ll just provide a very high-level view of how to do this.

First, get yourself over to CRAN (The Comprehensive R Archive Network) and, on the first page, you will see links to download and install R for Linux, Mac or Windows.

Second, after you’ve installed the latest version of R, I highly recommend grabbing an IDE (Integrated Development Environment), specifically RStudio. An IDE, in case you’re not familiar with the concept, is a programming environment that has been packaged as an application program, typically consisting of a code editor, a compiler, a debugger, and a graphical user interface (GUI) builder. Pretty snazzy. RStudio is free to install and makes working with R even easier than it already is. Here’s a screen shot from my setup:

In the upper left is the console where you can input commands and view some output. You can view data sets and source code in the bottom left window. The right side has windows for viewing the objects available in your current environment–like data sets–as well as an area to view and install packages, plots that you’ve created, and search for help.

Also, you will need to load the various packages into R from CRAN (and from beyond the CRAN). An easy way to do this is by using the pacman package. First, install the package:

install.packages("pacman")

# load pacman so that p_load is available in your session
library(pacman)

Once you have pacman installed and loaded you can use the p_load function to install and load multiple packages at once, simply by typing in the names of the packages. For example:

p_load(Lahman, dplyr)

If either of the two packages (Lahman or dplyr) is not already installed on your system, p_load will install it before loading it into your R session, which is pretty convenient.
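
To make the install-then-load behavior concrete, here is a rough base R sketch of the same idea. It is only an illustration and not how pacman is actually implemented; the helper name load_or_install is made up for this example.

```r
# Hypothetical helper (not part of pacman): install a package only if it is
# missing, then attach it to the current session.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# "stats" and "utils" ship with R, so nothing is downloaded here
load_or_install(c("stats", "utils"))
```

In day-to-day work p_load is still the more convenient choice; the sketch just shows there is no magic involved.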

Baseball-specific Packages

You can’t analyze baseball data without the data. Thankfully, the Lahman package makes it easy to get started.

As the name suggests, the Lahman package allows you to access the incredible Lahman database without having to actually download and install the database itself. The package is essentially just a collection of all the tables from the Lahman database in a set of data frames. Let’s load the package:

p_load(Lahman)

When you load the package the global environment won’t show you anything, but, trust me, the data are there. The documentation linked to above has a full accounting of the data frames included, but basically they mirror the separate tables available in the regular Lahman database. The first few rows of the Master table (renamed People in more recent versions of the package) can be viewed using the head function:

head(Master)

   playerID birthYear birthMonth birthDay birthCountry birthState  birthCity
1 aardsda01      1981         12       27          USA         CO     Denver
2 aaronha01      1934          2        5          USA         AL     Mobile
3 aaronto01      1939          8        5          USA         AL     Mobile
4  aasedo01      1954          9        8          USA         CA     Orange
5  abadan01      1972          8       25          USA         FL Palm Beach
6  abadfe01      1985         12       17         D.R.  La Romana  La Romana
  deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
1        NA         NA       NA         <NA>       <NA>      <NA>     David
2        NA         NA       NA         <NA>       <NA>      <NA>      Hank
3      1984          8       16          USA         GA   Atlanta    Tommie
4        NA         NA       NA         <NA>       <NA>      <NA>       Don
5        NA         NA       NA         <NA>       <NA>      <NA>      Andy
6        NA         NA       NA         <NA>       <NA>      <NA>  Fernando
  nameLast        nameGiven weight height bats throws      debut  finalGame
1  Aardsma      David Allan    205     75    R      R 2004-04-06 2013-09-28
2    Aaron      Henry Louis    180     72    R      R 1954-04-13 1976-10-03
3    Aaron       Tommie Lee    190     75    R      R 1962-04-10 1971-09-26
4     Aase   Donald William    190     75    R      R 1977-07-26 1990-10-03
5     Abad    Fausto Andres    184     73    L      L 2001-09-10 2006-04-13
6     Abad Fernando Antonio    220     73    L      L 2010-07-28 2013-09-27
   retroID   bbrefID  deathDate  birthDate
1 aardd001 aardsda01       <NA> 1981-12-27
2 aaroh101 aaronha01       <NA> 1934-02-05
3 aarot101 aaronto01 1984-08-16 1939-08-05
4 aased001  aasedo01       <NA> 1954-09-08
5 abada001  abadan01       <NA> 1972-08-25
6 abadf001  abadfe01       <NA> 1985-12-17

The key to the Lahman package is that to get the most out of it you will need to perform SQL-like queries on the tables in R. There are multiple ways to do this, a few of which I will explore in the next section on data-manipulation packages like dplyr and sqldf.

Of course, Lahman doesn’t include play-by-play or PITCHf/x data. Thankfully, there are a few other packages you can use to grab this information.

In terms of PITCHf/x data, the best package I’ve seen is Carson Sievert’s pitchRx. It’s just phenomenal. There is no way to cover all of its features here, so I’ll just introduce it for the moment.

The package allows you to scrape specific data or build and store your own PITCHf/x database. Let’s say you want to view data from Jacob deGrom’s May 21 start against the Cardinals. The scrape function makes this extremely easy:

library(pitchRx)
library(dplyr)
dat <- scrape("2015-05-21", "2015-05-21")
locations <- select(dat$pitch, pitch_type, start_speed, px, pz, des, num, gameday_link)
names <- select(dat$atbat, pitcher, batter, pitcher_name, batter_name, num, gameday_link, event, stand)
data <- names %>% filter(pitcher_name == "Jacob DeGrom") %>% inner_join(locations, ., by = c("num", "gameday_link"))

Essentially, you create two tables — locations, which pulls from the pitch table, and names, which pulls from the at-bat table — and then join them together, filtering on the pitcher’s name. Here are the first six rows of data:

head(data)

  pitch_type start_speed     px    pz             des num
1         CU        80.5  0.064 2.930   Called Strike   1
2         SL        89.8 -0.839 2.037   Called Strike   1
3         FF        96.9 -1.568 4.558            Ball   1
4         FF        95.5 -0.565 4.135 Swinging Strike   1
5         FF        95.4 -1.381 2.949            Ball   2
6         SL        91.4 -0.596 1.755   Called Strike   2
                    gameday_link pitcher batter pitcher_name    batter_name
1 gid_2015_05_21_slnmlb_nynmlb_1  594798 543939 Jacob DeGrom    Kolten Wong
2 gid_2015_05_21_slnmlb_nynmlb_1  594798 543939 Jacob DeGrom    Kolten Wong
3 gid_2015_05_21_slnmlb_nynmlb_1  594798 543939 Jacob DeGrom    Kolten Wong
4 gid_2015_05_21_slnmlb_nynmlb_1  594798 543939 Jacob DeGrom    Kolten Wong
5 gid_2015_05_21_slnmlb_nynmlb_1  594798 572761 Jacob DeGrom Matt Carpenter
6 gid_2015_05_21_slnmlb_nynmlb_1  594798 572761 Jacob DeGrom Matt Carpenter
      event stand
1 Strikeout     L
2 Strikeout     L
3 Strikeout     L
4 Strikeout     L
5    Single     L
6    Single     L

DeGrom threw a gem that day, striking out 11 and walking none over eight innings, and we now have pitch type, speed, location and result data for each of the 104 pitches he threw in that game. Analyzing the data, however, requires the use of some other packages–like dplyr–which we will get into below.

The problem with PITCHf/x data is that the system came online only in 2008, and the data took some time to become both comprehensive and reliable. Long before we had PITCHf/x, however, we had the amazing Retrosheet. The fact that these data are freely available is just tremendous, but what if you don’t want to deal with a bunch of csv files, or build your own database? Well, there is a new package out — creatively titled retrosheet — which looks promising.

I have not used this package much, but I think it’s worth exploring more. For example, you can pull roster data for a given year and look at specific teams with a single line of code. Here’s how to pull the 1969 Mets roster, with the first 10 players shown:

library(retrosheet)

retro <- getRetrosheet("roster", 1969)

retro$NYN

    retroID      Last  First Bat Throw Team Pos
1  ageet101      Agee Tommie   R     R  NYN   X
2  boswk101   Boswell    Ken   L     R  NYN   X
3  cardd101  Cardwell    Don   R     R  NYN   X
4  chare101   Charles     Ed   R     R  NYN   X
5  clend101 Clendenon   Donn   R     R  NYN   X
6  collk101   Collins  Kevin   L     R  NYN   X
7  dilaj101   DiLauro   Jack   B     L  NYN   X
8  dyerd101      Dyer  Duffy   R     R  NYN   X
9  frisd101  Frisella  Danny   L     R  NYN   X
10 garrw101   Garrett  Wayne   L     R  NYN   X

With just a few more lines of code, you could also pull their schedule for that season:

retro_sch <- getRetrosheet("schedule", 1969)
NYMa <- filter(retro_sch, VisTeam == "NYN")
NYMh <- filter(retro_sch, HmTeam == "NYN")
NYM1969 <- rbind(NYMa, NYMh) %>% arrange(Date)

head(NYM1969)

      Date GameNo Day VisTeam VisLg VisGmNo HmTeam HmLg HmGmNo TimeOfDay
1 19690408      0 Tue     MON    NL       1    NYN   NL      1         d
2 19690409      0 Wed     MON    NL       2    NYN   NL      2         d
3 19690410      0 Thu     MON    NL       3    NYN   NL      3         d
4 19690411      0 Fri     SLN    NL       4    NYN   NL      4         d
5 19690412      0 Sat     SLN    NL       5    NYN   NL      5         d
6 19690413      0 Sun     SLN    NL       6    NYN   NL      6         d
  Postponed Makeup
1        NA     NA
2        NA     NA
3        NA     NA
4        NA     NA
5        NA     NA
6        NA     NA

I encourage you to play around with it, as you can also pull event and game log data.

The last baseball-specific package is the ambitious openWAR project by Ben Baumer, Shane Jensen and Gregory Matthews. I say it’s ambitious because it isn’t just a package that is useful for gathering data, but it aims to implement a more transparent and reproducible version of Wins Above Replacement, as well as provide transparency into the uncertainty of our estimates of individual player WAR.

For a full rundown you should read their detailed paper, which can be accessed here in PDF format, as well as this presentation from 2013.

The package relies on parsing data from MLB’s Gameday server, the same source the pitchRx package above uses, except that it pulls the results of at-bats instead of every pitch. I have not used openWAR much, but it is on my list to explore in greater detail. That being said, I highly recommend diving into it as it appears to be a great way to not only grab data, but also analyze player performance in a rigorous way.

As cool as these packages are, one can’t live by baseball-themed packages alone. You need some help manipulating the data, and that’s where we will focus next.

Data-manipulation Packages

You probably noticed in some of the code above some additional packages and functions that were not part of the baseball-specific packages. Those I am characterizing as data-manipulation packages and they are every bit as important to conducting any kind of analysis in R, baseball or otherwise.

Now, there are tons of packages one could use to manipulate data in R. Here, I’ll outline a few I find most useful on a day-to-day basis.

Connecting to Databases

Let’s say you have a database, either on your hard disk or one you connect to remotely, that you want to interact with from within the R environment. There are a few packages you could leverage, but the one I currently use is RMySQL. RMySQL allows you to establish a connection to your database and then perform regular SQL queries on the data to your heart’s content.

As an example, assume you have the Lahman database installed on your computer already. Rather than fire up your favorite SQL tool, run a query, export the data, and then import it into R for analysis, you can simply do all of this from within R.

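First, you need to provide your connection information and save that as an object (we’ll call it con), and then run your query:

library(RMySQL)

# the connection details below are placeholders; substitute your own database name, host and credentials
con <- dbConnect(MySQL(), dbname = "lahman", host = "localhost", user = "your_username", password = "your_password")

thirty_thirty <- dbGetQuery(con, "SELECT CONCAT(m.nameLast, ', ', m.nameFirst) as 'Player', yearID as 'Season', teamID as 'Team', HR, SB FROM batting b JOIN Master m ON b.playerID=m.playerID WHERE HR >= 30 AND SB >= 30 ORDER BY yearID DESC")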

# make sure to close your connection and detach the package from your environment before using another SQL-like package, like sqldf below

dbDisconnect(con)

detach("package:RMySQL")

We now have a data set with 58 cases, one for each instance in baseball history in which a player has amassed 30 home runs and 30 stolen bases in a single season. Here are the first six records:

head(thirty_thirty)

            Player Season Team HR SB
1      Braun, Ryan   2012  MIL 41 30
2      Trout, Mike   2012  LAA 30 49
3       Kemp, Matt   2011  LAN 39 40
4     Kinsler, Ian   2011  TEX 32 30
5      Braun, Ryan   2011  MIL 33 33
6 Ellsbury, Jacoby   2011  BOS 32 39

Data Munging

The vast majority of most analyses consists of data acquisition and, more importantly, data munging: essentially, cleaning and manipulating the data into the right form for whatever particular analysis you want to conduct.

Sometimes you already have a data set, or multiple data sets, loaded into R that are not accessible in some sort of database and you need to merge them together. For example, let’s say you had a table of player names along with some type of player IDs, and another table with player statistics but no names. This is a very simplified example, but one we run into all the time. Just look at our RMySQL example above. We needed to join the player name from the Master table to the player’s performance data from the Batting table. If you aren’t working in a database you might just pull open both tables in Excel and use the VLOOKUP function to merge the two. But if you have them in R you can just use the SQL syntax you are used to by leveraging the sqldf package.

Sqldf is an easy-to-use package that allows you to manipulate separate data objects in R as if they were tables in a database. For simplicity’s sake, assume you have the Lahman Master and Batting tables downloaded as CSV files and you’ve loaded both into R as data frames named Master and Batting. Recreating the 30-30 club data set above is incredibly easy with sqldf:

library(sqldf)

thirty_thirty_sqldf <- sqldf("SELECT m.nameLast||', '||m.nameFirst as 'Player', yearID as 'Season', teamID as 'Team', HR, SB FROM Batting b JOIN Master m ON b.playerID=m.playerID WHERE HR >= 30 AND SB >= 30 ORDER BY yearID DESC")

head(thirty_thirty_sqldf)
            Player Season Team HR SB
1      Braun, Ryan   2012  MIL 41 30
2      Trout, Mike   2012  LAA 30 49
3      Braun, Ryan   2011  MIL 33 33
4 Ellsbury, Jacoby   2011  BOS 32 39
5       Kemp, Matt   2011  LAN 39 40
6     Kinsler, Ian   2011  TEX 32 30

Presto! Can’t get much easier than that.
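
For comparison, the same VLOOKUP-style join can be sketched in base R with merge(). This is a minimal illustration rather than a replacement for sqldf; the two small tables, their IDs and their numbers are invented for the example.

```r
# Hypothetical lookup table: player IDs and names
player_names <- data.frame(
  playerID = c("p01", "p02", "p03"),
  Player = c("Player One", "Player Two", "Player Three"),
  stringsAsFactors = FALSE
)

# Hypothetical stats table: IDs but no names (values are invented)
player_stats <- data.frame(
  playerID = c("p01", "p02", "p03"),
  HR = c(31, 12, 35),
  SB = c(33, 40, 30),
  stringsAsFactors = FALSE
)

# merge() plays the role of VLOOKUP or a SQL JOIN on the shared key
joined <- merge(player_stats, player_names, by = "playerID")

# keep only the 30-30 seasons, mirroring the WHERE clause above
club <- joined[joined$HR >= 30 & joined$SB >= 30, c("Player", "HR", "SB")]
club
```

Here only Player One and Player Three clear both thresholds, so club ends up with two rows.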

The granddaddy of them all, however, is arguably Hadley Wickham’s dplyr package. As much as I loved using sqldf, someone told me that eventually dplyr would become my go-to package for almost any analysis and they were right. With dplyr you can filter and slice data, select and reorder columns and variables, group and summarize data, and join data sets in much the same way you would using SQL-style queries. It is the Swiss army knife of R for data junkies.

The most important thing to know about dplyr is how to use the pipe operator, or %>%. The pipe operator simply takes whatever value is on its left and passes it to the function on its right, either as the first argument or wherever you place a period.
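
A tiny sketch of both behaviors, using a made-up vector of home run totals (the variable names are just for illustration):

```r
library(dplyr)  # loading dplyr makes the %>% operator available

hr <- c(41, 30, 39)

# The left-hand side becomes the first argument: same as mean(hr)
avg_piped <- hr %>% mean()
avg_nested <- mean(hr)

# With a period, the value goes where the period is:
# same as round(avg_nested, 1)
rounded <- 1 %>% round(avg_nested, .)
```

The piped and nested versions return the same values; the pipe just lets you read the steps from left to right.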

Let’s return to our 30-30 club data set. First, we can create that data set from scratch with dplyr:

require(dplyr)

thirty_thirty_dplyr <- filter(Lahman::Batting, HR >= 30, SB >= 30) %>%
  left_join(Lahman::Master, by = "playerID") %>%
  arrange(desc(yearID)) %>%
  mutate(Player = paste(nameLast, nameFirst, sep = ", ")) %>%
  select(Player, yearID, teamID, HR, SB)

head(thirty_thirty_dplyr)

            Player yearID teamID HR SB
1      Braun, Ryan   2012    MIL 41 30
2      Trout, Mike   2012    LAA 30 49
3      Braun, Ryan   2011    MIL 33 33
4 Ellsbury, Jacoby   2011    BOS 32 39
5       Kemp, Matt   2011    LAN 39 40
6     Kinsler, Ian   2011    TEX 32 30

We can then look at which players appeared the most times on the list:

count <- thirty_thirty_dplyr %>% group_by(Player) %>% summarise(Count = n()) %>% arrange(desc(Count))

count

               Player Count
1        Bonds, Barry     5
2        Bonds, Bobby     4
3    Soriano, Alfonso     4
4     Johnson, Howard     3
5        Abreu, Bobby     2
6       Bagwell, Jeff     2
7         Braun, Ryan     2
8           Gant, Ron     2
9  Guerrero, Vladimir     2
10       Kinsler, Ian     2
..                ...   ...

Or we can look at which seasons produced the most 30-30 players:

count_season <- thirty_thirty_dplyr %>% group_by(yearID) %>% summarise(Count = n()) %>% arrange(desc(Count))

count_season

  yearID Count
1    1987     4
2    1996     4
3    1997     4
4    2011     4
5    2001     3
6    2007     3
7    1990     2
8    1991     2
9    1995     2
10   1998     2
..    ...   ...

Or the teams with the most 30-30 seasons:

count_team <- thirty_thirty_dplyr %>% group_by(teamID) %>% summarise(Count = n()) %>% arrange(desc(Count))

count_team

 teamID Count
1     NYN     5
2     SFN     5
3     ATL     3
4     CIN     3
5     COL     3
6     LAN     3
7     NYA     3
8     PHI     3
9     TEX     3
10    CHN     2
..    ...   ...

One of my favorite uses for dplyr is for creating year-to-year data sets when I want to compare player performance or create aging curves.

As a quick example, let’s say we want to see which players saw the greatest increase in their home run rate (home runs per 600 balls in play) between 2000 and 2010. We can use Lahman and dplyr to pull this together pretty easily. We will limit the data set to those that had at least 400 at-bats in both the first and second season:

hr_y2y <- filter(Lahman::Batting, yearID >= 2000, yearID < 2011) %>%
  left_join(Lahman::Master, by = "playerID") %>%
  arrange(desc(yearID)) %>%
  mutate(Player = paste(nameLast, nameFirst, sep = ", ")) %>%
  select(Player, yearID, teamID, HR, AB, SO, SF) %>%
  # home runs per 600 balls in play
  mutate(HR_rate = round((HR/(AB+SF-SO)*600),1)) %>%
  filter(AB >= 400) %>%
  # join each season to the same player's following season
  mutate(Season_next = yearID + 1) %>%
  left_join(., ., by = c("Season_next" = "yearID", "Player" = "Player")) %>%
  filter(!is.na(HR.y)) %>%
  mutate(HR_rate_change = (HR_rate.y - HR_rate.x)) %>%
  arrange(desc(HR_rate_change)) %>%
  select(Player, yearID, teamID.x, HR.x, HR_rate.x, Season_next, teamID.y, HR.y, HR_rate.y, HR_rate_change)

head(hr_y2y)

           Player yearID teamID.x HR.x HR_rate.x Season_next teamID.y HR.y
1    Bonds, Barry   2000      SFN   49      71.7        2001      SFN   73
2 Beltran, Carlos   2005      NYN   16      19.5        2006      NYN   41
3 Ensberg, Morgan   2004      HOU   10      16.3        2005      HOU   36
4  Gonzalez, Luis   2000      ARI   31      34.1        2001      ARI   57
5      Hall, Bill   2005      MIL   17      25.4        2006      MIL   35
6      Thome, Jim   2000      CLE   37      56.8        2001      CLE   49
  HR_rate.y HR_rate_change
1     113.8           42.1
2      58.9           39.4
3      52.4           36.1
4      64.4           30.3
5      55.4           30.0
6      85.5           28.7

Man, there’s that Barry Bonds character again.
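
To see where those rates come from, here is the HR_rate arithmetic worked by hand for the top row. The AB, SF and SO inputs are assumed season totals for Bonds (they do not appear in the output above), and hr_rate is a helper written just for this illustration.

```r
# Same formula as the mutate() step above: home runs per 600 balls in play
hr_rate <- function(HR, AB, SF, SO) {
  round(HR / (AB + SF - SO) * 600, 1)
}

# Assumed season totals for Barry Bonds (not shown in the article's output)
rate_2000 <- hr_rate(HR = 49, AB = 480, SF = 7, SO = 77)  # 49 / 410 * 600
rate_2001 <- hr_rate(HR = 73, AB = 476, SF = 2, SO = 93)  # 73 / 385 * 600

change <- rate_2001 - rate_2000
c(rate_2000, rate_2001, round(change, 1))  # 71.7 113.8 42.1
```

With those inputs the results line up with the HR_rate.x, HR_rate.y and HR_rate_change values in the first row of the table.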

You get the picture. Bottom line, there are a million ways to leverage dplyr, and once you get up to speed on its functions you’ll be amazed how much easier it makes your life.

Scraping Data

No matter how robust your own database, there are usually more data you’d like to have access to. Take team records on every date in a given season. The best source I have seen for this is Baseball-Reference’s “Standings on Any Date” feature. But what if you want every day from Opening Day until the playoffs? That’s a lot of manual work. Here’s where R and a package like XML can come in very handy.

First, take a look at the URL for any date, say Opening Day this year:

http://www.baseball-reference.com/games/standings.cgi?year=2015&month=4&day=5&submit=Submit+Date

All we need to do is change the year, month and day entries in the URL to jump to another date. XML will allow you to scrape all data tables present at a given URL, and then select the table you want. Here’s what this looks like in R if we want to pull the National League East standings for Aug. 31, 2015:

p_load(XML, dplyr)

dat <- readHTMLTable("http://www.baseball-reference.com/games/standings.cgi?year=2015&month=08&day=31&submit=Submit+Date")

## here are the divisions and corresponding elements in the list

# 2 AL EAST
# 3 AL CENTRAL
# 4 AL WEST
# 5 NL EAST
# 6 NL CENTRAL
# 7 NL WEST

dat[5]

[[1]]
   Tm  W  L W-L%   GB  RS  RA pythW-L%
1 NYM 73 58 .557   -- 533 478     .550
2 WSN 66 64 .508  6.5 555 525     .525
3 ATL 54 77 .412 19.0 475 615     .384
4 MIA 53 79 .402 20.5 485 548     .444
5 PHI 52 80 .394 21.5 502 666     .373

Now, what really saves time is creating a list of dates and then letting R do the work of pulling all the records for each date for you. Behold:

# create a function for scraping the data given a specific date
date_scrape <- function(y,m,d) {
  url <- paste0("http://www.baseball-reference.com/games/standings.cgi?year=",y,"&month=",m, "&day=",d,"&submit=Submit+Date")
  d <- readHTMLTable(url, stringsAsFactors = FALSE)
  d <- as.data.frame(d[5])
  d
}

# create a complete sequence of dates you want to scrape data for
dates <- as.data.frame(seq(as.Date("2015/04/05"), as.Date("2015/08/31"), by = "days"))
names(dates) <- "dates" 

# split the dates so that there are three separate inputs to feed the function
# (colsplit() comes from the reshape2 package, so load it first)
p_load(reshape2)
dates <- colsplit(dates$dates, "-", c("y", "m", "d"))

# use the do() function to iterate the scrape function over all the dates

out <- dates %>% group_by(y,m,d) %>% do(date_scrape(.$y, .$m, .$d))

# view the first 10 rows
head(out, 10)

Source: local data frame [10 x 11]
Groups: y, m, d

      y m d  Tm W L  W.L.  GB RS RA pythW.L.
1  2015 4 5 MIA 0 0  .000  --  0  0         
2  2015 4 5 PHI 0 0  .000  --  0  0         
3  2015 4 5 WSN 0 0  .000  --  0  0         
4  2015 4 5 ATL 0 0  .000  --  0  0         
5  2015 4 5 NYM 0 0  .000  --  0  0         
6  2015 4 6 ATL 1 0 1.000  --  2  1     .780
7  2015 4 6 NYM 1 0 1.000  --  3  1     .882
8  2015 4 6 MIA 0 1  .000 1.0  1  2     .220
9  2015 4 6 PHI 0 1  .000 1.0  0  8     .000
10 2015 4 6 WSN 0 1  .000 1.0  1  3     .118

You now have a data set with 745 rows, one for each team’s record on each date in your sequence.
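
If you want to sanity-check that row count, or build the date pieces without an extra package, a base R sketch along these lines should work. It assumes five teams per division table and makes no web requests; the variable names are illustrative.

```r
# Every date from Opening Day through Aug. 31, 2015
dates <- seq(as.Date("2015-04-05"), as.Date("2015-08-31"), by = "day")

# Split each date into year, month and day with format()
parts <- data.frame(
  y = as.integer(format(dates, "%Y")),
  m = as.integer(format(dates, "%m")),
  d = as.integer(format(dates, "%d"))
)

n_days <- nrow(parts)        # 149 dates in the sequence
expected_rows <- n_days * 5  # 5 NL East teams per date = 745 rows

# Build the URL for the first date, following the pattern shown earlier
first_url <- paste0(
  "http://www.baseball-reference.com/games/standings.cgi?year=", parts$y[1],
  "&month=", parts$m[1], "&day=", parts$d[1], "&submit=Submit+Date"
)
```

The 149 dates times five teams give the 745 rows reported above, and first_url matches the Opening Day address shown earlier.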

Visualizing the Data

There are several books dedicated to using R for creating visualizations. Here I’ll just touch on my go-to package, which not surprisingly is ggplot2 — it is widely hailed as the best visualization package for R. The base of R does include various plotting tools, but ggplot2 gives you a ton of power over just about every aspect of the visual you want to create. The code does take some getting used to, but once you get the hang of it you can do some amazing stuff.

For now, let’s say we want to take and visualize all the National League East standings data. Here’s how you might approach it using our existing data, dplyr and ggplot2:

require(ggplot2)

# pare down the data set and create a single column with the date of the standings
nle_standings_2015 <- ungroup(out) %>% mutate(Date = paste(y, m, d, sep = "-")) %>% select(Date, Tm, GB)

# change the data type for the three columns
nle_standings_2015$GB <- as.numeric(nle_standings_2015$GB) 
nle_standings_2015$Date <- as.Date(nle_standings_2015$Date)
nle_standings_2015$Tm <- as.factor(nle_standings_2015$Tm)

# make sure when a team is in first it has a 0 for the games back value
nle_standings_2015$GB <- ifelse(is.na(nle_standings_2015$GB), 0, nle_standings_2015$GB)

# set the color scheme for the teams
team_colors = c("ATL" = "#01487E", "MIA" = "#0482CC", "NYM" = "#F7742C", "PHI" = "#CA1F2C", "WSN" = "#575959")

# plot the data using ggplot2
plot <- ggplot(nle_standings_2015, aes(Date, GB, colour = factor(Tm), group = Tm)) +
  geom_line(size = 1.25, alpha = .75) +
  scale_colour_manual(values = team_colors, name = "Team") +
  scale_y_reverse(breaks = 0:25) +
  scale_x_date() +
  labs(title = "NL East Race through August 2015") +
  # label each line with its games-back value on the final date only
  geom_text(aes(label = ifelse(Date == "2015-08-31", as.character(GB), '')), hjust = -.5, vjust = 0, size = 4, show.legend = FALSE) +
  theme(legend.title = element_text(size = 12)) +
  theme(legend.text = element_text(size = 12)) +
  theme(axis.text = element_text(size = 13, face = "bold"), axis.title = element_text(size = 18, color = "grey50", face = "bold"), plot.title = element_text(size = 35, face = "bold", vjust = 1))

# view the graphic
plot

And here’s the result:

We could also plot our PITCHf/x data from earlier. The pitchRx package does have some native graphic options, but we can create our own just for practice. Let’s plot the location of each of deGrom’s swinging strikes from that May 21 start, and color code each pitch by velocity:

# subset the data, keeping all rows but only columns number 1 through 5 and 13

deGrom <- data[,c(1:5, 13)]

# filter for swinging strikes

deGrom_swing <- filter(deGrom, grepl("Swinging", des)) 

# plot the pitches, coloring them by velocity

p <- ggplot(deGrom_swing, aes(px, pz, color = start_speed))

# add in customized axis and legend formatting and labels

p <- p + scale_x_continuous(limits = c(-3,3)) +
  scale_y_continuous(limits = c(0,5)) +
  annotate("rect", xmin = -1, xmax = 1, ymin = 1.5, ymax = 3.5, color = "black", alpha = 0) +
  labs(title = "Jacob deGrom: Swinging Strikes, 5/21/2015") +
  # px is the horizontal location and pz is the vertical location
  xlab("Horizontal Location (ft.): Catcher's View") +
  ylab("Vertical Location (ft.)") +
  labs(color = "Velocity (mph)")

# format the points

p <- p + geom_point(size = 10, alpha = .65)

# finish formatting

p <- p + theme(axis.title = element_text(size = 15, color = "black", face = "bold")) + theme(plot.title = element_text(size = 30, face = "bold", vjust = 1)) + theme(axis.text = element_text(size = 13, face = "bold", color = "black")) + theme(legend.title = element_text(size = 12)) + theme(legend.text = element_text(size = 12))

# view the plot

p

And the result:

We can see that the velocity of the pitches that generated swinging strikes is directly related to how high in the zone they were. This makes sense when we see what pitch types were thrown to which locations:

Those low swinging strikes were generated off of curveballs, and the higher strikes were four-seam fastballs.

Wrapping Up

I hope this is helpful, especially to those who are new to using R and thinking about how to effectively conduct baseball research using the language. You can find all the code, images, and the openWAR 2015 data file at my GitHub repository for this post. I also have a number of public repositories that include R code for other baseball-related projects, so feel free to have a look around.

There is a lot more I could have covered, specifically inferential statistics, modeling and machine learning. If it’s useful, I might cover those packages and techniques in a follow-up post. Let me know in the comments. And feel free to suggest other packages I may have missed or should consider diving into further, as well as any code improvements.

References & Resources

Bill Petti’s GitHub repository
Max Marchi and Jim Albert, Analyzing Baseball Data With R
The Comprehensive R Archive Network (CRAN)
RStudio
Carson Sievert, “pitchRx” data package
Richard Scriven, “retrosheet” data package
Ben Baumer, Shane Jensen and Gregory Matthews, “openWAR” data package
Hadley Wickham, “ggplot2” data package
CRAN, “Introduction to dplyr”
CRAN, Lahman data package PDF
CRAN, RMySQL data package PDF
CRAN, sqldf data package PDF
Dan Kopf, Priceonomics, “Hadley Wickham, the Man Who Revolutionized R”
Atmajitsinh Gohil, R Data Visualization Cookbook
Winston Chang, R Graphics Cookbook
Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis (Use R!)
Nathan Yau, Visualize This: The FlowingData Guide to Design, Visualization, and Statistics

14 Comments

M. G. Moscato

8 years ago

Excellent starter article for R newbies. And very appropriate mention of the Analyzing Baseball Data with R book; in fact, in addition to the packages, there are some other nicely helpful resources here and in the references. And I hadn’t heard about that edX course either. Thank you!

Bill Petti

Reply to M. G. Moscato

Can’t recommend Andy’s course enough–you won’t be disappointed.

Richard

Hadley Wickham recently added a package, Rvest, for web scraping. Used it to collect and munge some box scores from Baseball Reference and it worked great!

Reply to Richard

I’ve used rvest sparsely at this point, just because I am so used to XML, but it’s on my list to dive into as it appears to have some definite advantages.

Bryan Herr

This is fantastic! I am definitely going to get Analyzing Baseball Data with R. I have taken the edX Sabr101x twice and am looking forward to Sabr201x. Thanks for posting this.

Reply to Bryan Herr

Sure thing, glad it’s helpful.

BenDrozdoff

This is one of the best articles I’ve ever read. Thank you very much, Bill!

Reply to BenDrozdoff

My pleasure.

Andrea

Excellent, Thank You!! I did sign up for the EDX course, and although I did not finish it on time, I am working my way through it as i have time. I agree that it is great and the instruction is excellent.
Thanks again for this tremendous information.
Andrea

Martin Alonso

Nice article, have really enjoyed it and can’t stress how equally important SABR101x and Analyzing baseball data with R have helped me. Just one question: I’m trying to install the openWAR package but am failing miserably. What version of R are you running?

Martin

Reply to Martin Alonso

Never mind, found out that most of the packages were incompatible with Windows so I ended up downloading Linux. Now I get to do more cool graphics and analysis. Thanks a lot for this intro!

Andy

I installed R on an older Mac book, but I can’t do it on a newer laptop with Yosemite system. I get an error message. I tried to transfer the file from the one Mac to the other, and it worked, but the file will not open on the other Mac. There is some incompatibility with the Yosemite system.

Boy, do I rue the day I installed that system. Several applications I used a lot no longer work on that system.

Jon

7 years ago

Posting a while after the article, but got a question: For the PITCHf/x example, you’re selecting particular columns, but I know there’s more data about vertical and horizontal movement, spin-rate, and release point. Is there a resource that says what each of those columns are and what they represent? Thanks
