Developing The baseballr Package For R

When this package is finished, it will hopefully be a mighty tool for baseball researchers and analysts.

Introduction

Late in 2015 I wrote a piece here at The Hardball Times that walked through some of my favorite R packages for gathering and analyzing baseball data. No single package has everything I need, nor should it. Following that article, I started collecting various functions that I’ve written and routinely use, and decided to compile them in a formal package that anyone can easily load and use.

I’ve never written an R package before, so this is partly an excuse for me to learn a new skill. That means the development of the package will be slow and will have its fair share of bumps along the way. I thought I would share some initial views of the kind of functions I plan to include.

Data Acquisition

I featured a number of packages in my previous article that focused on grabbing data, whether it was full season data for individual players and teams, or pitch-by-pitch data. However, right now there isn’t a package that makes it easy to pull real-time data on players during the season. The Lahman package is great, but that database is only updated once a year after the season. As a writer at The Hardball Times I have direct access to our database, but not everyone does. FanGraphs makes it very easy to download leaderboards in CSV format that include dozens of statistics for players updated daily, but there isn’t an easy way to grab that data from within R.

I am creating a series of functions that do just that.

First, there will be functions for downloading some of the most useful reference tables: FanGraphs’ historical park factors and the yearly constants and coefficients for calculating things like wOBA or FIP.

Here’s an example of the park factors function for the 2010 season. All you need to do is include a year when you call the fg_park() function and you get the following output:

> fg_park(2010)
   season    home_team basic single double triple  hr  so UIBB  GB  FB  LD IFFB FIP
5    2010       Angels    96    100     96     87  96  99   96  99 100 100   96  98
6    2010      Orioles   103    102    101     89 110  99   99 101 100 100   97 104
7    2010      Red Sox   105    101    116    100  97  99  100 101 100 103  103  99
8    2010    White Sox   105     98     99     89 113 102  107  98 102  99  101 105
9    2010      Indians    97    100    101     83  94 101  100 101  98 103  102  98
10   2010       Tigers   102    102     99    113 101  97   97 101 103  96  104 101
11   2010       Royals   101    103    103    115  93  97  100 103 100 100   92  99
12   2010        Twins   101    102    102    110  94  98  100 101  99 103  101  99
13   2010      Yankees   103    100     97     85 110 100  101 100 100  98  100 104
14   2010    Athletics    97     98     96     98  94  99  100  98 101 101  103  98
15   2010     Mariners    94     99     94     88  92 103  102  98  99  98  104  97
16   2010         Rays    95     98     95    107  95 100   99  97 100  99  110  97
17   2010      Rangers   107    102    104    118 110  99  101 100 101 103   99 104
18   2010    Blue Jays   101     98    103    117 104 102   99 100 100 100   98 101
19   2010 Diamondbacks   106    101    108    121 104  99   98 101  99 100   96 101
20   2010       Braves   100    101     99     99  97 102  102 100  99 103   95  99
21   2010         Cubs   103    101    100     99 102 100  101 100 101  98  100 101
22   2010         Reds   102     99     97    100 111 101  100  99 101  98  102 103
23   2010      Rockies   113    106    108    123 114  96  100 104  99 108   94 106
24   2010      Marlins   101    100    102    102  97 106  105  99  99  99  104  98
25   2010       Astros    99     99    100    100 104 102  101 100 100  98   99 101
26   2010      Dodgers    95     99     96     77  98 101   97  99 100  97  105  98
27   2010      Brewers   100     98    100    104 106 103  102  98 102 100  101 102
28   2010    Nationals   100    101    101     99 100  97   97 102 100 102   99 101
29   2010         Mets    97     99     97    109  93 101  101  99 102  97  108  98
30   2010     Phillies   100     99    100     92 102 101  100  99  99 101  100 100
31   2010      Pirates    97    100    101     93  92  95   97 102 100 102   98  98
32   2010    Cardinals    97    100     97     92  92  99  100 100  99 101  100  98
33   2010       Padres    92     97     93    105  89 103  103 100  98  97   97  97
34   2010       Giants    96     99    100    102  91  99   98 101  96 100   99  97
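
As a rough illustration of how a table like this might be used, the sketch below applies the basic factor to a rate stat. The three park factors are copied from the output above. The runs-per-game figures are made up for illustration only, and dividing by the factor over 100 is a simplified adjustment that I am assuming here, not a description of how any site does it.

```r
# Basic park factors copied from the 2010 fg_park() output above
park <- data.frame(
  home_team = c("Rockies", "Padres", "Cubs"),
  basic = c(113, 92, 103),
  stringsAsFactors = FALSE
)

# Hypothetical raw runs per game (invented numbers, for illustration only)
teams <- data.frame(
  home_team = c("Rockies", "Padres", "Cubs"),
  runs_per_game = c(4.75, 4.10, 4.23),
  stringsAsFactors = FALSE
)

# Join the two tables on team name, then scale by the park factor.
# A factor above 100 deflates the raw number; below 100 inflates it.
adj <- merge(teams, park, by = "home_team")
adj$runs_adj <- round(adj$runs_per_game / (adj$basic / 100), 2)

print(adj)
```

Keeping the factors in their own table and joining on team name means the same adjustment can be reused for any season once that season's table has been pulled.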

Sometimes I like to pull team data such as their schedule and record (which is very helpful for my “team consistency” work). Baseball-Reference is the easiest site to acquire this from, so I created a function that allows you to specify the team and year and get back detailed information about the outcome of each of their games.

Using the team_results_bref() function, here’s what the first 10 games of Houston’s 2015 schedule and results would look like:

> head(team_results_bref("HOU", 2015),10)
   Rk Gm#              Date  Tm H_A Opp Result R RA Inn Record Rank   GB      Win          Loss      Save Time D/N Attendance Streak
1   1   1     Monday, Apr 6 HOU   H CLE      W 2  0  NA    1-0    1 Tied  Keuchel        Kluber Gregerson 2:30   N      43753      1
2   2   2  Wednesday, Apr 8 HOU   H CLE      L 0  2  NA    1-1    3  0.5 Carrasco       Feldman     Allen 2:40   N      23078     -1
3   3   3   Thursday, Apr 9 HOU   H CLE      L 1  5  NA    1-2    4  1.0    Bauer Wojciechowski           3:08   D      22593     -2
4   4   4    Friday, Apr 10 HOU   A TEX      W 5  1  NA    2-2    2  0.5   McHugh       Holland           2:45   D      48885      1
5   5   5  Saturday, Apr 11 HOU   A TEX      L 2  6  NA    2-3    3  0.5 Gallardo     Hernandez           3:18   N      36833     -1
6   6   6    Sunday, Apr 12 HOU   A TEX      W 6  4  14    3-3    1 Tied   Harris       Verrett    Deduno 4:24   D      35276      1
7   7   7    Monday, Apr 13 HOU   H OAK      L 1  8  NA    3-4    2  0.5   Kazmir       Feldman           2:51   N      19279     -1
8   8   8   Tuesday, Apr 14 HOU   H OAK      L 0  4  NA    3-5    3  1.5 Graveman       Peacock           2:58   N      18935     -2
9   9   9 Wednesday, Apr 15 HOU   H OAK      W 6  1  NA    4-5    2  0.5   McHugh      Pomeranz           2:42   N      19777      1
10 10  10    Friday, Apr 17 HOU   H LAA      L 3  6  NA    4-6    4  1.0    Ramos        Qualls    Street 2:57   N      22660     -1
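
Once the results are in a data frame, summaries are easy to build. Here is a minimal sketch that rebuilds the running record and the run differential from the Result, R and RA columns. The values are typed in by hand from the ten games shown above so that the example runs without hitting the website. Matching on the first letter of Result is an assumption on my part, in case a result carries a suffix.

```r
# First ten Houston games of 2015, copied from the output above
games <- data.frame(
  Result = c("W", "L", "L", "W", "L", "W", "L", "L", "W", "L"),
  R      = c(2, 0, 1, 5, 2, 6, 1, 0, 6, 3),
  RA     = c(0, 2, 5, 1, 6, 4, 8, 4, 1, 6),
  stringsAsFactors = FALSE
)

# Flag wins and losses by the first letter of the result
is_win  <- startsWith(games$Result, "W")
is_loss <- startsWith(games$Result, "L")

# Running record after each game, in the same "W-L" form as the Record column
games$Record <- paste(cumsum(is_win), cumsum(is_loss), sep = "-")

# Simple totals over the stretch
wins     <- sum(is_win)
losses   <- sum(is_loss)
run_diff <- sum(games$R) - sum(games$RA)

print(games)
cat("Record:", wins, "-", losses, " Run differential:", run_diff, "\n")
```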

Finally, it’s fairly easy to get player performance data for many standard splits, such as by month or by pitcher handedness. But we may want to grab information over a very specific time frame; say, batter performance from August 10, 2015 through the end of the 2015 season. Without access to a game-by-game database this would be impossible, or just incredibly time-consuming if you wanted to compile it by hand.

The daily_batter_bref() function makes this very simple. All you need to pass to the function is the first and last date you are interested in. The function will then pull batter performance only over this time frame from Baseball-Reference (the first six records are shown below):

> x <- daily_batter_bref("2015-08-10", "2015-10-04")
> head(x)
           Name Age  Level          Team  G  PA  AB  R  H X1B X2B X3B HR RBI BB IBB SO HBP SH SF
1 Shin-Soo Choo  32 MLB-AL         Texas 52 237 191 45 66  46  11   1  8  32 37   1 46   7  1  1
2 Manny Machado  22 MLB-AL     Baltimore 52 234 205 31 52  32   9   0 11  29 26   1 40   2  0  1
3    Adam Eaton  26 MLB-AL       Chicago 50 230 203 31 66  51   9   1  5  29 18   1 45   5  2  2
4  Kole Calhoun  27 MLB-AL   Los Angeles 52 229 213 27 46  31   4   1 10  23 12   0 61   3  0  1
5  Mookie Betts  22 MLB-AL        Boston 48 228 209 40 71  44  17   2  8  29 16   0 31   1  0  2
6    Matt Duffy  24 MLB-NL San Francisco 51 227 212 29 58  46   8   1  3  26 14   0 33   0  0  1
  GDP SB CS    BA   OBP   SLG   OPS
1   1  2  0 0.346 0.466 0.539 1.005
2   5  5  3 0.254 0.342 0.459 0.800
3   1  7  4 0.325 0.390 0.453 0.844
4   2  0  0 0.216 0.266 0.385 0.651
5   1  8  2 0.340 0.386 0.555 0.941
6   8  7  0 0.274 0.317 0.363 0.680
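
The rate stats at the end of each row follow directly from the counting stats, which is worth checking whenever you scrape data. The sketch below recomputes BA, OBP, SLG and OPS for the first three players using the standard definitions. The counting stats are typed in from the output above, and the column names mirror the ones shown there.

```r
# Counting stats for the first three rows of the output above
bat <- data.frame(
  Name = c("Shin-Soo Choo", "Manny Machado", "Adam Eaton"),
  AB  = c(191, 205, 203),
  H   = c(66, 52, 66),
  X1B = c(46, 32, 51),
  X2B = c(11, 9, 9),
  X3B = c(1, 0, 1),
  HR  = c(8, 11, 5),
  BB  = c(37, 26, 18),
  HBP = c(7, 2, 5),
  SF  = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# Total bases: one per single, two per double, and so on
total_bases <- bat$X1B + 2 * bat$X2B + 3 * bat$X3B + 4 * bat$HR

ba  <- bat$H / bat$AB
obp <- (bat$H + bat$BB + bat$HBP) / (bat$AB + bat$BB + bat$HBP + bat$SF)
slg <- total_bases / bat$AB

# OPS is summed before rounding so it is not thrown off by rounding error
rates <- data.frame(
  Name = bat$Name,
  BA   = round(ba, 3),
  OBP  = round(obp, 3),
  SLG  = round(slg, 3),
  OPS  = round(obp + slg, 3)
)

print(rates)
```

The recomputed values agree with the BA, OBP, SLG and OPS columns printed above for these three players.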

Metric Calculation

FanGraphs and Baseball-Reference do the hard work of calculating some of the most commonly used advanced metrics for visitors. However, there are times when you might want to calculate some of these metrics yourself.

Let’s take our last example, where you have data over a very specific time frame. FanGraphs doesn’t produce wOBA or wRC+ for custom time frames, but there is nothing stopping you from calculating statistics like these as long as you have the basic data.

The function below will (eventually) calculate wOBA, wRC, and wRC+ for any player over any timeframe, so long as you feed it the requisite data. For now, the function will only calculate wOBA (hey, I’m working on it).

As an example, let’s say you want to know the wOBA for players from August 10, 2015, through the end of the regular season. It’s a snap as long as you have the data in the right format. We can just feed the woba_plus() function the data we just scraped. Here I am just showing the top 15 players by wOBA among those with more than 100 plate appearances:

> x <- daily_batter_bref("2015-08-10", "2015-10-04")
> df <- woba_plus(x)
> filter(df, PA > 100) %>% arrange(desc(wOBA)) %>% select(Name, wOBA) %>% head(15)
                  Name  wOBA
1    Edwin Encarnacion 0.492
2          David Ortiz 0.470
3           Joey Votto 0.463
4         Bryce Harper 0.459
5          Chris Davis 0.443
6        Shin-Soo Choo 0.439
7     Francisco Lindor 0.435
8   Franklin Gutierrez 0.431
9        Jose Bautista 0.426
10      Ryan Zimmerman 0.424
11        Corey Seager 0.421
12          Mike Trout 0.415
13      Starlin Castro 0.413
14        A.J. Pollock 0.412
15      Mike Moustakas 0.412
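
For readers curious about what sits underneath a calculation like this, here is a minimal sketch of the usual linear-weights form of wOBA. It is not the code inside woba_plus(). The helper name calc_woba() is hypothetical, and the 2015 weights are values I am assuming from the FanGraphs constants table, so check them before relying on them. Because the package's weights and denominator may differ, the result need not match the output above.

```r
# Linear weights for 2015. Assumed values: verify against the FanGraphs
# constants table before using them for real work.
weights_2015 <- c(wBB = 0.687, wHBP = 0.718, w1B = 0.881,
                  w2B = 1.256, w3B = 1.594, wHR = 2.065)

# Hypothetical helper (not the package's woba_plus()):
# weighted offensive events divided by AB + unintentional BB + SF + HBP
calc_woba <- function(d, w) {
  ubb <- d$BB - d$IBB  # unintentional walks
  num <- w[["wBB"]] * ubb +
    w[["wHBP"]] * d$HBP +
    w[["w1B"]] * d$X1B +
    w[["w2B"]] * d$X2B +
    w[["w3B"]] * d$X3B +
    w[["wHR"]] * d$HR
  den <- d$AB + ubb + d$SF + d$HBP
  num / den
}

# Shin-Soo Choo's counting stats, copied from the daily_batter_bref() output
choo <- data.frame(AB = 191, X1B = 46, X2B = 11, X3B = 1, HR = 8,
                   BB = 37, IBB = 1, HBP = 7, SF = 1)

woba <- calc_woba(choo, weights_2015)
print(round(woba, 3))
```

Passing the weights in as an argument, rather than hard-coding them, makes it straightforward to swap in another season's constants.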

I am also planning to include functions that will calculate some of the custom metrics that I have developed and co-developed over the years. Take team consistency, for example. If someone wants to know how consistent each team was in terms of their run scoring and run prevention in 2015 they can easily calculate that with the team_consistency() function:

> team_consistency(2015)
   Team Con_R Con_RA Con_R_Ptile Con_RA_Ptile
1   ARI  0.37   0.36          22           15
2   ATL  0.41   0.40          87           67
3   BAL  0.40   0.38          70           42
4   BOS  0.39   0.40          52           67
5   CHC  0.38   0.41          33           88
6   CHW  0.39   0.40          52           67
7   CIN  0.41   0.36          87           15
8   CLE  0.41   0.40          87           67
9   COL  0.35   0.34           7            3
10  DET  0.39   0.38          52           42
11  HOU  0.39   0.36          52           15
12  KCR  0.37   0.39          22           50
13  LAA  0.40   0.38          70           42
14  LAD  0.37   0.43          22           98
15  MIA  0.41   0.37          87           30
16  MIL  0.40   0.36          70           15
17  MIN  0.38   0.41          33           88
18  NYM  0.41   0.40          87           67
19  NYY  0.41   0.38          87           42
20  OAK  0.38   0.41          33           88
21  PHI  0.39   0.37          52           30
22  PIT  0.39   0.36          52           15
23  SDP  0.42   0.36         100           15
24  SEA  0.35   0.41           7           88
25  SFG  0.39   0.40          52           67
26  STL  0.37   0.43          22           98
27  TBR  0.36   0.40          13           67
28  TEX  0.39   0.40          52           67
29  TOR  0.35   0.37           7           30
30  WSN  0.41   0.40          87           67
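
The percentile columns can be reproduced from the raw consistency scores. In the sketch below, the rule of average rank divided by the number of teams is inferred from the printed output, not taken from the package source, so treat it as an assumption. The Con_R values are typed in from the table above.

```r
# Team codes and Con_R values copied from the team_consistency(2015) output
con <- data.frame(
  Team = c("ARI", "ATL", "BAL", "BOS", "CHC", "CHW", "CIN", "CLE", "COL", "DET",
           "HOU", "KCR", "LAA", "LAD", "MIA", "MIL", "MIN", "NYM", "NYY", "OAK",
           "PHI", "PIT", "SDP", "SEA", "SFG", "STL", "TBR", "TEX", "TOR", "WSN"),
  Con_R = c(0.37, 0.41, 0.40, 0.39, 0.38, 0.39, 0.41, 0.41, 0.35, 0.39,
            0.39, 0.37, 0.40, 0.37, 0.41, 0.40, 0.38, 0.41, 0.41, 0.38,
            0.39, 0.39, 0.42, 0.35, 0.39, 0.37, 0.36, 0.39, 0.35, 0.41),
  stringsAsFactors = FALSE
)

# Percentile as average rank over number of teams, scaled to 0-100.
# Tied teams share the mean of the ranks they occupy.
n <- nrow(con)
con$Con_R_Ptile <- round(rank(con$Con_R, ties.method = "average") / n * 100)

# Show the teams from most to least consistent in run scoring
print(con[order(-con$Con_R), ])
```

With this rule the computed column agrees with the Con_R_Ptile values printed above, for example 100 for SDP and 7 for COL, SEA and TOR.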

You can play with the individual functions, or install the development version of the package using devtools. See here for instructions.

Next Steps

All of the development can be tracked on GitHub, including the development version of the package. My plan is to flesh out additional data acquisition functions, largely through existing application programming interfaces (APIs) or scraping of websites. Additional metrics will be added, specifically the ability to calculate things like wOBA on contact, wOBA per pitch based on PITCHf/x data, Edge% from PITCHf/x data, and individual player consistency/volatility. I am also toying with some visualization functions, but more on those later.

Feel free to send suggestions or requests along, especially any feedback on the draft versions of the functions (which will be housed here). I can’t promise I will be able to incorporate all of them (or even most of them), but I will certainly do what I can.


Bill leads Predictive Modeling and Data Science consulting at Gallup. In his free time, he writes for The Hardball Times, speaks about baseball research and analytics, has consulted for a Major League Baseball team, and has appeared on MLB Network's Clubhouse Confidential as well as several MLB-produced documentaries. He is also the creator of the baseballr package for the R programming language. Along with Jeff Zimmerman, he won the 2013 SABR Analytics Research Award for Contemporary Analysis. Follow him on Twitter @BillPetti.
17 Comments
Micah Daley-Harris
8 years ago

You may already be planning on doing this but I’d love if you could include count splits.

Micah Daley-Harris
8 years ago
Reply to  Bill Petti

Yes

Martin
8 years ago

Awesome package! I was just thinking on working on similar functions that do exactly this for a piece I am planning to write. You’ve saved me a lot of time! Keep it up.

jvetter
8 years ago

Great work! Very excited to see this come to fruition!

Matt K
8 years ago

You are doing the Lord’s work.

Jirge
8 years ago

I could kiss you, if you were into that sort of thing.

obsessivegiantscompulsive
8 years ago

Good timing! I’m starting to learn R for work, and I can actually recognize some of the command lines above! Exciting! Look forward to playing around with this package.

blodgsion
8 years ago

Out of curiosity, did you consider any of the Python/Anaconda stack?

tz
8 years ago

Thanks for nothing Bill! Now I HAVE to get off my butt and finally learn R.

an Rlifer
8 years ago

This requires a package? These functions are seriously easy.

Rylan
8 years ago

I love this! Thank you for doing this – really cool.

This may be tough, but I often have trouble finding batted ball data by pitch type and splits. Also, pitch framing and defensive shift data would be awesome to have.

Tony
8 years ago

The timing is impeccable. I focused this weekend on trying to put my programming chops into actually getting into computational baseball analysis – i.e., learning R, finally, looking at ways at acquiring data from various sources, and just seeing what could be done with a little creativity, the data and the tools to question the data.

Thank you for putting this together and for posting this.

Brent
8 years ago

I am pretty R comfortable but o/w completely computer illiterate, so forgive me if this is a simple question, but I have been trying to scrape the Zips/Fans/Steamer projections into R from fg with no success. Of course getting the data in R once is trivial but it would be great to be able to scrape the ROS projections daily. Friends have pointed me to tutorials/blogposts on APIs and scraping but they are 1) all in python and 2) seem to require knowledge of JAVA or the specific website in question. I think this might be an interesting add. Regardless, heading to devtools to download now…

Andrea
8 years ago

I LOVE your tools and really appreciate you. The first time I recall seeing your name was at the Tableau Website seeking a little baseball data viz…That was good stuff, and ALL THESE NEW TOOLS are like Christmas Day for me…
Thank you !!!
I am a fan,
Andrea

Dylan
8 years ago

It would be great to see the visualization options implemented in ggplot2 and/or Shiny. Great start!