Late in 2015 I wrote a piece here at The Hardball Times that walked through some of my favorite R packages for gathering and analyzing baseball data. As with most tools, no single package has everything I need, nor should it. Following that article, I started collecting various functions that I’ve written and routinely use and decided to compile them in a formal package that anyone can easily load and use.
I’ve never written an R package before, so this is partly an excuse for me to learn a new skill. That means the development of the package will be slow and have its fair share of bumps along the way. I thought I would share some initial views of the kind of functions I plan to include.
I featured a number of packages in my previous article that focused on grabbing data, whether it was full season data for individual players and teams, or pitch-by-pitch data. However, right now there isn’t a package that makes it easy to pull real-time data on players during the season. The Lahman package is great, but that database is only updated once a year after the season. As a writer at The Hardball Times I have direct access to our database, but not everyone does. FanGraphs makes it very easy to download leaderboards in CSV format that include dozens of statistics for players updated daily, but there isn’t an easy way to grab that data from within R.
I am creating a series of functions that do just that.
First, there will be functions for downloading some of the most useful reference tables: FanGraphs’ historical park factors and the yearly constants and coefficients for calculating things like wOBA or FIP.
Here’s an example of the park factors function for the 2010 season. All you need to do is include a year when you call the fg_park() function and you get the following output:
Sometimes I like to pull team data such as their schedule and record (which is very helpful for my “team consistency” work). Baseball-Reference is the easiest site to acquire this from, so I created a function that allows you to specify the team and year and get back detailed information about the outcome of each of their games.
Using the team_results_bref() function, here’s what the first 10 games of Houston’s 2015 schedule and results would look like:
Finally, it’s fairly easy to get player performance data for many standard splits, such as by month or by pitcher handedness. But we may want to grab information over a very specific time frame; say, batter performance from August 10, 2015 through the end of the 2015 season. Without access to a game-by-game database this would be impossible, or just incredibly time-consuming if you wanted to compile it by hand.
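To make the by-hand version concrete, here is a minimal base R sketch of the idea: filter a game log to a date window and total each batter’s counting stats. The game_log data frame, its column names, the numbers in it, and the window_totals() helper are all made up for illustration; this is not how any function in the package works internally.

```r
# Toy game log: one row per player per game (made-up numbers, not real data)
game_log <- data.frame(
  player = c("A", "A", "A", "B", "B", "B"),
  date   = as.Date(c("2015-08-05", "2015-08-12", "2015-09-20",
                     "2015-08-09", "2015-08-10", "2015-10-04")),
  AB     = c(4, 3, 5, 4, 4, 3),
  H      = c(1, 2, 3, 0, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total AB and H per player inside a date window.
# Both endpoints are inclusive.
window_totals <- function(log, start, end) {
  in_window <- log$date >= as.Date(start) & log$date <= as.Date(end)
  totals <- aggregate(cbind(AB, H) ~ player, data = log[in_window, ], FUN = sum)
  totals$AVG <- round(totals$H / totals$AB, 3)
  totals
}

late_2015 <- window_totals(game_log, "2015-08-10", "2015-10-04")
print(late_2015)
```

Games before August 10 drop out of the totals. The hard part in practice is not this arithmetic but getting a complete game log in the first place.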
The daily_batter_bref() function makes this very simple. All you need to pass to the function is the first and last date you are interested in. The function will then pull batter performance only over this time frame from Baseball-Reference (the first six records are shown below):
FanGraphs and Baseball-Reference do the hard work of calculating some of the most commonly used advanced metrics for visitors. However, there are times when you might want to calculate some of these metrics yourself.
Let’s take our last example, where you have data over a very specific time frame. FanGraphs doesn’t produce wOBA or wRC+ for custom time frames, but there is nothing stopping you from calculating statistics like these as long as you have the basic data.
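For readers who want to see the arithmetic, here is a rough base R sketch of the standard wOBA formula: weighted unintentional walks, hit-by-pitches, and hits by type, divided by AB + BB - IBB + SF + HBP. This is only an illustration, not the package’s own code. The woba_sketch() name, the column names (X2B and X3B are what R typically makes of "2B" and "3B" headers), and the toy batter are assumptions, and the weights are hand-typed placeholders in the neighborhood of the 2015 constants, so pull the real ones from the FanGraphs constants table before relying on them.

```r
# Placeholder linear weights, roughly the 2015 values (an assumption;
# check the published constants before using these for real work)
weights <- c(wBB = 0.687, wHBP = 0.718, w1B = 0.881,
             w2B = 1.256, w3B = 1.594, wHR = 2.065)

# Hypothetical helper: wOBA for each row of a data frame of batting totals.
# Expects columns AB, H, X2B, X3B, HR, BB, IBB, HBP, SF.
woba_sketch <- function(df, w) {
  singles <- df$H - df$X2B - df$X3B - df$HR   # hits that were not extra-base hits
  ubb     <- df$BB - df$IBB                   # unintentional walks only
  numerator <- w[["wBB"]] * ubb +
    w[["wHBP"]] * df$HBP +
    w[["w1B"]] * singles +
    w[["w2B"]] * df$X2B +
    w[["w3B"]] * df$X3B +
    w[["wHR"]] * df$HR
  denominator <- df$AB + df$BB - df$IBB + df$SF + df$HBP
  numerator / denominator
}

# One made-up batter line to exercise the function
toy <- data.frame(AB = 100, H = 30, X2B = 6, X3B = 1, HR = 5,
                  BB = 12, IBB = 2, HBP = 1, SF = 2)
toy$wOBA <- woba_sketch(toy, weights)
print(round(toy$wOBA, 3))
```

Because the weights are passed in as an argument, the same function works for any season once you swap in that year’s constants.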
The woba_plus() function will (eventually) calculate wOBA, wRC, and wRC+ for any player over any time frame, so long as you feed it the requisite data. For now, the function will only calculate wOBA (hey, I’m working on it).
As an example, let’s say you want to know the wOBA for players from August 10, 2015, through the end of the regular season. It’s a snap as long as you have the data in the right format. We can just feed the woba_plus() function the data we just scraped. Here I am just showing the top 15 players by their wOBA:
I am also planning to include functions that will calculate some of the custom metrics that I have developed and co-developed over the years. Take team consistency, for example. If someone wants to know how consistent each team was in terms of their run scoring and run prevention in 2015 they can easily calculate that with the team_consistency() function:
You can play with the individual functions, or install the development version of the package using devtools. See here for instructions.
All of the development can be tracked on GitHub, including the development version of the package. My plan is to flesh out additional data acquisition functions largely through existing application programming interfaces (APIs) or scraping of websites. Additional metrics will be added, specifically the ability to calculate things like wOBA on contact, wOBA per pitch based on PITCHf/x data, Edge% from PITCHf/x data, and individual player consistency/volatility. I am also toying with some visualization functions, but more on those later.
Feel free to send suggestions or requests along, especially any feedback on the draft versions of the functions (which will be housed here). I can’t promise I will be able to incorporate all of them (or even most of them), but I will certainly do what I can.