December 8, 2013

THT Essentials:
 Fangraphs Player Search:

Get It Now!

The tenth Hardball Times Annual is now available. It's got 300 pages of articles, commentary and even a crossword puzzle. You can buy the Annual at Amazon, for your Kindle or on our own page (which helps us the most financially). However you buy it, enjoy!

And here's the full roster.

THT's latest e-book

Third Base: The Crossroads is THT's new e-book, available for \$3.99 from the Kindle store. The good news is that anyone can read a Kindle book, even on a PC. So enjoy the best from THT in a new format.

Get your very own THT merchandise from our CafePress store. We've got baseball caps, t-shirts, coffee mugs and even wall clocks with the classy THT logo prominently displayed. Also, check out the THT Bookstore. Please support your favorite baseball site by purchasing something today.

Search THT:

Or you can search by:

 Roll mouse over date for entries THT Live Calendar << December 2013 >> S M T W T F S 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Importing data into R

Posted by Ricky Zanker
Now that you have finished downloading and installing the R software, it is time to try it out.

There are no automatic functions in the R program. So all the data and functions have to be input manually using R code. The data can be entered by creating vectors. A vector can hold data by designating a name for it (the word "data" for example) followed by an angle bracket and then a hyphen. What follows depends on the data importation method. To enter data manually, use the letter c followed by a bracket then a comma-divided row of words or numbers and closed again by a bracket. Here is an example.

data<-c(1,2,3,4)
data<-c("AB","H","R","Albert Pujols")

These are great for inputing a small amount of data. For larger amounts of data, you can import it either from an Excel spreadsheet, CSV file, or a MySQL database. There are other ways to import data from outside R, but these are the three easiest (and the only ones I know). Let's start with Excel since that will be the most common method for those without a database. For R to grab data from an Excel sheet, it needs the RODBC package installed. To do that, click on the Packages extension on the top file bar in R. Then scroll down and click on "Install package(s)...". Then pick a CRAN mirror close to you location. When the window of the list of packages comes up, scroll down to RODBC and click it. If you are using Windows Vista or 7, you need to have the R program running in Administration mode when originally opening the program. I am not sure about other Operating Systems but if the package doesn't install, you will probably need to run the program or OS in administration mode. When the package does install, use the following template to upload your data.

library(RODBC)
channel <- odbcConnectExcel("c:/baseball/baseballdata.xlsx")
data <- sqlFetch(channel, "Sheet1")
odbcClose(channel)

The file location and sheet name will change depending on your specific Excel file. If you don't have the Excel software, you won't be able to use this method even if the file is saved as an Excel file. You would need to save it as a CSV file using whatever spreadsheet software you have. Now to import data from a CSV file, use the following template.

These tables need to have columns with a header in order for the data transfer to work. Also, R isn't too friendly with missing values so make sure that every row has some data in it. A zero or NA should suffice.

And for those with a MySQL database, you can follow these instructions by Jeff Zimmerman on how to connect R with MySQL. The template to for database importation follows like this:

data <- sqlQuery(channel,"query;")

Another method is to create data through a function within R. Those functions will be discussed in later lessons when applicable.

Now that you have loaded up your data into R, it is time to play with it. To view the imported data, just type in the name of the data vector. If you want to use multiple vectors at one time, you would have to give them different names. These names could include numbers, e.g., "data1". If your data vector has multiple columns, you can single out individual columns by typing the vector name followed by a \$ and the title of the column. A quick statistical summary of the columns can be produced by the summary( function.

If you don't enter something correctly, you will normally get an error message after you enter the code. Like in Command Prompt, you can press the up key to get the previous line typed for quick re-edits. A basic scatter plot can be made using the function plot( followed by the data points you want to graph.

plot(data\$'HR/FL',data\$ISO)

If you have multiple columns, R will produce multiple graphs.

More detailed instructions on how to plot graphs will be the subject of the next tutorial.

Tips: (These will follow every article written)
-R is basically a case sensitive interface. So if a file or function needs to be capitalized, capitalize it.
-Use Notepad or some other word processor to save your R code. Always save a copy of code if you want to use it that code again. You will likely forget how to use it and keeping code allows you to use it as a guideline.
-Run in administration mode for Windows in order to install packages.
-R doesn't not like missing values, especially if you want to plot the data. Make sure your data has a value for each row and column.
-There are great instructions to build a MySQL database for PITCHf/x written by Mike Fast, Jeff Zimmerman, Nick Steiner, Josh Hermsmeyer and Sean Smith.
-Raw PITCHf/x data can be collected online at Joe Lefkowitz's site.
-Instructions to build a Baseball Databank MySQL database can be found here at The Hardball Times by Colin Wyers.

Aaron said...

could you make a section on the site that has a link to this whole R tutorial? that way it can be referenced more easily in the future

Posted 12/17  at  04:59 AM
Millsy said...

Good idea with the R-tutorials, Ricky.  I actually just started doing something similar over at my site (click URL in my name) last week.

The RSS feed is going through R-Bloggers as well, so hopefully I can get some good feedback/learn new things from those more R-savvy than I.

Posted 12/17  at  12:10 PM
Millsy said...

Guess the url didn’t work:

Posted 12/17  at  12:29 PM
Millsy said...

(Noticed that you had commented on that article before as rz (right?), so my bad).

BTW, great tip here:

“Use Notepad or some other word processor to save your R code”

This is the single most important thing that new R users forget to do.

You can open a ‘Script’ directly in R and save it as well.  Using this, you can put the cursor on the line of code you want to run, and press “CTRL + R” and it will run it in the Console window.

Posted 12/17  at  01:31 PM
Josh said...

The R gui rkward also helps a lot with stuff like this. You don’t have to open and save files using code in rkward.

Posted 12/17  at  01:38 PM
wcw said...

“no automatic functions”
> chooseCRANmirror()
> install.packages(‘Rcmdr’)
> library(Rcmdr)

R-Commander is hardly a perfect menu system, but it’s a nice training-wheel helper until you get comfortable banging out scripts by hand.

For data import issues, there is a good tutorial at http://cran.r-project.org/doc/manuals/R-data.html  For large data sets, ‘read.csv’ is prohibitively slow.  You will want to use ‘scan’ directly.

If you’re building a Sql db, I strongly recommend Postgres, because onto Postgres alone are you able to bolt pl/r—which will let you use R from within your Sql environment.  See http://www.joeconway.com/plr/

Posted 12/17  at  02:11 PM
chris c said...

1) missing values: my experience is that as long as you rows of data are properly delimited, R is actually pretty intelligent about reading in missing values. you can tinker with the ‘na.strings’ and ‘encoding’ parameters in those functions to get it right.

2) column heading names: you don’t actually NEED column headings for the import to work. if you change the ‘header’ parameter to FALSE, R will read in the data and use an automatic generic naming scheme to name the rows. you can the name the columns using the names() function:

# don’t use “data” to name an R object because it is reserved as a function name
names(baseball) = c(“AB”, “H”, “R”, “HR”)

this won’t work if you try to use a vector of names that isn’t the same length as the number of columns you have. you can use dim() to find out:

# this gives the dimensions of your data frame in as rows, columns
dim(baseball)

3) the summary() function is helpful in getting an idea of the statistical distributions of the columns, and also an easy way to spot if your data has been read incorrectly”

summary(baseball)

Posted 12/17  at  06:09 PM
Millsy said...

For checking how the data is read in, I would also suggest using:

and

tail(baseball)

This will show you the first 6 and last 6 rows of data and show all of your variable names.

Posted 12/17  at  06:12 PM
RZ said...