Friday, December 17, 2010
Importing data into R
Posted by Ricky ZankerNow that you have finished downloading and installing the R software, it is time to try it out.
There are no automatic functions in the R program. So all the data and functions have to be input manually using R code. The data can be entered by creating vectors. A vector can hold data by designating a name for it (the word "data" for example) followed by an angle bracket and then a hyphen. What follows depends on the data importation method. To enter data manually, use the letter c followed by a bracket then a comma-divided row of words or numbers and closed again by a bracket. Here is an example.
data<-c(1,2,3,4)
data<-c("AB","H","R","Albert Pujols")
These are great for inputing a small amount of data. For larger amounts of data, you can import it either from an Excel spreadsheet, CSV file, or a MySQL database. There are other ways to import data from outside R, but these are the three easiest (and the only ones I know). Let's start with Excel since that will be the most common method for those without a database. For R to grab data from an Excel sheet, it needs the RODBC package installed. To do that, click on the Packages extension on the top file bar in R. Then scroll down and click on "Install package(s)...". Then pick a CRAN mirror close to you location. When the window of the list of packages comes up, scroll down to RODBC and click it. If you are using Windows Vista or 7, you need to have the R program running in Administration mode when originally opening the program. I am not sure about other Operating Systems but if the package doesn't install, you will probably need to run the program or OS in administration mode. When the package does install, use the following template to upload your data.
library(RODBC)
channel <- odbcConnectExcel("c:/baseball/baseballdata.xlsx")
data <- sqlFetch(channel, "Sheet1")
odbcClose(channel)
The file location and sheet name will change depending on your specific Excel file. If you don't have the Excel software, you won't be able to use this method even if the file is saved as an Excel file. You would need to save it as a CSV file using whatever spreadsheet software you have. Now to import data from a CSV file, use the following template.
data <- read.csv(file='c:/baseball/baseballdata.csv',sep=',',header=T)
These tables need to have columns with a header in order for the data transfer to work. Also, R isn't too friendly with missing values so make sure that every row has some data in it. A zero or NA should suffice.
And for those with a MySQL database, you can follow these instructions by Jeff Zimmerman on how to connect R with MySQL. The template to for database importation follows like this:
data <- sqlQuery(channel,"query;")
Another method is to create data through a function within R. Those functions will be discussed in later lessons when applicable.
Now that you have loaded up your data into R, it is time to play with it. To view the imported data, just type in the name of the data vector. If you want to use multiple vectors at one time, you would have to give them different names. These names could include numbers, e.g., "data1". If your data vector has multiple columns, you can single out individual columns by typing the vector name followed by a $ and the title of the column. A quick statistical summary of the columns can be produced by the summary( function.
If you don't enter something correctly, you will normally get an error message after you enter the code. Like in Command Prompt, you can press the up key to get the previous line typed for quick re-edits. A basic scatter plot can be made using the function plot( followed by the data points you want to graph.
plot(data$'HR/FL',data$ISO)
If you have multiple columns, R will produce multiple graphs.
More detailed instructions on how to plot graphs will be the subject of the next tutorial.
Tips: (These will follow every article written)
-R is basically a case sensitive interface. So if a file or function needs to be capitalized, capitalize it.
-Use Notepad or some other word processor to save your R code. Always save a copy of code if you want to use it that code again. You will likely forget how to use it and keeping code allows you to use it as a guideline.
-Run in administration mode for Windows in order to install packages.
-R doesn't not like missing values, especially if you want to plot the data. Make sure your data has a value for each row and column.
-There are great instructions to build a MySQL database for PITCHf/x written by Mike Fast, Jeff Zimmerman, Nick Steiner, Josh Hermsmeyer and Sean Smith.
-Raw PITCHf/x data can be collected online at Joe Lefkowitz's site.
-Instructions to build a Baseball Databank MySQL database can be found here at The Hardball Times by Colin Wyers.
Ricky Zanker also writes for Draysbay and can be contacted by email.









could you make a section on the site that has a link to this whole R tutorial? that way it can be referenced more easily in the future