Useful R Packages

Posted on July 28, 2013

I recently started using R for data analysis. Previously, I’ve used Hadoop/MapReduce for truly “big data” jobs, SQL on Teradata for more structured data, and Python/pandas for anything I could fit into main memory.

Initial thoughts: stock pandas seems qualitatively faster than stock R, but it’s not hard to make R speedier.1 A big plus for R is the wealth of libraries available for data analysis. Here’s a roundup of the most useful R packages/idioms I’ve come across so far to load, slice and analyze data.

Importing more than ~100MB of data with sqldf

The canonical way to load data into R (say, from a csv file) is with the read.table function. Unfortunately, read.table can be quite slow for data files larger than a few hundred megabytes. Enter sqldf. I frequently use this simple drop-in solution for reading a gigabyte-sized csv file into a data.frame:2

library(sqldf)  
f <- file("my-large-csv-file.csv")  
df <- sqldf("select * from f", dbname = tempfile(),  
            file.format = list(header = TRUE, row.names = FALSE))
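
sqldf also provides a read.csv.sql convenience wrapper that bundles these same steps into one call. If I have the defaults right, the csv file is exposed to the query as a table named file:

```r
library(sqldf)

# Equivalent one-liner: reads the csv through SQLite, never
# materializing it in R until the query result comes back
df <- read.csv.sql("my-large-csv-file.csv",
                   sql = "select * from file")
```

The sql argument also lets you filter rows or select columns at load time, which helps when even the raw file is too big to hold comfortably in memory.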

Split-apply-combine with plyr

Split-apply-combine is a very common computational paradigm for data analysis.3 (It is actually quite similar to map/reduce.) The plyr package makes these operations very easy.

To illustrate this, let’s look at the esoph dataset, which relates esophageal cancer to age and alcohol and tobacco usage:
> # number of esophageal cancer cases vs non-cancer controls  
> # by age/alcohol/tobacco usage  
> head(esoph)   
  agegp     alcgp    tobgp ncases ncontrols  
1 25-34 0-39g/day 0-9g/day      0        40  
2 25-34 0-39g/day    10-19      0        10  
3 25-34 0-39g/day    20-29      0         6  
4 25-34 0-39g/day      30+      0         5  
5 25-34     40-79 0-9g/day      0        27  
6 25-34     40-79    10-19      0         7  
Let’s use ddply to figure out how the proportion of cancer cases changes with tobacco usage. First, we need a function to compute the cancer proportion (strictly, the ratio of cases to controls in each group):
> CancerProportion <- function(df) {  
+   cancer.prop <- sum(df$ncases) / sum(df$ncontrols)  
+   data.frame(cancer.prop=cancer.prop)  
+ }
ddply splits a data.frame into groups (i.e., subsets of the original data.frame), applies a function to each group, and then combines the results back into a single data.frame:
> library(plyr)  
> ddply(esoph,  
+ .(tobgp),  # SPLIT by the tobgp column  
+ CancerProportion) # APPLY this function to each tobacco group  
     tobgp cancer.prop  
1 0-9g/day   0.1485714  
2    10-19   0.2457627  
3    20-29   0.2500000  
4      30+   0.3780488  

So tobacco consumption is positively correlated with esophageal cancer.
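
For one-line summaries like this, plyr also ships a summarise helper, so you can skip defining a separate function. This sketch should reproduce the table above:

```r
library(plyr)

# Same split-apply-combine, with the summary expression given inline:
# split by tobgp, compute the cases/controls ratio for each group
ddply(esoph, .(tobgp), summarise,
      cancer.prop = sum(ncases) / sum(ncontrols))
```

A named function still pays off once the per-group computation grows beyond a line or two, or when you want to reuse it across groupings (as we do below).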

Structured plotting with ggplot2

One of my favorite parts of R is the plotting package ggplot2. It is based on a “grammar of graphics”; you could say that ggplot2 is a domain-specific language for plotting. Not only does this make it easy to produce compelling plots with a minimum of keystrokes, but it also provides a rigorous framework for reasoning about plots. A great resource for this is, of course, the ggplot2 book.

Here’s a teaser example:
> # Calculate the cancer proportion by alcohol and tobacco usage  
> alc.tob.cancer <- ddply(esoph, .(alcgp, tobgp), CancerProportion)  
> head(alc.tob.cancer)  
      alcgp    tobgp cancer.prop  
1 0-39g/day 0-9g/day  0.03448276  
2 0-39g/day    10-19  0.11904762  
3 0-39g/day    20-29  0.11904762  
4 0-39g/day      30+  0.17857143  
5     40-79 0-9g/day  0.18994413  
6     40-79    10-19  0.20000000  
> library(ggplot2)
> my.plot <- qplot(x=alcgp, y=tobgp, fill=log(cancer.prop),  
+                  geom='tile', data=alc.tob.cancer)  
> my.plot + labs(x='Alcohol consumption', y='Tobacco Consumption',  
+                title='Cancer by Alcohol and Tobacco')
Made with ggplot2
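
For comparison, here is the same plot in full ggplot() syntax, which makes the grammar explicit: a data.frame, a set of aesthetic mappings, and a geom layer on top:

```r
library(ggplot2)

# Identical heatmap, spelled out as data + aesthetics + layer
ggplot(alc.tob.cancer, aes(x = alcgp, y = tobgp, fill = log(cancer.prop))) +
  geom_tile() +
  labs(x = 'Alcohol consumption', y = 'Tobacco Consumption',
       title = 'Cancer by Alcohol and Tobacco')
```

qplot is handy at the console, but the layered form scales better once you start adding facets, scales, or multiple geoms.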

Workflow

So there you are: load data with sqldf; slice, summarize and analyze it with plyr; plot with ggplot2.

In terms of code structure, I’ve found separating each project into load.R, clean.R, func.R and do.R files to be a nice starting point.4 I often also have a plot.R file.
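
With that layout, the top-level script just sources the other files in order. A minimal sketch of the convention (file names as above, contents assumed):

```r
# do.R: run the whole analysis end to end
source("load.R")   # read raw data into data.frames (e.g. via sqldf)
source("clean.R")  # fix types, recode factors, handle missing values
source("func.R")   # define helpers like CancerProportion
source("plot.R")   # generate figures with ggplot2
```

The nice property is that the slow load/clean steps are isolated: while iterating on the analysis you can re-source only func.R and plot.R against the data already in your workspace.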

Finally, a parting note on dev environments. I’m partial to Emacs, and highly recommend its R integration via ESS (Emacs Speaks Statistics). If you’re unfamiliar with Emacs, RStudio is a nice standalone IDE that will probably be an easier way to get started.