Building an R Hadoop System

The information provided in this page might be out-of-date. Please see a newer version at Step-by-Step Guide to Setting Up an R-Hadoop System.

This page shows how to build an R Hadoop system, presenting the steps I took to set up my first R Hadoop system in single-node mode on Mac OS X.

After reading documents and tutorials on MapReduce and Hadoop and playing with RHadoop for about two weeks, I finally built my first R Hadoop system and successfully ran some R examples on it. Here I'd like to share my experience and the steps to achieve that. Hopefully it will make it easier to try RHadoop for R users who are new to Hadoop. Note that I tried this on Mac only, and some steps might be different on Windows.

Before going through the complex steps below, let's have a look at what you can get, as motivation to continue. There is a video demo showing the word count MapReduce job running in R.

Now let’s start.

1. Install Hadoop

1.1 Download Hadoop

Download Hadoop (hadoop-1.1.2-bin.tar.gz) from the Apache Hadoop website and then unpack it.


1.2 Set JAVA_HOME

In conf/hadoop-env.sh, add the line below:

export JAVA_HOME=/Library/Java/Home

1.3 Set up Remote Login and Enable Self-Login

Go to System Preferences > Sharing (under Internet & Network) and, under the list of services, check "Remote Login". For extra security, you can select the radio button for "Only these users" and choose the hadoop user.

Then generate an SSH key with an empty passphrase and add it to the list of authorized keys:

ssh-keygen -t rsa -P ""

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The above steps to set up Remote Login and self-login were picked up from a tutorial that provides detailed instructions on setting up Hadoop on Mac.

2. Run Hadoop

2.1 Start Hadoop

Go to the Hadoop directory:

cd hadoop-1.1.2


Start Hadoop and check whether it is running:

bin/start-all.sh

jps

The jps command should list Hadoop processes such as NameNode, DataNode, JobTracker and TaskTracker.
2.2 Test Hadoop with simple examples

Example 1. calculate pi

In the code below, the first argument (10) is the number of maps and the second (100) is the number of samples per map. A more accurate value of pi can be obtained by setting a larger value for the second argument, which in turn takes longer to run.

bin/hadoop jar hadoop-examples-*.jar pi 10 100

Example 2. word count

It first copies the files in local directory "conf" into HDFS directory "input", and then runs a distributed "grep" to look for the pattern 'dfs[a-z.]+', which matches strings starting with "dfs". Change it to 'df[a-z.]+' or 'd[a-z.]+' to get more results. The code below should return a list of matched strings and their frequencies.

bin/hadoop fs -put conf input

bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

bin/hadoop fs -get output output

cat output/*

2.3 Stop Hadoop

When you are done, stop Hadoop with:

bin/stop-all.sh
Congratulations! Now you have set up a Hadoop system. Next we will install RHadoop packages so that we can run R jobs on the Hadoop system.

3. Install R

Download R from CRAN and install it. I have successfully run R v2.15.1, v2.15.2 and v3.1 on Hadoop.

4. Install RHadoop

4.1 Install GCC

Download GCC and then install it. Without GCC, you will get the error "Make Command Not Found" when installing some R packages from source.

4.2 Install Homebrew

The current user account needs to be an administrator, or be granted administrator privileges using "su", to install Homebrew. In Terminal on Mac OS X, run the commands below.

su <administrator_account>

ruby -e "$(curl -fsSL"

brew update

brew doctor

4.3 Install packages git, pkg-config and thrift

Without Thrift, package rhbase cannot be installed.

brew install git

brew install pkg-config

brew install thrift

4.4 Set environment variables

Environment variables for Hadoop can be set with the "export" command in Terminal, or with the R function Sys.setenv() as below (adjust the paths to match your own Hadoop installation):

Sys.setenv(HADOOP_HOME="/Users/hadoop/hadoop-1.1.2")

Sys.setenv(HADOOP_CMD="/Users/hadoop/hadoop-1.1.2/bin/hadoop")

Sys.setenv(HADOOP_STREAMING="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")
4.5 Install R packages

Install some R packages that RHadoop depends on.

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))

4.6 Install RHadoop packages

Download packages rhdfs, rhbase and rmr2 from the RHadoop repository on GitHub and then run the R code below.

install.packages("<path>/rhdfs_1.0.6.tar.gz", repos = NULL, type="source")

install.packages("<path>/rhbase_1.2.0.tar.gz", repos = NULL, type="source")

install.packages("<path>/rmr2_2.2.2.tar.gz", repos = NULL, type="source")

To make sure they have been installed successfully, run library() to load the above three packages in R.
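For example, a quick check along the lines below should run without errors. Note that hdfs.init() assumes the HADOOP_CMD environment variable has been set as in step 4.4 and that Hadoop is running; this is a minimal sketch of the check, not part of the original instructions.

```r
# Load the three RHadoop packages to confirm they installed correctly.
# Assumes HADOOP_CMD and HADOOP_STREAMING are set (step 4.4)
# and that Hadoop has been started (step 2.1).
library(rhdfs)
hdfs.init()     # initialise the connection to HDFS
library(rhbase)
library(rmr2)
```

If any of the library() calls fails, re-install the corresponding package from source as in step 4.6.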

4.7 Further information

Prerequisites and installation instructions for rmr are available in the RHadoop documentation.

5. Run R jobs on Hadoop

Now you can try running an R job on Hadoop. Below is an example of R MapReduce code for word count, provided in the presentation "Using R with Hadoop" by Jeffrey Breen.



library(rmr2)

map <- function(k, lines) {

words.list <- strsplit(lines, '\\s')

words <- unlist(words.list)

return( keyval(words, 1) )

}

reduce <- function(word, counts) {

keyval(word, sum(counts))

}

wordcount <- function(input, output=NULL) {

mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)

}
## read text files from folder wordcount/data

## save result in folder wordcount/out

## Submit job

hdfs.root <- 'wordcount'

hdfs.data <- file.path(hdfs.root, 'data')

hdfs.out <- file.path(hdfs.root, 'out')

out <- wordcount(hdfs.data, hdfs.out)

## Fetch results from HDFS

results <- from.dfs(out)

results.df <- as.data.frame(results, stringsAsFactors=F)

colnames(results.df) <- c('word', 'count')


After running the above code, you should see a list of words and their counts.
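To see what the map and reduce steps compute, here is a small plain-R sketch (no Hadoop involved; the sample lines are illustrative) that mimics the same word count logic locally:

```r
# A local, Hadoop-free sketch of the word count logic:
# split lines into words (the "map" step), then sum the
# count of each distinct word (the "reduce" step).
lines <- c("hello world", "hello R")
words <- unlist(strsplit(lines, '\\s'))              # map: emit one word per token
counts <- tapply(rep(1, length(words)), words, sum)  # reduce: sum counts per word
counts[["hello"]]  # 2
```

On the cluster, rmr2 performs the same grouping and summing, but distributed across HDFS blocks and map/reduce tasks.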

More examples of R jobs on Hadoop with rmr2 can be found in the rmr2 tutorials.

All done! Now you have set up your own R Hadoop system (in single-node mode). Enjoy MapReducing with R.

6. What's Next

The above Hadoop system is in a standalone mode on a single node. You might want to try the pseudo-distributed mode. And you might even want to run it in a fully-distributed mode on a cluster of computers.

Instructions on setting up Hadoop in cluster mode are provided at Hadoop: From Single-Node Mode to Cluster Mode.

More details on setting up Hadoop in standalone mode, pseudo-distributed mode and fully-distributed mode can be found at the links below.