Building an R Hadoop System
The information provided in this page might be out-of-date. Please see a newer version at Step-by-Step Guide to Setting Up an R-Hadoop System.
This page shows how to build an R Hadoop system, presenting the steps I took to set up my first R Hadoop system in single-node mode on Mac OS X.
After reading documents and tutorials on MapReduce and Hadoop, and playing with RHadoop for about two weeks, I finally built my first R Hadoop system and successfully ran some R examples on it. Here I'd like to share my experience and the steps to achieve that. Hopefully it will make it easier for R users who are new to Hadoop to try RHadoop. Note that I tried this on Mac only, and some steps might be different on Windows.
Before going through the complex steps below, let's have a look at what you can get, as motivation to continue. There is a video showing the word-count MapReduce job in R at http://www.youtube.com/watch?v=hSrW0Iwghtw.
Now let’s start.
1. Install Hadoop
1.1 Download Hadoop
Download Hadoop (hadoop-1.1.2-bin.tar.gz) at http://hadoop.apache.org/releases.html#Download and then unpack it.
1.2 Set JAVA_HOME
In conf/hadoop-env.sh, add the line below:
export JAVA_HOME=/Library/Java/Home
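On newer versions of OS X, /Library/Java/Home may not exist. As a hedge, you can ask the system for the active JDK path instead; this is a hypothetical alternative, assuming the /usr/libexec/java_home helper is present on your machine:

```shell
# Ask OS X for the active JDK path via the java_home helper;
# fall back to the default documented above if the helper is missing.
export JAVA_HOME=$(/usr/libexec/java_home 2>/dev/null || echo /Library/Java/Home)
```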
1.3 Set up Remote Login and Enable Self-Login
Go to System Preferences > Sharing (under Internet & Network) and, in the list of services, check "Remote Login". For extra security, you can select the radio button for "Only these users" and choose the hadoop user. Then generate an SSH key and enable passwordless login to localhost:
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
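As a quick sanity check, you can verify that passwordless self-login works before continuing (a hypothetical check for a default single-node setup):

```shell
# Should log in to localhost and print "ok" without prompting for a
# password; BatchMode makes ssh fail instead of prompting if keys are wrong.
ssh -o BatchMode=yes localhost 'echo ok'
```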
The above steps to set up remote login and self-login were picked up from http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29, which provides detailed instructions for setting up Hadoop on Mac.
2. Run Hadoop
2.1 Start Hadoop
Go to the Hadoop directory and check that the hadoop command runs:
cd hadoop-1.1.2
bin/hadoop
Start Hadoop and check whether it is running:
bin/start-all.sh
jps
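If everything started correctly, jps should list the Hadoop 1.x daemons. The PIDs below are made up; this is a sketch of typical output, not an exact transcript:

```shell
jps
# Typically shows something like (PIDs will differ):
#   12345 NameNode
#   12346 DataNode
#   12347 SecondaryNameNode
#   12348 JobTracker
#   12349 TaskTracker
#   12350 Jps
```

If any daemon is missing, check the log files under the logs directory for errors.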
2.2 Test Hadoop with simple examples
Example 1. calculating pi
In the command below, the first argument (10) is the number of maps and the second (100) is the number of samples per map. A more accurate value of pi can be obtained with a larger second argument, which in turn takes longer to run.
bin/hadoop jar hadoop-examples-*.jar pi 10 100
Example 2. distributed grep
It first copies files from directory "conf" to HDFS directory "input", and then runs a distributed "grep" to look for the pattern 'dfs[a-z.]+', which matches strings starting with "dfs". Change it to 'df[a-z.]+' or 'd[a-z.]+' to get more results. The commands below should return a list of matched strings and their frequencies.
bin/hadoop fs -put conf input
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*
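To see what the pattern matches without involving Hadoop, you can try the same regular expression locally with grep; the sample strings below are made up for illustration:

```shell
# Locally demonstrate the regex used by the distributed grep job above:
# 'dfs[a-z.]+' matches "dfs" followed by lower-case letters and dots.
echo "dfs.replication dfs.name.dir mapred.job.tracker" | grep -oE 'dfs[a-z.]+'
# prints:
#   dfs.replication
#   dfs.name.dir
```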
2.3 Stop Hadoop
bin/stop-all.sh
Congratulations! Now you have set up a Hadoop system. Next we will install RHadoop packages so that we can run R jobs on the Hadoop system.
3. Install R
Download R from http://www.r-project.org/ and install it. I have successfully run R v2.15.1, v2.15.2 and v3.1 on Hadoop.
4. Install RHadoop
4.1 Install GCC
Download GCC from https://github.com/kennethreitz/osx-gcc-installer and then install it. Without GCC, you will get the error "make: Command not found" when installing some R packages from source.
4.2 Install Homebrew
To install Homebrew, the current user account needs to be an administrator, or be granted administrator privileges using "su". In Terminal on Mac OS X, run the commands below.
su <administrator_account>
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew update
brew doctor
4.3 Install packages git, pkg-config and thrift
Without thrift, package rhbase will fail to install.
brew install git
brew install pkg-config
brew install thrift
4.4 Set environment variables
Environment variables for Hadoop can be set with the "export" command in Terminal, or with the R functions below.
Sys.setenv("HADOOP_CMD"="<path>/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="<path>/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")
Sys.getenv("HADOOP_CMD")
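For reference, the equivalent shell commands (run before starting R) would look like the sketch below. "&lt;path&gt;" is a placeholder; substitute your real Hadoop location.

```shell
# Shell equivalents of the Sys.setenv() calls above.
export HADOOP_CMD="<path>/hadoop-1.1.2/bin/hadoop"
export HADOOP_STREAMING="<path>/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar"
echo "$HADOOP_CMD"   # verify the variable is set
```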
4.5 Install R packages
Install some R packages that RHadoop depends on:
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))
4.6 Install RHadoop packages
Download packages rhdfs, rhbase and rmr2 from https://github.com/RevolutionAnalytics/RHadoop/wiki and then run the R code below.
install.packages("<path>/rhdfs_1.0.6.tar.gz", repos = NULL, type="source")
install.packages("<path>/rhbase_1.2.0.tar.gz", repos = NULL, type="source")
install.packages("<path>/rmr2_2.2.2.tar.gz", repos = NULL, type="source")
To make sure they have been installed successfully, load the three packages above with library() in R.
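The same check can be scripted from the shell; this is a hypothetical one-liner, assuming Rscript is on your PATH:

```shell
# Load the three RHadoop packages non-interactively; any package that
# failed to install will produce an error and a non-zero exit status.
Rscript -e 'library(rhdfs); library(rhbase); library(rmr2)'
```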
4.7 Further information
Prerequisites and installation instructions for rmr are available at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.
5. Run R jobs on Hadoop
Now you can try running an R job on Hadoop. Below is an example of R MapReduce code for word counting, provided in the presentation "Using R with Hadoop" by Jeffrey Breen at http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/.
Sys.setenv("HADOOP_HOME"="<path>/hadoop-1.1.2")
library(rmr2)
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return( keyval(words, 1) )
}

reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function(input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}
## read text files from folder wordcount/data
## save result in folder wordcount/out
## Submit job
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
out <- wordcount(hdfs.data, hdfs.out)
## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
head(results.df)
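Note that the job above reads its input from HDFS folder wordcount/data, so that folder must exist and contain some text files before the job is submitted. A hypothetical preparation step (run from the Hadoop directory, with Hadoop started) could be:

```shell
# Create the input folder on HDFS and upload some local text files to count.
# Using the conf/*.xml files here is just an example; any text files will do.
bin/hadoop fs -mkdir wordcount/data
bin/hadoop fs -put conf/*.xml wordcount/data
```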
After running the above code, you should see a list of words and their counts.
More examples of R jobs on Hadoop with rmr2 can be found at https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md and https://github.com/RevolutionAnalytics/rmr2/archive/master.zip.
All done! Now you have set up your own R Hadoop system (in a single-node mode). Enjoy MapReducing with R.
6. What's Next
The above Hadoop system runs in standalone mode on a single node. You might want to try the pseudo-distributed mode, and you might even want to run it in fully-distributed mode on a cluster of computers.
Instructions on setting up Hadoop in cluster mode are provided at Hadoop: From Single-Node Mode to Cluster Mode.
More details on setting up Hadoop in standalone mode, pseudo-distributed mode and fully-distributed mode can be found at links below.
Single-node mode and pseudo-distributed mode on a single node:
Fully-distributed cluster mode: