
Step-by-Step Guide to Setting Up an R-Hadoop System

30 May 2014

This is a step-by-step guide to setting up an R-Hadoop system. I have tested it both on a single computer and on a cluster of computers. Note that this process is for Mac OS X and some steps or settings might be different for Windows or Ubuntu.

To install Hadoop on Windows, you can find detailed instructions at

Below is a list of software used for this setup.

  • OS and other tools:
    • Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0
  • Hadoop and HBase:
    • Hadoop 1.1.2, HBase 0.94.17
  • R and RHadoop packages:
    • R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0

This process should work with Hadoop 2.2 or above and newer versions of HBase as well, but I haven't tested it yet.

Homebrew describes itself as "the missing package manager" for Mac OS X, and it is needed to install git, pkg-config and thrift. For other operating systems, the equivalents to Homebrew are apt-get on Ubuntu and yum on CentOS.

By the way, the two most painful steps in this process are setting up HBase on Hadoop in cluster mode and installing rhbase. If you want a quick start or are not going to use HBase, you do not need to install thrift, HBase or rhbase, and can therefore skip

  • step 3 - Install HBase,
  • step 5.4 - Install thrift 0.9.0, and
  • installing rhbase at step 7.3.
If you are going to set up RHadoop on Linux, see RHadoop Installation Guide for Red Hat Enterprise Linux.

1. Set up single-node Hadoop

If you are building a Hadoop system for the first time, it is best to start with stand-alone mode, and then switch to pseudo-distributed mode and cluster (fully distributed) mode.

1.1 Download Hadoop

Download Hadoop from http://hadoop.apache.org/releases.html#Download and then unpack it.

1.2 Set up Hadoop in standalone mode

1.2.1 Set JAVA_HOME

In file conf/hadoop-env.sh, add the line below:

export JAVA_HOME=/Library/Java/Home
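The step above can also be scripted. Below is a minimal sketch, not part of Hadoop itself: the helper name set_java_home is hypothetical, and the demonstration writes to a scratch file instead of a real conf/hadoop-env.sh. It appends the export line only when the file does not already contain it, so running it twice is harmless.

```shell
# set_java_home FILE JAVA_HOME_DIR  (hypothetical helper, not part of Hadoop)
# Appends "export JAVA_HOME=..." to FILE unless that exact line is already there.
set_java_home() {
  env_file=$1
  java_home=$2
  line="export JAVA_HOME=$java_home"
  # -x: match the whole line, -F: treat the line as a fixed string
  if ! grep -qxF "$line" "$env_file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$env_file"
  fi
}

# Demonstration on a scratch file; point it at conf/hadoop-env.sh for real use.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
touch "$demo_dir/hadoop-env.sh"
set_java_home "$demo_dir/hadoop-env.sh" /Library/Java/Home
set_java_home "$demo_dir/hadoop-env.sh" /Library/Java/Home   # second call adds nothing
cat "$demo_dir/hadoop-env.sh"
```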

1.2.2 Set up remote login and enable self-login

Open the “System Preferences” window, and click “Sharing” (under “Internet & Wireless”). Under the list of services, check “Remote Login”. For extra security, you can select the radio button for “Allow access for: Only these users” and select your account, which we assume is “hadoop”.

After that, save authorized keys so that you can log in to localhost without typing a password.

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
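Appending the key repeatedly leaves duplicate lines in authorized_keys, and OpenSSH can refuse key-based login when ~/.ssh or authorized_keys is writable by other users. The sketch below is one way to guard against both. The helper name add_authorized_key is hypothetical, and the demonstration uses a placeholder key in a scratch directory rather than a real key pair.

```shell
# add_authorized_key PUBKEY_FILE SSH_DIR  (hypothetical helper)
# Appends the public key to SSH_DIR/authorized_keys once and tightens permissions.
add_authorized_key() {
  pubkey_file=$1
  ssh_dir=$2
  mkdir -p "$ssh_dir"
  touch "$ssh_dir/authorized_keys"
  # Skip the append when the exact key line is already present.
  if ! grep -qxF "$(cat "$pubkey_file")" "$ssh_dir/authorized_keys"; then
    cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
  fi
  chmod 700 "$ssh_dir"
  chmod 600 "$ssh_dir/authorized_keys"
}

# Demonstration with a placeholder key (not a real one) in a scratch directory.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
echo "ssh-rsa AAAAEXAMPLEKEY hadoop@localhost" > "$demo_dir/id_rsa.pub"
add_authorized_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"
add_authorized_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"   # no duplicate is added
wc -l < "$demo_dir/.ssh/authorized_keys"
```

For real use, call it as add_authorized_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh"; afterwards, ssh localhost should log in without asking for a password.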

The above step to set up remote login and self-login was taken from http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29, which provides detailed instructions for setting up Hadoop on Mac.

1.2.3 Run Hadoop

After that, run the commands below in a terminal to check whether Hadoop has been installed properly in stand-alone mode.

## go to hadoop directory
cd hadoop-1.1.2
## see a list of Hadoop commands
bin/hadoop
## version of Hadoop
bin/hadoop version
## start Hadoop
bin/start-all.sh
## check Hadoop is running
jps
## stop Hadoop
bin/stop-all.sh

After running jps, you should see the services listed below.


              Hadoop 1.1.2         Hadoop 2.2.0 or above
master node   NameNode             NameNode
              SecondaryNameNode    ResourceManager
              JobTracker           JobHistoryServer
slave node    DataNode             DataNode
              TaskTracker          NodeManager
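Checking the jps listing by eye is easy to get wrong, since NameNode also appears inside SecondaryNameNode. The sketch below compares whole names instead. The helper name missing_daemons is hypothetical and the sample listing is made up for illustration; for real use, pass "$(jps)" as the first argument.

```shell
# missing_daemons "JPS_OUTPUT" NAME...  (hypothetical helper)
# Prints each expected daemon name that does not appear in the jps output.
missing_daemons() {
  jps_output=$1
  shift
  for daemon in "$@"; do
    # jps prints "<pid> <name>" per line; compare against the second field only.
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qxF "$daemon"; then
      echo "$daemon"
    fi
  done
}

# Sample listing (made up); the process IDs are arbitrary.
sample='1201 NameNode
1302 DataNode
1410 SecondaryNameNode
1533 Jps'
missing_daemons "$sample" NameNode SecondaryNameNode JobTracker DataNode TaskTracker
```

With this sample, the two MapReduce daemons of Hadoop 1.1.2 are reported as missing, which would suggest that HDFS started but MapReduce did not.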

1.3 Test Hadoop

Then we test Hadoop with two examples to make sure that it works.

1.3.1 Example 1 - calculate pi

bin/hadoop jar hadoop-examples-*.jar pi 10 100

In the above code, the first argument (10) is the number of maps and the second (100) is the number of samples per map. A more accurate value of pi can be obtained by setting a larger value for the second argument, at the cost of a longer run.
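To see why more samples give a better estimate, here is a rough local sketch of the underlying idea; it does not use Hadoop. The Hadoop example uses a quasi-Monte Carlo method, whereas this sketch draws plain pseudo-random points with awk, and the values of maps and samples are assumptions chosen for illustration. The fraction of points in the unit square that fall inside the quarter circle approaches pi/4.

```shell
# Assumed values for illustration: 10 "maps", each drawing 10000 random points.
maps=10
samples=10000
pi_estimate=$(awk -v maps="$maps" -v samples="$samples" 'BEGIN {
  srand(42)                              # fixed seed so repeated runs agree
  inside = 0
  total = maps * samples
  for (i = 0; i < total; i++) {
    x = rand(); y = rand()
    if (x * x + y * y <= 1) inside++     # point falls inside the quarter circle
  }
  # (area of quarter circle) / (area of unit square) = pi / 4
  printf "%.4f", 4 * inside / total
}')
echo "$pi_estimate"
```

Lowering samples to 100 makes the estimate noticeably noisier, which mirrors the effect of the second argument in the Hadoop command.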

1.3.2 Example 2 - distributed grep

In this example, all files in the local folder hadoop-1.1.2/conf are copied to an HDFS directory input, to be used as input for pattern searching. Of course, you can use other available text files as input.

## copy files
bin/hadoop fs -put conf input
## run distributed grep, and save results in directory *output*
## The pattern to find is 'dfs[a-z.]+'.
## Change it to 'df[a-z.]+' or 'd[a-z.]+' to get more results.
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
## copy result from HDFS directory *output* to local directory *output*
bin/hadoop fs -get output output
## have a look at results
cat output/*
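To know roughly what to expect from the Hadoop job, the same search can be sketched locally with standard tools. This is only an approximation of what the example computes (matched strings and how often each occurs), and the two input files below are made up for illustration.

```shell
# Scratch input files standing in for the conf directory (contents are made up).
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
printf 'dfs.replication=1\ndfs.name.dir=/tmp/name\n' > "$demo_dir/a.conf"
printf 'dfs.replication=3\nmapred.job.tracker=local\n' > "$demo_dir/b.conf"

# -o prints only the matched text, -h omits file names, -E enables the + operator.
# sort | uniq -c counts each distinct match; the final sort lists the most frequent first.
grep_counts=$(grep -ohE 'dfs[a-z.]+' "$demo_dir"/*.conf | sort | uniq -c | sort -rn)
echo "$grep_counts"
```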

2. Set up Hadoop in cluster mode

If your Hadoop works in stand-alone mode, you can then proceed to cluster (fully distributed) mode.

2.1 Switching between different modes

You may want to keep settings for all three modes, because you will likely need to switch between them when troubleshooting the HBase and RHadoop installation at later stages. It is therefore a good idea to keep the settings for the three modes in three separate directories, conf.single, conf.pseudo and conf.cluster, and use one of the commands below to choose a specific setting. The same applies to HBase settings.

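## run ONE of the following; conf must be a symbolic link, not the original directory
## -f replaces the existing link, -n stops ln from following it
ln -sfn conf.single conf
ln -sfn conf.pseudo conf
ln -sfn conf.cluster conf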
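Switching modes can be wrapped in a small function that refuses to link to a settings directory that does not exist. This is a sketch under stated assumptions: the helper name use_conf is hypothetical, it must be run from the Hadoop directory, and conf must be a symbolic link (or absent) rather than the original conf directory. The demonstration uses empty scratch directories.

```shell
# use_conf MODE  (hypothetical helper): point the "conf" link at conf.MODE.
use_conf() {
  mode=$1
  if [ ! -d "conf.$mode" ]; then
    echo "no such settings directory: conf.$mode" >&2
    return 1
  fi
  # -s symbolic, -f replace an existing link, -n do not follow the old link
  ln -sfn "conf.$mode" conf
}

# Demonstration in a scratch directory with empty settings directories.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
cd "$demo_dir" || exit 1
mkdir conf.single conf.pseudo conf.cluster
use_conf single
use_conf pseudo
readlink conf
```

Without -n, a second plain ln -s would follow the existing link and create the new link inside the old settings directory, which is why the flag is used here.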

2.2 Set up name node (master machine)

Configure the following three files on the master machine:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml

Then set the masters and slaves files:

  • file “masters”: IP address or hostname of namenode (master machine)
  • file “slaves”: a list of IP addresses or hostnames of datanodes (slave machines)
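As a rough illustration of what these five files can contain for Hadoop 1.x, the sketch below writes minimal versions into a scratch directory. The host names (master, slave1, slave2), the ports 9000 and 9001, and the replication factor 2 are assumptions for illustration only, and the helper write_site is hypothetical; a real cluster usually needs more properties than the single one per file shown here.

```shell
# Assumed names for illustration: hosts "master", "slave1", "slave2"; ports 9000 and 9001.
master=master
conf_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")   # use conf.cluster for real

# write_site FILE NAME VALUE  (hypothetical helper): writes a one-property site file.
write_site() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

write_site "$conf_dir/core-site.xml"   fs.default.name    "hdfs://$master:9000"
write_site "$conf_dir/hdfs-site.xml"   dfs.replication    2
write_site "$conf_dir/mapred-site.xml" mapred.job.tracker "$master:9001"

# masters holds the master host; slaves lists one data node per line.
echo "$master" > "$conf_dir/masters"
printf '%s\n' slave1 slave2 > "$conf_dir/slaves"
cat "$conf_dir/core-site.xml"
```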

2.3 Set JAVA_HOME, set up remote login and enable self-login on all nodes

This is similar to steps 1.2.1 and 1.2.2.

2.4 Copy public key

Copy the public key created on master node to all slave nodes.
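One way to do this is to append the master's public key to authorized_keys on each slave over ssh. The sketch below only prints the commands instead of running them, so that they can be reviewed first. The helper name print_key_copy_commands is hypothetical, and the host names and the account hadoop in the demonstration are assumptions.

```shell
# print_key_copy_commands SLAVES_FILE USER  (hypothetical helper)
# Prints, without running, one command per slave that appends the master's
# public key to that slave's authorized_keys.
print_key_copy_commands() {
  slaves_file=$1
  user=$2
  while IFS= read -r host; do
    [ -z "$host" ] && continue      # skip blank lines
    echo "cat ~/.ssh/id_rsa.pub | ssh $user@$host 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'"
  done < "$slaves_file"
}

# Demonstration with a made-up slaves file.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
printf '%s\n' slave1 slave2 > "$demo_dir/slaves"
key_commands=$(print_key_copy_commands "$demo_dir/slaves" hadoop)
echo "$key_commands"
```

Each printed command will ask for the slave's password once; after that, ssh from the master to that slave should work without a password.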

2.5 Firewall

Enable incoming connections for Java on all machines; otherwise, the slaves will not be able to receive any jobs.

2.6 Set up data nodes (slave machines)

Tar the hadoop directory on master node, copy it to all slaves and then untar it.
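The pack and unpack halves of this step can be sketched as follows. The demonstration runs entirely in a scratch directory, with sub-directories master and slave standing in for the two machines and a placeholder file standing in for the real configuration; the copy between machines (for example with scp) is left out.

```shell
# Scratch layout: "master" and "slave" sub-directories stand in for two machines.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
mkdir -p "$demo_dir/master/hadoop-1.1.2/conf" "$demo_dir/slave"
echo "placeholder" > "$demo_dir/master/hadoop-1.1.2/conf/core-site.xml"

# On the master: pack the directory (-C makes paths relative to its parent directory).
tar -czf "$demo_dir/hadoop-1.1.2.tar.gz" -C "$demo_dir/master" hadoop-1.1.2

# Copying the archive to each slave would happen here.

# On each slave: unpack under the corresponding parent directory.
tar -xzf "$demo_dir/hadoop-1.1.2.tar.gz" -C "$demo_dir/slave"
ls "$demo_dir/slave/hadoop-1.1.2/conf"
```

Keeping the Hadoop directory at the same path on every node is the usual arrangement, since the settings copied from the master then apply unchanged.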

2.7 Format name node

Go to the Hadoop directory and run

bin/hadoop namenode -format

2.8 Run Hadoop

Start Hadoop

bin/start-all.sh

Monitor nodes and jobs with browser:

  • Namenode and HDFS file system: http://IP_ADDR_OF_NAMENODE:50070
  • Hadoop job tracker: http://IP_ADDR_OF_NAMENODE:50030

Stop Hadoop and MapReduce:

bin/stop-all.sh

2.9 Test Hadoop

To test Hadoop in cluster mode, use the same code given at step 1.3.

2.10 Further Information

More instructions on setting up Hadoop are available at the links below.

2.10.1 Single-node mode

2.10.2 Cluster mode


3. Set up HBase

3.1 Set up HBase

You can skip this step if you are not going to use HBase.

See links below for detailed instructions on setting up HBase on Hadoop.

I used the settings given in section 2.4 - Example Configurations at this link to set up HBase in fully distributed mode.

3.2 Switching between different modes

As with Hadoop, it is best to start with stand-alone mode. After that, you can switch to pseudo-distributed or cluster mode. It is still worth keeping the settings for all three modes, e.g., in case you need to switch between them when you install RHadoop at a later stage. See step 2.1 for details about switching between different modes.


4. Install R

The version of R that I used is 3.1.0, the latest version as of May 2014. I previously set up an R-Hadoop system with R 2.15.2, so the process should work with R 2.15.2 and above.

It is recommended to install RStudio as well, if it is not installed yet. It makes R programming and managing R projects easier, although it is not mandatory.


5. Install GCC, Homebrew, git, pkg-config and thrift

GCC, Homebrew, git, pkg-config and thrift are mandatory for installing rhbase. If you do not use HBase or rhbase, you do not need to install pkg-config or thrift.

5.1 Download and install GCC

Download GCC at https://github.com/kennethreitz/osx-gcc-installer. Without GCC, you will get the error “Make Command Not Found” when installing some R packages from source.

5.2 Install Homebrew

Homebrew is a package manager for Mac OS X. To install Homebrew, the current user account needs to be an administrator, or you need to switch to an administrator account using “su”.

su <administrator_account>
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew update
brew doctor

Refer to the Homebrew website at http://brew.sh if you run into any errors at the above step.

5.3 Install git and pkg-config

brew install git
brew install pkg-config

5.4 Install thrift 0.9.0

Thrift is needed for installing rhbase. If you do not use HBase, you can skip the thrift installation.

Install thrift 0.9.0 instead of 0.9.1. I first installed thrift 0.9.1 (which was the latest version at that time), and found that it didn't work well for rhbase installation. It was then a painful process to figure out the reason, uninstall 0.9.1 and install 0.9.0.

Do NOT run the command below, which will install the latest version of thrift (0.9.1 as of 9 May 2014).

## Do NOT run command below !!!
brew install thrift

Instead, follow steps below to install thrift 0.9.0.

$ brew versions thrift
Warning: brew-versions is unsupported and may be removed soon.
Please use the homebrew-versions tap instead:
  https://github.com/Homebrew/homebrew-versions
0.9.1    git checkout eccc96b Library/Formula/thrift.rb
0.9.0    git checkout c43fc30 Library/Formula/thrift.rb
0.8.0    git checkout e5475d9 Library/Formula/thrift.rb
0.7.0    git checkout 141ddb6 Library/Formula/thrift.rb
...

Find the formula for thrift 0.9.0 in the above list, and install with that formula.

## go to the homebrew base directory
cd $( brew --prefix )
## check out thrift 0.9.0
git checkout c43fc30 Library/Formula/thrift.rb
## install thrift
brew install thrift

Then we check whether pkg-config path is correct.

pkg-config --cflags thrift

The above command should return -I/usr/local/Cellar/thrift/0.9.0/include/thrift or -I/usr/local/include/thrift. Note that it should end with /include/thrift instead of /include. Otherwise, you will come across errors saying that some .h files cannot be found when installing rhbase.
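The check on the include path can be automated with a small test. This is a sketch: the helper name thrift_cflags_ok is hypothetical, and the demonstration feeds it two sample strings rather than calling pkg-config, so that it runs even where thrift is not installed.

```shell
# thrift_cflags_ok "CFLAGS"  (hypothetical helper)
# Succeeds when some -I entry in the flags ends with /include/thrift.
thrift_cflags_ok() {
  # $1 is left unquoted on purpose so that it splits into separate flags.
  for flag in $1; do
    case $flag in
      -I*/include/thrift) return 0 ;;
    esac
  done
  return 1
}

# Sample strings; for real use: thrift_cflags_ok "$(pkg-config --cflags thrift)"
for sample in "-I/usr/local/Cellar/thrift/0.9.0/include/thrift" "-I/usr/local/include"; do
  if thrift_cflags_ok "$sample"; then
    echo "ok:  $sample"
  else
    echo "bad: $sample"
  fi
done
```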

If you have any problem with installing thrift 0.9.0, see details about how to install a specific version of formula with Homebrew at http://stackoverflow.com/questions/3987683/homebrew-install-specific-version-of-formula.

5.5 More instructions

If there are problems with installing other packages above, more instructions can be found at links below.

Note that there are some differences between this process and the instructions from the links below. For example, on Mac there is no libthrift-0.9.0.so but libthrift-0.9.0.dylib, so I didn't run the command below to copy the Thrift library.

sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/

6. Environment settings

Run code below in R to set environment variables for Hadoop.

Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")

Alternatively, add the lines below to ~/.bashrc so that you don't need to set the variables every time.

export HADOOP_PREFIX=/Users/hadoop/hadoop-1.1.2
export HADOOP_CMD=/Users/hadoop/hadoop-1.1.2/bin/hadoop
export HADOOP_STREAMING=/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
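A wrong path in any of these variables tends to surface only later as a failed job, so it can help to verify them right away. Below is a minimal sketch: the helper name check_hadoop_env is hypothetical, and the demonstration builds a fake installation of empty files in a scratch directory so that the check can run without Hadoop.

```shell
# check_hadoop_env  (hypothetical helper)
# Reports every variable that does not point at an existing file or directory.
check_hadoop_env() {
  status=0
  if [ ! -d "$HADOOP_PREFIX" ]; then
    echo "HADOOP_PREFIX is not a directory: $HADOOP_PREFIX"; status=1
  fi
  if [ ! -x "$HADOOP_CMD" ]; then
    echo "HADOOP_CMD is not executable: $HADOOP_CMD"; status=1
  fi
  if [ ! -f "$HADOOP_STREAMING" ]; then
    echo "HADOOP_STREAMING jar not found: $HADOOP_STREAMING"; status=1
  fi
  return $status
}

# Fake installation made of empty files, for demonstration only.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
mkdir -p "$demo_dir/hadoop-1.1.2/bin" "$demo_dir/hadoop-1.1.2/contrib/streaming"
touch "$demo_dir/hadoop-1.1.2/bin/hadoop"
touch "$demo_dir/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar"
chmod +x "$demo_dir/hadoop-1.1.2/bin/hadoop"

export HADOOP_PREFIX="$demo_dir/hadoop-1.1.2"
export HADOOP_CMD="$HADOOP_PREFIX/bin/hadoop"
export HADOOP_STREAMING="$HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.1.2.jar"
check_hadoop_env && echo "environment looks consistent"
```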

7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr

7.1 Install relevant R packages

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest",
"functional", "stringr", "plyr", "reshape2", "dplyr",
"R.methodsS3", "caTools", "Hmisc"))

RHadoop packages depend on the above packages, which should be installed for all users rather than in a personal library. Otherwise, you may see RHadoop jobs fail with an error saying “package *** is not installed”. For example, to make sure that package functional is installed in the correct library, run the commands below; it should be in path /Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional, instead of /Users/YOUR_USER_ACCOUNT/Library/R/3.1/library/functional. If it is in the library under your user account, you need to reinstall it to /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. If your account has no write access to that library, use an administrator account.

The destination library can be set with argument lib of function install.packages() (see the example below). In RStudio, it can be chosen from the drop-down list under “Install to library” in the Install Packages pop-up window.

## find your R libraries
.libPaths()
#"/Users/hadoop/Library/R/3.1/library"
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

## check which library a package was installed into
system.file(package="functional")
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional"

## install package to a specific library
install.packages("functional", lib="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
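The same check can be run from the terminal for several packages at once, which is convenient when repeating it on every node. The sketch below relies on the fact that an installed R package is a directory containing a DESCRIPTION file. The helper name pkg_in_lib is hypothetical, and the two library directories are scratch stand-ins for the system and personal libraries named above.

```shell
# pkg_in_lib PACKAGE LIBRARY_DIR  (hypothetical helper)
# An installed R package is a directory that contains a DESCRIPTION file.
pkg_in_lib() {
  [ -f "$2/$1/DESCRIPTION" ]
}

# Scratch stand-ins for the two libraries; replace with the real paths.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
system_lib="$demo_dir/system-library"   # e.g. .../Versions/3.1/Resources/library
user_lib="$demo_dir/user-library"       # e.g. ~/Library/R/3.1/library
mkdir -p "$system_lib/rJava" "$user_lib/functional"
touch "$system_lib/rJava/DESCRIPTION" "$user_lib/functional/DESCRIPTION"

for pkg in rJava functional stringr; do
  if pkg_in_lib "$pkg" "$system_lib"; then
    echo "$pkg: installed for all users"
  elif pkg_in_lib "$pkg" "$user_lib"; then
    echo "$pkg: only in the personal library, reinstall it for all users"
  else
    echo "$pkg: not installed"
  fi
done
```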

In addition to the above packages, it is also a good idea to install data.table. Without it, I came across an error when running an RHadoop job on a big dataset, although the same job worked fine on a smaller dataset. The reason could be that RHadoop uses data.table to handle large data.

install.packages("data.table")

7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING

Set environment variables for Hadoop, if you haven't done so at step 6.

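Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")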

7.3 Install RHadoop packages

Download packages rhdfs, rhbase, rmr2 and plyrmr from https://github.com/RevolutionAnalytics/RHadoop/wiki and install them. As in step 7.1, these packages need to be installed to a library for all users, instead of to a personal library. Otherwise, R-Hadoop jobs will fail on those nodes where the packages are not installed in the right library.

install.packages("<path>/rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("<path>/rmr2_3.1.0.tar.gz", repos=NULL, type="source")
install.packages("<path>/plyrmr_0.2.0.tar.gz", repos=NULL, type="source")
install.packages("<path>/rhbase_1.2.0.tar.gz", repos=NULL, type="source")

7.4 Further information

If you follow the above instructions but still come across errors at this step, refer to rmr prerequisites and installation at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.


8. Run an R job on Hadoop

Below is an example to count words in text files from HDFS folder wordcount/data. The R code is from Jeffrey Breen's presentation on Using R with Hadoop.

First, we copy some text files to HDFS folder wordcount/data.

## copy local text file to hdfs
bin/hadoop fs -copyFromLocal /Users/hadoop/try-hadoop/wordcount/data/*.txt wordcount/data/

After that, we can use R code below to run a Hadoop job for word counting.

Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")

library(rmr2)

## map function
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return( keyval(words, 1) )
}

## reduce function
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function (input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text",
            map=map, reduce=reduce)
}

## delete previous result if any
system("/Users/hadoop/hadoop-1.1.2/bin/hadoop fs -rmr wordcount/out")

## Submit job
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
out <- wordcount(hdfs.data, hdfs.out)

## Fetch results from HDFS
results <- from.dfs(out)

## check top 30 frequent words
results.df <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results.df) <- c('word', 'count')
head(results.df[order(results.df$count, decreasing=TRUE), ], 30)
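To sanity-check the result of the Hadoop job on a small sample, the same map, group and reduce stages can be sketched as a local pipeline. This is only an analogy to the R code above, not a replacement for it: the two input files are made up, and tr -s squeezes runs of whitespace, whereas the R map function splits on every single whitespace character.

```shell
# Made-up input files standing in for wordcount/data.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/rhadoop.XXXXXX")
printf 'the cat sat\non the mat\n' > "$demo_dir/a.txt"
printf 'the end\n' > "$demo_dir/b.txt"

# "map":     tr puts each word on its own line
# "shuffle": sort brings identical words together
# "reduce":  uniq -c counts each group
# The last two commands rank the words by count and keep the top 30.
word_counts=$(cat "$demo_dir"/*.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn | head -30)
echo "$word_counts"
```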

If you can see a list of words and their frequencies, congratulations! You are now ready to do MapReduce work with R.


9. Setting up multiple users

Now you might want to set up accounts for other users to use Hadoop. Detailed instructions on that can be found at Setting Up Multiple Users in Hadoop Clusters.


10. Further readings

More examples of R jobs on Hadoop with rmr2 can be found at

To learn MapReduce and Hadoop, below are some documents to read.

Besides RHadoop, another way to run R jobs on Hadoop is using RHIPE.


11. Contact and feedback

If you have successfully built your R-Hadoop system, could you please share your success with R users at this thread in the RDataMining group? Please also do not forget to forward this tutorial to your friends and colleagues who are interested in running R on Hadoop.

If you have any comments or suggestions, or find errors in the above process, please feel free to contact Yanchang Zhao yanchang@rdatamining.com, or post your questions to my RDataMining group on LinkedIn.

Thanks.