30 May 2014

If you are going to set up RHadoop on Linux, see the RHadoop Installation Guide for Red Hat Enterprise Linux.
This is a step-by-step guide to setting up an R-Hadoop system. I have tested it both on a single computer and on a cluster of computers. Note that this process is for Mac OS X and some steps or settings might be different for Windows or Ubuntu.
To install Hadoop on Windows, you can find detailed instructions at
Below is a list of software used for this setup.
This process should work with Hadoop 2.2 or above and newer versions of HBase as well, but I haven't tested it yet.
Homebrew, “the missing package manager for OS X”, is needed to install git, pkg-config and thrift. On other operating systems, the equivalents of Homebrew are apt-get on Ubuntu and yum on CentOS.
By the way, two painful steps in this process are setting up HBase on Hadoop in cluster mode and installing rhbase. If you want a quick start or are not going to use HBase, you do not need to install thrift, HBase or rhbase, and can therefore skip those steps.
If you are building a Hadoop system for the first time, it is suggested that you start with stand-alone mode, and then switch to pseudo-distributed mode and cluster (fully-distributed) mode.
Download Hadoop from http://hadoop.apache.org/releases.html#Download and then unpack it.
In file conf/hadoop-env.sh, add the line below:
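For Hadoop 1.x on OS X, the line to add sets JAVA_HOME. A minimal sketch is below; the java_home helper is standard on OS X, but verify the path on your own machine.

```shell
# conf/hadoop-env.sh: tell Hadoop which JVM to use.
# On OS X, /usr/libexec/java_home prints the path of the active JDK.
export JAVA_HOME=$(/usr/libexec/java_home)
```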
Open the “System Preferences” window, and click “Sharing” (under “Internet & Wireless”). Under the list of services, check “Remote Login”. For extra security, you can hit the radio button for “Allow access for only these Users” and select your account, which we assume is “hadoop”.
After that, save authorized keys so that you can log in localhost without typing a password.
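A typical way to do this, assuming an RSA key in the default location, is:

```shell
# Generate a pass-phrase-less key pair if you don't already have one,
# then authorize it for logging in to localhost.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# This should now log in without asking for a password.
ssh localhost 'echo ok'
```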
The above step to set up remote desktop and self-login was picked up from http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29, which provides detailed instructions to set up Hadoop on Mac.
After that, run the commands below in a terminal to check whether Hadoop has been installed properly in stand-alone mode.
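For example, assuming Hadoop 1.1.2 was unpacked into a directory named hadoop-1.1.2:

```shell
cd hadoop-1.1.2
# If the installation is fine, this prints the Hadoop version.
bin/hadoop version
```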
Then we test Hadoop with two examples to make sure that it works.
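The first example estimates pi using the examples jar shipped with the distribution; the jar name below is the one in Hadoop 1.1.2, so adjust it for your version.

```shell
# Estimate pi: 10 maps, 100 samples per map.
bin/hadoop jar hadoop-examples-1.1.2.jar pi 10 100
```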
In the above code, the first argument (10) is the number of maps and the second is the number of samples per map. A more accurate value of pi can be obtained by supplying a larger value for the second argument, which in turn takes longer to run.
In this example, all files in the local folder hadoop-1.1.2/conf are copied to an HDFS directory input, to be used as input for pattern searching. Of course, you can use other available text files as input.
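A sketch of this example, assuming pseudo-distributed mode and the Hadoop 1.1.2 examples jar:

```shell
# Copy the conf directory into HDFS as "input", search it for words
# matching the regular expression, and print the matches found.
bin/hadoop fs -put conf input
bin/hadoop jar hadoop-examples-1.1.2.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -cat 'output/part-*'
```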
If your Hadoop works in stand-alone mode, you can then proceed to cluster (fully-distributed) mode.
You may want to keep the settings for all three modes, because you will likely need to switch between modes for trouble-shooting during HBase and RHadoop installation at later stages. Therefore, it is suggested that you keep the settings for the three modes in three separate directories, conf.single, conf.pseudo and conf.cluster, and use the commands below to choose a specific setting. The same applies to the HBase settings.
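One simple way to switch is to make conf a symlink pointing at the directory for the current mode. The sketch below demonstrates the idea in a scratch directory; in practice you would run the ln command inside the Hadoop (or HBase) directory.

```shell
# Demonstrate configuration switching with a symlink, in a scratch
# directory so the example is self-contained.
cd "$(mktemp -d)"
mkdir conf.single conf.pseudo conf.cluster
# Point "conf" at the desired mode; -n replaces an existing link.
ln -sfn conf.cluster conf
readlink conf
```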
Configure the following 3 files on master machine
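The original file contents are not reproduced here; for Hadoop 1.x, the three files are typically core-site.xml, mapred-site.xml and hdfs-site.xml, and a minimal sketch might look as follows. The hostname master, the ports and the replication factor are placeholders to adapt to your cluster.

```xml
<!-- conf/core-site.xml: where HDFS lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: where the JobTracker lives -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: replicate each block to 2 data nodes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```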
Set masters and slaves files
This is similar to step 1.2.2.
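The masters and slaves files simply list host names, one per line. The sketch below writes them in a scratch directory with placeholder host names; in practice, create them in the conf directory on the master.

```shell
# Self-contained demo in a scratch directory; "master", "slave1" and
# "slave2" are placeholder host names.
cd "$(mktemp -d)"
echo "master" > masters
printf "slave1\nslave2\n" > slaves
cat masters slaves
```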
Copy the public key created on master node to all slave nodes.
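For example, assuming the account hadoop and a slave named slave1 (both placeholders):

```shell
# Append the master's public key to the slave's authorized_keys so the
# master can log in to the slave without a password.
cat ~/.ssh/id_rsa.pub | \
  ssh hadoop@slave1 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
```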
Enable incoming connections for Java on all machines; otherwise, the slaves will not be able to receive any jobs.
Tar the hadoop directory on master node, copy it to all slaves and then untar it.
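For example, with the placeholder slave slave1 and user hadoop:

```shell
# On the master node: pack the configured Hadoop directory.
tar -czf hadoop.tar.gz hadoop-1.1.2
# Copy it to a slave and unpack it there; repeat for every slave.
scp hadoop.tar.gz hadoop@slave1:~/
ssh hadoop@slave1 'tar -xzf hadoop.tar.gz'
```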
Go to the Hadoop directory and run
Monitor nodes and jobs in a browser:
Stop Hadoop and MapReduce:
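With Hadoop 1.x, starting and stopping the whole cluster, and the default web UI ports, look like this (run on the master node):

```shell
cd hadoop-1.1.2
# Start the HDFS and MapReduce daemons on master and slaves.
bin/start-all.sh
# Default web UIs: NameNode at http://master:50070,
# JobTracker at http://master:50030.
# Stop all daemons when finished.
bin/stop-all.sh
```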
To test Hadoop in cluster mode, use the same code given at step 1.3.
More instructions on setting up Hadoop are available at the links below.
You can skip this step if you are not going to use HBase.
See links below for detailed instructions on setting up HBase on Hadoop.
I used the settings given in section 2.4 - Example Configurations at this link to set up HBase in fully distributed mode.
As with Hadoop, it is suggested that you start with stand-alone mode first. After that, you can switch to pseudo-distributed or cluster mode. However, it is suggested that you keep the settings for all three modes, e.g., for possible switching between modes when you install RHadoop at a later stage. See step 2.1 for details about switching between modes.
The version of R that I used is 3.1.0, the latest version as of May 2014. I previously set up an R-Hadoop system with R 2.15.2, so this process should work with other versions of R, at least with R 2.15.2 and above.
It is also recommended to install RStudio, if it is not installed yet. It is not mandatory, but it makes R programming and managing R projects easier.
GCC, Homebrew, git, pkg-config and thrift are mandatory for installing rhbase. If you do not use HBase or rhbase, you do not need to install pkg-config or thrift.
Download GCC at https://github.com/kennethreitz/osx-gcc-installer. Without GCC, you will get the error “Make Command Not Found” when installing some R packages from source.
Homebrew is “the missing package manager for OS X”. To install Homebrew, the current user account needs to be an administrator or be granted administrator privileges using “su”.
Refer to the Homebrew website at http://brew.sh if you run into any errors at the above step.
Thrift is needed for installing rhbase. If you do not use HBase, you can skip the thrift installation.
Install thrift 0.9.0 instead of 0.9.1. I first installed thrift 0.9.1 (the latest version at that time) and found that it didn't work well for rhbase installation. It was then a painful process to figure out the cause, uninstall 0.9.1 and install 0.9.0.
Do NOT run the command below, which would install the latest version of thrift (0.9.1 as of 9 May 2014).
Instead, follow steps below to install thrift 0.9.0.
Find the formula for thrift 0.9.0 in above list, and install with that formula.
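With Homebrew of that era, brew versions listed the formula revision for each released version. The steps below sketch the idea; <commit> stands for the hash printed for 0.9.0, so do not copy it literally.

```shell
cd "$(brew --prefix)"
# List the formula revisions for each released thrift version.
brew versions thrift
# Check out the formula revision for 0.9.0; replace <commit> with the
# hash printed above next to version 0.9.0.
git checkout <commit> Library/Formula/thrift.rb
brew install thrift
```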
Then we check whether the pkg-config path is correct.
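The check asks pkg-config for thrift's compiler flags:

```shell
# Print the C compiler flags that pkg-config reports for thrift; the
# include path should end with /include/thrift.
pkg-config --cflags thrift
```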
The above command should return -I/usr/local/Cellar/thrift/0.9.0/include/thrift or -I/usr/local/include/thrift. Note that it should end with /include/thrift instead of /include; otherwise, you will come across errors saying that some .h files cannot be found when installing rhbase.
If you have any problem with installing thrift 0.9.0, see details about how to install a specific version of formula with Homebrew at http://stackoverflow.com/questions/3987683/homebrew-install-specific-version-of-formula.
If there are problems with installing other packages above, more instructions can be found at links below.
Note that there are some differences between this process and the instructions at the links below. For example, on Mac OS X there is no libthrift-0.9.0.so but rather libthrift-0.9.0.dylib, so I didn't run the command below to copy the Thrift library.
Run code below in R to set environment variables for Hadoop.
Alternatively, add above to ~/.bashrc so that you don't need to set them every time.
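For example, with Hadoop 1.1.2 installed under the hadoop user's home directory (adjust the paths to your own installation), the two variables that rmr2 needs are:

```shell
# Paths below assume Hadoop 1.1.2 under /Users/hadoop; adjust them.
export HADOOP_CMD=/Users/hadoop/hadoop-1.1.2/bin/hadoop
export HADOOP_STREAMING=/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
```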
RHadoop packages depend on the above packages, which should be installed into the system-wide library for all users, not into a personal library. Otherwise, RHadoop jobs may fail with an error saying “package *** is not installed”. For example, to make sure that package functional is installed in the correct library, run the commands below; it should be in /Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional, instead of /Users/YOUR_USER_ACCOUNT/Library/R/3.1/library/functional. If it is in the library under your user account, you need to reinstall it into /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. If your account has no access to that library, use an administrator account.
The destination library can be set with the lib argument of function install.packages().
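For example, from a terminal; the package list below is the set of rmr2/rhdfs prerequisites named in the RHadoop documentation, and the library path matches R 3.1 on Mac OS X, so adjust both as needed.

```shell
# Install the dependencies into the system-wide library
# (requires administrator rights).
sudo Rscript -e 'install.packages(
  c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional",
    "stringr", "plyr", "reshape2", "caTools"),
  repos = "http://cran.r-project.org",
  lib = "/Library/Frameworks/R.framework/Versions/3.1/Resources/library")'
# Check which library a package was installed into.
Rscript -e 'find.package("functional")'
```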
In addition to above packages, you are also suggested to install
Set environment variables for Hadoop, if you haven't done so at step 6.
If you follow above instructions but still come across errors at this step, refer to rmr prerequisites and installation at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.
Below is an example to count words in text files from HDFS folder wordcount/data. The R code is from Jeffrey Breen's presentation on Using R with Hadoop.
First, we copy some text files to HDFS folder wordcount/data.
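For example (the local path is a placeholder for your own text files):

```shell
# Create the HDFS folder and copy some local text files into it.
hadoop fs -mkdir wordcount/data
hadoop fs -put /path/to/local/text/files/*.txt wordcount/data
```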
If you can see a list of words and their frequencies, congratulations! You are now ready to do MapReduce work with R.
Now you might want to set up accounts for other users to use Hadoop. Detailed instructions on that can be found at Setting Up Multiple Users in Hadoop Clusters.
More examples of R jobs on Hadoop with rmr2 can be found at
To learn MapReduce and Hadoop, below are some documents to read.
Besides RHadoop, another way to run R jobs on Hadoop is using RHIPE.
If you have successfully set up your R-Hadoop system, could you please share your success with R users at this thread in the RDataMining group? Please also do not forget to forward this tutorial to your friends and colleagues who are interested in running R on Hadoop.
If you have any comments or suggestions, or find errors in above process, please feel free to contact Yanchang Zhao email@example.com, or post your questions to my RDataMining group on LinkedIn.