The information provided in this page might be out-of-date. Please see a newer version at Step-by-Step Guide to Setting Up an R-Hadoop System.
This page shows how to build an R Hadoop system, and presents the steps to set up my first R Hadoop system in single-node mode on Mac OS X.
After reading documents and tutorials on MapReduce and Hadoop and playing with RHadoop for about 2 weeks, I have finally built my first R Hadoop system and successfully run some R examples on it. Here I’d like to share my experience and the steps I took. Hopefully this will make it easier for R users who are new to Hadoop to try RHadoop. Note that I tried this on Mac only, and some steps might be different on Windows.
Before going through the complex steps below, let’s have a look at what you can get, to give you some motivation to continue. There is a video showing Wordcount MapReduce in R at http://www.youtube.com/watch?v=hSrW0Iwghtw.
Now let’s start.
Download Hadoop (hadoop-1.1.2-bin.tar.gz) at http://hadoop.apache.org/releases.html#Download and then unpack it.
In conf/hadoop-env.sh, add the line below:
Open System Preferences > Sharing (under Internet & Network). In the list of services, check "Remote Login". For extra security, you can select the radio button for "Only these Users" and choose hadoop.
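A line of this kind usually sets JAVA_HOME so that Hadoop can find Java. The sketch below is only an illustration: /Library/Java/Home is an assumed location that may not match your Mac, and running /usr/libexec/java_home in Terminal prints the actual one.

```shell
# Tell Hadoop where Java is installed.
# Assumption: /Library/Java/Home is the Java home on this Mac; run
# /usr/libexec/java_home to print the real location and adjust if needed.
export JAVA_HOME="${JAVA_HOME:-/Library/Java/Home}"
```

If JAVA_HOME is already set in your environment, the line keeps that value instead of overwriting it.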
The above step to set up remote desktop and self-login was picked up from http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29, which provides detailed instructions to set up Hadoop on Mac.
Go to hadoop directory
Start Hadoop and check whether Hadoop is running
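The two steps above might look roughly like this in Terminal. It is a sketch that assumes the archive was unpacked to ~/hadoop-1.1.2 (adjust HADOOP_HOME otherwise) and that the JDK's jps tool is on your PATH.

```shell
# Assumed unpack location; change it to wherever you extracted the archive.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.1.2}"

if [ -x "$HADOOP_HOME/bin/start-all.sh" ]; then
  cd "$HADOOP_HOME"
  bin/start-all.sh   # start the Hadoop daemons
  jps                # list the running Java processes
else
  echo "Hadoop not found under $HADOOP_HOME" >&2
fi
```

If the daemons came up, the jps output should include Hadoop processes such as NameNode, DataNode, JobTracker and TaskTracker.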
Example 1. to calculate pi
In the code below, the first argument (10) is the number of maps and the second is the number of samples per map. Setting the second argument to a larger value gives a more accurate estimate of pi, but the job takes longer to run.
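A command along these lines runs the pi estimator that ships in the Hadoop examples jar. The value 100 for the second argument and the ~/hadoop-1.1.2 location are assumptions made for illustration.

```shell
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.1.2}"   # assumed unpack location
MAPS=10        # first argument: number of maps
SAMPLES=100    # second argument: samples per map (assumed value)

if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  cd "$HADOOP_HOME"
  bin/hadoop jar hadoop-examples-*.jar pi "$MAPS" "$SAMPLES"
else
  echo "Hadoop not found under $HADOOP_HOME" >&2
fi
```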
Example 2. word count
It first copies files from directory "conf" to "input", and then runs a distributed "grep" to look for the pattern 'dfs[a-z.]+', which matches "dfs" followed by one or more lowercase letters or dots. Change it to 'df[a-z.]+' or 'd[a-z.]+' to get more results. The code below should return a list of matched words and their frequencies.
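Here is a sketch of those steps, modelled on the standalone example in the Hadoop 1.x quick-start. Copying only the XML files from conf, and the ~/hadoop-1.1.2 location, are assumptions. Note that the "output" directory must not exist before the job runs.

```shell
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.1.2}"   # assumed unpack location
PATTERN='dfs[a-z.]+'                               # regular expression to search for

if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  cd "$HADOOP_HOME"
  mkdir -p input
  cp conf/*.xml input    # use the config files as sample input
  bin/hadoop jar hadoop-examples-*.jar grep input output "$PATTERN"
  cat output/*           # matched strings with their counts
else
  echo "Hadoop not found under $HADOOP_HOME" >&2
fi
```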
Congratulations! Now you have set up a Hadoop system. Next we will install RHadoop packages so that we can run R jobs on the Hadoop system.
Download R from http://www.r-project.org/ and install it. I have successfully run R v2.15.1, v2.15.2 and v3.1 on Hadoop.
Download GCC from https://github.com/kennethreitz/osx-gcc-installer and then install it. Without GCC, you will get the error “Make Command Not Found” when installing some R packages from source.
To install Homebrew, the current user account needs to be an administrator, or be granted administrator privileges with “su”. In Terminal on Mac OS X, run the commands below.
Without thrift, the rhbase package will fail to install.
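Once Homebrew is available, installing thrift should come down to a single command. The sketch below assumes brew is already on your PATH and does nothing else if it is not.

```shell
FORMULA=thrift   # Homebrew formula name for Apache Thrift

if command -v brew >/dev/null 2>&1; then
  brew install "$FORMULA"
else
  echo "Homebrew is not installed; install it first, then rerun" >&2
fi
```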
Environment variables for Hadoop can be set with the "export" command in Terminal, or with the R functions below.
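For the Terminal route, the exports might look like the sketch below. HADOOP_CMD and HADOOP_STREAMING are the variables that rhdfs and rmr2 look for. The paths assume Hadoop 1.1.2 unpacked to ~/hadoop-1.1.2, so adjust them to your own setup.

```shell
# Assumed unpack location of Hadoop 1.1.2.
export HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.1.2}"
# Path to the hadoop launcher script.
export HADOOP_CMD="$HADOOP_HOME/bin/hadoop"
# Path to the streaming jar used by rmr2 (location in the 1.1.2 release).
export HADOOP_STREAMING="$HADOOP_HOME/contrib/streaming/hadoop-streaming-1.1.2.jar"
```

Adding these lines to ~/.bash_profile makes them available in every new Terminal session.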
Install some R packages that RHadoop depends on.
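One way to do this from Terminal is sketched below. The package list is an assumption based on what rhdfs and rmr2 commonly need, so check the RHadoop wiki for the current list. The CRAN mirror URL is also just an example.

```shell
# Assumed dependency list; check the RHadoop wiki for the current one.
PKGS='"rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"'

if command -v Rscript >/dev/null 2>&1; then
  # Install the packages from CRAN in a non-interactive R session.
  Rscript -e "install.packages(c($PKGS), repos = 'https://cloud.r-project.org')"
else
  echo "Rscript not found; install R first" >&2
fi
```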
Download packages rhdfs, rhbase and rmr2 from https://github.com/RevolutionAnalytics/RHadoop/wiki and then run the R code below.
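If you prefer Terminal to an R session, the downloaded source packages can also be installed with R CMD INSTALL. The sketch below assumes the tarballs were saved to ~/Downloads and are named like rhdfs_<version>.tar.gz. Both the folder and the naming pattern are assumptions.

```shell
# Assumed folder holding the downloaded tarballs.
DOWNLOAD_DIR="${DOWNLOAD_DIR:-$HOME/Downloads}"

for pkg in rhdfs rhbase rmr2; do
  for tarball in "$DOWNLOAD_DIR"/"$pkg"_*.tar.gz; do
    if [ -f "$tarball" ]; then
      R CMD INSTALL "$tarball"   # build and install from source
    else
      echo "No tarball found for $pkg in $DOWNLOAD_DIR" >&2
    fi
  done
done
```

Set the Hadoop environment variables before running this, since rhdfs looks for HADOOP_CMD.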
To make sure they have been installed successfully, run library() to load the above three packages in R.
Prerequisites and installation instructions for rmr are available at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.
Now you can try running an R job on Hadoop. Below is an example of R MapReduce code for word counting, provided in the presentation “Using R with Hadoop” by Jeffrey Breen at http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/.
After running the above code, you should see a list of words and their counts.
More examples of R jobs on Hadoop with rmr2 can be found at https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md and https://github.com/RevolutionAnalytics/rmr2/archive/master.zip.
All done! Now you have set up your own R Hadoop system (in a single-node mode). Enjoy MapReducing with R.
The above Hadoop system is in a standalone mode on a single node. You might want to try the pseudo-distributed mode. And you might even want to run it in a fully-distributed mode on a cluster of computers.
Instructions on setting up Hadoop in cluster mode are provided at Hadoop: From Single-Node Mode to Cluster Mode.
More details on setting up Hadoop in standalone mode, pseudo-distributed mode and fully-distributed mode can be found at the links below.