
    Building an R Hadoop System

    The information provided in this page might be out-of-date. Please see a newer version at Step-by-Step Guide to Setting Up an R-Hadoop System.
    This page shows how to build an R Hadoop system, presenting the steps I followed to set up my first R Hadoop system in single-node mode on Mac OS X.

    After reading documents and tutorials on MapReduce and Hadoop, and playing with RHadoop for about two weeks, I finally built my first R Hadoop system and successfully ran some R examples on it. Here I'd like to share my experience and the steps to achieve that. Hopefully it will make it easier for R users who are new to Hadoop to try RHadoop. Note that I tried this on Mac only, and some steps might be different on Windows.

    Before going through the complex steps below, let's have a look at what you can get, as motivation to continue. There is a video showing word count MapReduce in R at http://www.youtube.com/watch?v=hSrW0Iwghtw

    Now let’s start.

    1. Install Hadoop

    1.1 Download Hadoop

    Download Hadoop (hadoop-1.1.2-bin.tar.gz) at http://hadoop.apache.org/releases.html#Download and then unpack it.

    1.2 Set JAVA_HOME

    In conf/hadoop-env.sh, add the line below:

    export JAVA_HOME=/Library/Java/Home

    1.3 Set up Remote Login and Enable Self-Login

    In System Preferences > Sharing (under Internet & Network), check "Remote Login" in the list of services. For extra security, you can select the "Only these users" radio button and choose the hadoop user. Then generate an SSH key with an empty passphrase and authorise it for password-less self-login:

    ssh-keygen -t rsa -P ""

    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

    The above steps to set up remote login and self-login were picked up from http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29, which provides detailed instructions for setting up Hadoop on Mac.

    2. Run Hadoop

    2.1 Start Hadoop

    Go to the Hadoop directory and check that the hadoop command runs (it prints usage information):

    cd hadoop-1.1.2

    bin/hadoop

    Start Hadoop, then run jps to check that it is running; you should see Java processes such as NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker listed:

    bin/start-all.sh

    jps

    2.2 Test Hadoop with simple examples

    Example 1. calculating pi

    In the command below, the first argument (10) is the number of maps and the second (100) is the number of samples per map. A more accurate value of pi can be obtained by setting a larger value for the second argument, which in turn takes longer to run.

    bin/hadoop jar hadoop-examples-*.jar pi 10 100
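
    The pi example uses a Monte Carlo method: each map samples points in the unit square and counts those falling inside the inscribed circle. To see what the job computes, here is a minimal plain-R sketch of the same idea (a local illustration only, not the Hadoop implementation):

    ```r
    ## Monte Carlo estimate of pi: the fraction of random points in the
    ## unit square that fall inside the quarter circle approximates pi/4.
    estimate_pi <- function(n) {
      x <- runif(n)
      y <- runif(n)
      4 * mean(x^2 + y^2 <= 1)
    }

    set.seed(1)
    estimate_pi(1e6)
    ```

    Increasing n improves the estimate, just as increasing the number of samples per map does in the Hadoop job.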

    Example 2. distributed grep

    The commands below first copy files from local directory "conf" into HDFS directory "input", and then run a distributed "grep" looking for the pattern 'dfs[a-z.]+', which matches strings starting with "dfs". Change it to 'df[a-z.]+' or 'd[a-z.]+' to get more results. The last command should print a list of matching strings and their counts.

    bin/hadoop fs -put conf input

    bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

    bin/hadoop fs -get output output

    cat output/*

    2.3 Stop Hadoop

    bin/stop-all.sh

    Congratulations! Now you have set up a Hadoop system. Next we will install RHadoop packages so that we can run R jobs on the Hadoop system.

    3. Install R

    Download R from http://www.r-project.org/ and install it. I have successfully run R v2.15.1, v2.15.2 and v3.1 on Hadoop.

    4. Install RHadoop

    4.1 Install GCC

    Download GCC from https://github.com/kennethreitz/osx-gcc-installer and then install it. Without GCC, you will get the error "make: command not found" when installing some R packages from source.

    4.2 Install Homebrew

    To install Homebrew, the current user account needs to be an administrator, or you need to switch to an administrator account with "su". In Terminal on Mac OS X, run the commands below.

    su <administrator_account>

    ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

    brew update

    brew doctor

    4.3 Install packages git, pkg-config and thrift

    Without Thrift, the rhbase package will fail to install.

    brew install git

    brew install pkg-config

    brew install thrift

    4.4 Set environment variables

    Environment variables for Hadoop can be set with the "export" command in Terminal, or with the R functions below.

    Sys.setenv("HADOOP_CMD"="<path>/hadoop-1.1.2/bin/hadoop")

    Sys.setenv("HADOOP_STREAMING"="<path>/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")

    Sys.getenv("HADOOP_CMD")

    4.5 Install dependent R packages

    Install some R packages that RHadoop depends on.

    install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))

    4.6 Install RHadoop packages

    Download packages rhdfs, rhbase and rmr2 from https://github.com/RevolutionAnalytics/RHadoop/wiki and then run the R code below.

    install.packages("<path>/rhdfs_1.0.6.tar.gz", repos = NULL, type="source")

    install.packages("<path>/rhbase_1.2.0.tar.gz", repos = NULL, type="source")

    install.packages("<path>/rmr2_2.2.2.tar.gz", repos = NULL, type="source")

    To make sure they have been installed successfully, load the three packages above with library() in R.
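
    For example (assuming HADOOP_CMD and HADOOP_STREAMING are set as in Section 4.4; note that rhdfs additionally needs hdfs.init() after loading, before HDFS can be used):

    ```r
    library(rhdfs)
    hdfs.init()     # initialise the connection to HDFS (requires HADOOP_CMD)
    library(rhbase)
    library(rmr2)
    ```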

    4.7 Further information

    Prerequisites and installation about rmr are available at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.

    5. Run R jobs on Hadoop

    Now you can try running an R job on Hadoop. Below is an example of R MapReduce code for word counting, provided in the presentation "Using R with Hadoop" by Jeffrey Breen at http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/.

    Sys.setenv("HADOOP_HOME"="<path>/hadoop-1.1.2")
    library(rmr2)

    ## map function: split each line into words and emit a (word, 1) pair per word
    map <- function(k, lines) {
      words.list <- strsplit(lines, '\\s')
      words <- unlist(words.list)
      return( keyval(words, 1) )
    }

    ## reduce function: sum the counts for each word
    reduce <- function(word, counts) {
      keyval(word, sum(counts))
    }

    wordcount <- function(input, output=NULL) {
      mapreduce(input=input, output=output, input.format="text",
                map=map, reduce=reduce)
    }

    ## read text files from folder wordcount/data and
    ## save the result in folder wordcount/out
    hdfs.root <- 'wordcount'
    hdfs.data <- file.path(hdfs.root, 'data')
    hdfs.out <- file.path(hdfs.root, 'out')

    ## submit the job
    out <- wordcount(hdfs.data, hdfs.out)

    ## fetch results from HDFS
    results <- from.dfs(out)
    results.df <- as.data.frame(results, stringsAsFactors=FALSE)
    colnames(results.df) <- c('word', 'count')
    head(results.df)

    After running the above code, you should see a list of words and their counts.
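
    As a sanity check, the same word counting can be done on a small sample in plain R without Hadoop, which is handy for verifying the MapReduce logic (the input lines here are made up):

    ```r
    ## map step: split lines into words; reduce step: sum a count of 1
    ## per occurrence of each word.
    lines <- c("hello hadoop", "hello world")
    words <- unlist(strsplit(lines, "\\s"))              # map: emit words
    counts <- tapply(rep(1, length(words)), words, sum)  # reduce: sum per word
    counts
    ```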

    More examples of R jobs on Hadoop with rmr2 can be found at https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md and https://github.com/RevolutionAnalytics/rmr2/archive/master.zip.

    All done! Now you have set up your own R Hadoop system in single-node mode. Enjoy MapReducing with R.

    6. What's Next

    The above Hadoop system runs in standalone mode on a single node. You might want to try the pseudo-distributed mode, and you might even want to run it in fully-distributed mode on a cluster of computers.

    Instructions on setting up Hadoop in cluster mode are provided at Hadoop: From Single-Node Mode to Cluster Mode.

    More details on setting up Hadoop in standalone mode, pseudo-distributed mode and fully-distributed mode can be found at the links below.