
In general, you have the following options when running R on Hortonworks Data Platform (HDP):

o RHadoop (rmr) - The R program is written in the MapReduce paradigm. MapReduce is not a vendor-specific API, so any program written with MapReduce is portable across Hadoop distributions.

https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr
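As a rough illustration of the MapReduce style in R, here is a minimal sketch using the rmr2 package. It assumes rmr2 is installed; the "local" backend setting and the input vector are choices made for this example so that it can run without a cluster.

```r
library(rmr2)

# Assumption: use the local backend so the sketch runs without a Hadoop cluster.
# On a real HDP cluster you would keep the default "hadoop" backend instead.
rmr.options(backend = "local")

# Write a small made-up vector to the (local) DFS.
ints <- to.dfs(1:10)

# Map-only job: emit each input value as the key and its square as the value.
squares <- mapreduce(
  input = ints,
  map = function(k, v) keyval(v, v^2)
)

# Read the result back into the R session as a key/value structure.
out <- from.dfs(squares)
print(out)
```

The same script should be submittable to a cluster by dropping the rmr.options() line, which is where the portability across distributions comes from.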

o Hadoop Streaming - The R program is written to use Hadoop Streaming, but its structure still follows MapReduce, so the portability benefit above still applies.
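With Hadoop Streaming, the mapper and reducer are plain scripts that read lines from standard input and write tab-separated key/value pairs to standard output. A minimal word-count mapper might look like the sketch below; the file name mapper.R and the helper map_line are hypothetical names chosen for this example.

```r
#!/usr/bin/env Rscript
# mapper.R (hypothetical file name): emits "<word>\t1" for every word on stdin.

# Split one input line into words and format them as tab-separated
# key/value pairs. Returns character(0) for blank lines.
map_line <- function(line) {
  words <- strsplit(line, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Hadoop Streaming feeds the input split to the script on stdin, line by line.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  out <- map_line(line)
  if (length(out) > 0) writeLines(out)
}
close(con)
```

A script like this would be passed to the Hadoop Streaming jar through its -mapper option, with a matching reducer script supplied through -reducer.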

o RJDBC - This approach does not require R programs to be written using MapReduce. The code remains 100% native R, with no packages beyond the RJDBC connector itself.

Here is a tutorial with a video, sample data and R script:

http://hortonworks.com/hadoop-tutorial/using-revolution-r-enterprise-tutorial-hortonworks-sandbox/

Using RJDBC, the R program can have Hadoop parallelize pre-processing and filtering. R submits a query to Hive or SparkSQL, which runs it with distributed, parallel processing, and the result then feeds existing R models as is, without any changes or proprietary APIs. Data science applications typically involve a lot of data preparation, which is usually about 75% of the work. RJDBC allows pushing that work to Hive to take advantage of distributed computing.
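A minimal sketch of that pattern follows. The driver jar path, host, credentials, and the orders table with its columns are all placeholders invented for this example; depending on the installation, the Hive JDBC driver may also need additional Hadoop jars on the class path.

```r
library(RJDBC)

# Assumptions: the driver jar path, host, port, credentials, table name and
# column names below are placeholders, not values from the article.
drv <- JDBC(
  driverClass = "org.apache.hive.jdbc.HiveDriver",
  classPath   = "/path/to/hive-jdbc-standalone.jar"
)
conn <- dbConnect(drv, "jdbc:hive2://hive-host:10000/default", "user", "password")

# Hive does the filtering and aggregation in parallel on the cluster;
# only the reduced result set comes back to R as a data frame.
prepared <- dbGetQuery(conn, "
  SELECT customer_id, AVG(amount) AS avg_amount, COUNT(*) AS n_orders
  FROM orders
  WHERE order_year = 2015
  GROUP BY customer_id
")

dbDisconnect(conn)

# From here on it is ordinary R: an existing model runs unchanged.
fit <- lm(avg_amount ~ n_orders, data = prepared)
summary(fit)
```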

o SparkR - Lastly, there is the SparkR interface, a newer component in Spark. SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. It has been available since Spark 1.4 (the current version is 1.5.2).

Here are some details on it -

https://spark.apache.org/docs/latest/sparkr.html

And the available API -

https://spark.apache.org/docs/latest/api/R/
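For a feel of the API, here is a short sketch in the Spark 1.4/1.5 style, assuming SparkR is on the library path and a Spark installation is available. Spark 2.0 and later replace the two init calls with sparkR.session(), so check the documentation for your version.

```r
library(SparkR)

# Spark 1.4/1.5-style initialization (an assumption based on the versions
# mentioned above; Spark 2.0 and later use sparkR.session() instead).
sc <- sparkR.init(appName = "SparkR-sketch")
sqlContext <- sparkRSQL.init(sc)

# Turn R's built-in 'faithful' data set into a distributed DataFrame.
df <- createDataFrame(sqlContext, faithful)

# Filter on the Spark side, then bring a few rows back to R.
short_waits <- filter(df, df$waiting < 50)
head(short_waits)

# Group and count on the Spark side, most frequent waiting times first.
counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(counts, desc(counts$count)))

sparkR.stop()
```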

Comments

Also, check out the SparkR section in the "A Lap Around Apache Spark" tutorial.


@bsaini

Additionally, the RHive framework delivers the libraries and algorithms of R to data stored in Hadoop by extending Hive's SQL-like query language (HiveQL) with R-specific functions. Through the RHive functions, you can use HiveQL to apply R statistical models to data in your Hadoop cluster that you have catalogued using Hive.

Regarding RHadoop, there is more to it than rmr2. There is also the rhdfs package, which provides an R API for file management over HDFS. Using rhdfs, users can read from HDFS into an R data frame (or matrix) and write data from R back into HDFS. The rhbase package also provides an R API, but its purpose is database management for HBase stores rather than HDFS files.
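As a rough sketch of the rhdfs side, the snippet below lists a directory and round-trips a file through HDFS. It assumes rhdfs is installed on a node with a Hadoop client; the HADOOP_CMD value and all file paths are placeholders.

```r
# Assumption: rhdfs needs the HADOOP_CMD environment variable to point at the
# hadoop executable; the path below is a placeholder for your installation.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")

library(rhdfs)
hdfs.init()

# List a directory in HDFS.
hdfs.ls("/tmp")

# Copy a local file into HDFS and back again (hypothetical paths),
# then load the copy into an R data frame.
write.csv(mtcars, "mtcars.csv")
hdfs.put("mtcars.csv", "/tmp/mtcars.csv")
hdfs.get("/tmp/mtcars.csv", "mtcars_copy.csv")
df <- read.csv("mtcars_copy.csv")
```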


I tried those, but got errors trying to run the example programs. More here: https://community.hortonworks.com/content/kbentry/8452/running-r-program-on-hdp.html