Created on 12-31-2015 03:20 AM - edited on 04-21-2026 06:30 AM by GrazittiAPI
In general, you have the following options when running R on
Hortonworks Data Platform (HDP):
o RHadoop (rmr) - the R program is written in the MapReduce paradigm. MapReduce
is not a vendor-specific API, so any program written with MapReduce is portable
across Hadoop distributions.
o Hadoop Streaming - the R program is written to make use of Hadoop
Streaming, but its structure still aligns with MapReduce. The portability benefit
above still applies.
o RJDBC - this option does not require the R programs to be
written using MapReduce; the code remains 100% native R without any third-party
MapReduce packages.
Here is a tutorial with a video, sample data and R script:
Using RJDBC, the R program can have Hadoop parallelize
pre-processing and filtering: R submits a query to Hive or SparkSQL, which handles the distributed, parallel processing, and then applies its existing R models
as-is, without any changes or use of proprietary APIs. Typically, a data science application involves a
great deal of data preparation, which is usually about 75% of the work. RJDBC allows pushing that
work to Hive to take advantage of distributed computing.
o Spark R - lastly, there is SparkR, a newer
component of Spark. SparkR is an R package that provides a light-weight
frontend to use Apache Spark from R. It has been available since Spark 1.4.1 (current version: 1.5.2).
Additionally, RHive framework delivers the libraries and algorithms of R to data stored in Hadoop by extending Hive’s SQL-like query language (HiveQL) with R-specific functions. Through the RHive functions, you can use HiveQL to apply R statistical models to data in your Hadoop cluster that you have catalogued using Hive.
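A minimal RHive sketch of that workflow (the host and table name are illustrative; function names are per the RHive package):

```r
# RHive: issue HiveQL from R against tables catalogued in Hive.
library(RHive)

rhive.init()                          # locate the Hive/Hadoop environment
rhive.connect(host = "hiveserver")    # host is illustrative

res <- rhive.query(
  "SELECT carrier, COUNT(*) AS n FROM flights GROUP BY carrier")

rhive.close()
```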
Regarding RHadoop, there is more than rmr2. There is also the rhdfs package, which provides an R language API for file management over HDFS stores. Using rhdfs, users can read from HDFS into an R data frame (matrix), and similarly write data from these R matrices back into HDFS storage. The rhbase package likewise provides an R language API, but it handles database management for HBase stores rather than HDFS files.
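A minimal rhdfs sketch of that round trip (HDFS paths are illustrative; function names are per the rhdfs package):

```r
# rhdfs: manage and read/write HDFS files directly from R.
library(rhdfs)
hdfs.init()                                 # initialize the HDFS connection

hdfs.ls("/user/demo")                       # list a directory (path illustrative)

# Write an R matrix out to HDFS as a serialized object.
m <- matrix(1:6, nrow = 2)
out <- hdfs.file("/user/demo/m.bin", "w")
hdfs.write(m, out)
hdfs.close(out)

# Read it back and restore the R object.
inp <- hdfs.file("/user/demo/m.bin", "r")
m2 <- unserialize(hdfs.read(inp))
hdfs.close(inp)
```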