Created on 12-31-2015 03:20 AM - edited 09-16-2022 01:33 AM
In general, you have the following options when running R on Hortonworks Data Platform (HDP):
o RHadoop (rmr) - R programs are written in the MapReduce paradigm. MapReduce is not a vendor-specific API, so any program written with MapReduce is portable across Hadoop distributions. A minimal sketch follows the link below.
https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr
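To make the rmr option concrete, here is a minimal sketch of a map-only rmr2 job that squares a vector of integers. It assumes rmr2 is installed and that the HADOOP_CMD and HADOOP_STREAMING environment variables point at your Hadoop client; it is an illustration, not part of the tutorial linked above.

library(rmr2)

# Write a small vector of integers into HDFS as key/value pairs
small.ints <- to.dfs(1:1000)

# Map-only job: for each value, emit the value and its square
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2))

# Pull the resulting key/value pairs back into the R session
from.dfs(result)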
o Hadoop Streaming - R programs are written to make use of Hadoop Streaming, but the program structure still aligns with MapReduce, so the portability benefit above still applies. A sample mapper is sketched below.
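For illustration, here is a hedged sketch of a word-count mapper for Hadoop Streaming. The streaming jar path in the comment is an assumption based on a typical HDP layout, and the input/output paths are hypothetical.

#!/usr/bin/env Rscript
# mapper.R - reads lines from stdin and emits "LongValueSum:word <tab> 1"
# so that Hadoop's built-in aggregate reducer can sum the counts.
# A typical (hypothetical) invocation on HDP:
#   hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
#     -input /user/r/input -output /user/r/wordcount \
#     -mapper mapper.R -reducer aggregate -file mapper.R
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  for (word in strsplit(tolower(line), "[^a-z]+")[[1]]) {
    if (nchar(word) > 0) cat("LongValueSum:", word, "\t1\n", sep = "")
  }
}
close(con)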
o RJDBC - This approach does not require the R programs to be written using MapReduce; the code remains plain R against the standard database interfaces (DBI/RJDBC), with no vendor-proprietary APIs.
Here is a tutorial with a video, sample data and R script:
http://hortonworks.com/hadoop-tutorial/using-revolution-r-enterprise-tutorial-hortonworks-sandbox/
Using RJDBC, the R program can have Hadoop parallelize pre-processing and filtering: R submits a query to Hive or SparkSQL, which executes it with distributed, parallel processing, and the much smaller result set then feeds existing R models as-is, without changes or proprietary APIs. A data science application typically involves a great deal of data preparation, often cited as around 75% of the work, and RJDBC lets you push that work down to Hive to take advantage of distributed computing. A connection sketch follows.
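Here is a minimal RJDBC sketch. The jar directory, hostname, credentials, and table/column names are assumptions for a Hortonworks sandbox; adjust them to your cluster.

library(RJDBC)

# Assumption: HDP client layout; point classPath at your Hive JDBC
# driver and its dependencies.
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            classPath = list.files("/usr/hdp/current/hive-client/lib",
                                   pattern = "jar$", full.names = TRUE))

# Hypothetical sandbox host and credentials
conn <- dbConnect(drv, "jdbc:hive2://sandbox.hortonworks.com:10000/default",
                  "hive", "hive")

# Filtering and aggregation run in Hive; only the reduced result
# set comes back into R as a data frame. Table/columns are hypothetical.
df <- dbGetQuery(conn, "SELECT col1, AVG(col2) AS avg_col2
                        FROM my_table GROUP BY col1")

dbDisconnect(conn)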
o SparkR - Lastly, SparkR is a newer component of Spark: an R package that provides a light-weight frontend to use Apache Spark from R. It has been available since the Spark 1.4 line (the current version at the time of writing is 1.5.2).
Here are some details on it and the available API:
https://spark.apache.org/docs/latest/sparkr.html
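A minimal SparkR session against the 1.4/1.5-era API might look like the sketch below. The yarn-client master is an assumption for an HDP cluster, and faithful is one of R's built-in datasets.

# Run via bin/sparkR, or load the SparkR package with SPARK_HOME set
library(SparkR)

sc <- sparkR.init(master = "yarn-client", appName = "SparkR-demo")
sqlContext <- sparkRSQL.init(sc)

# Distribute a local data frame and run a distributed filter on it
df <- createDataFrame(sqlContext, faithful)
head(filter(df, df$waiting < 50))

sparkR.stop()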
Created on 03-03-2016 02:49 AM
Also, check out the SparkR section in the A Lap Around Apache Spark tutorial.
Created on 05-18-2016 09:12 PM
@bsaini
Additionally, the RHive framework delivers the libraries and algorithms of R to data stored in Hadoop by extending Hive's SQL-like query language (HiveQL) with R-specific functions. Through the RHive functions, you can use HiveQL to apply R statistical models to data in your Hadoop cluster that you have catalogued using Hive. A minimal sketch follows.
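Here is a minimal RHive sketch, assuming HADOOP_HOME and HIVE_HOME are set; the hostname and table are hypothetical.

library(RHive)

rhive.init()                                  # picks up HADOOP_HOME/HIVE_HOME
rhive.connect(host = "sandbox.hortonworks.com")

# The HiveQL runs on the cluster; results land in an R data frame
res <- rhive.query("SELECT col1, COUNT(*) FROM my_table GROUP BY col1")

rhive.close()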
Regarding RHadoop, there is more to it than rmr2. The rhdfs package provides an R language API for file management over HDFS: users can read from HDFS into an R data frame or matrix, and similarly write such R objects back into HDFS storage, as sketched below. The rhbase package likewise provides an R language API, but it handles database management for HBase stores rather than HDFS files.
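Here is a hedged rhdfs sketch of that round trip. It assumes HADOOP_CMD points at the hadoop binary, and the HDFS path is hypothetical.

library(rhdfs)

hdfs.init()                         # requires HADOOP_CMD to be set

hdfs.ls("/user")                    # list an HDFS directory

# Write an R matrix to HDFS (serialized), then read it back
m <- matrix(rnorm(100), nrow = 10)
out <- hdfs.file("/user/r/matrix.dat", "w")
hdfs.write(m, out)
hdfs.close(out)

con <- hdfs.file("/user/r/matrix.dat", "r")
raw.bytes <- hdfs.read(con)         # returns the serialized raw bytes
hdfs.close(con)
m.restored <- unserialize(raw.bytes)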
Created on 07-14-2016 03:22 PM
I tried those, but got errors trying to run the example programs. More here: https://community.hortonworks.com/content/kbentry/8452/running-r-program-on-hdp.html