Using RJDBC, the R program can have Hadoop parallelize
pre-processing and filtering. R submits a query to Hive or SparkSQL, which executes it with distributed, parallel processing, and the reduced result is then fed to existing R models
as-is, without any changes or proprietary APIs. In a typical data science application, data
preparation accounts for the bulk of the effort, often around 75% of the work. RJDBC allows pushing that
work down to Hive to take advantage of distributed computing.
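As a rough illustration of this pattern, the sketch below uses RJDBC to connect to HiveServer2, pushes the filtering and aggregation into the Hive query, and fits an ordinary R model on the much smaller result. The driver jar path, JDBC URL, credentials, table, and column names are all placeholders to be adapted to an actual cluster.

  library(RJDBC)

  # Hypothetical driver jar location and HiveServer2 URL -- adjust for your environment.
  drv  <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
               classPath   = "/path/to/hive-jdbc-standalone.jar")
  conn <- dbConnect(drv, "jdbc:hive2://hive-host:10000/default", "user", "password")

  # Push the heavy pre-processing (filtering, aggregation) down to Hive,
  # so only the reduced result set is pulled back into R.
  sales <- dbGetQuery(conn, "
    SELECT region, SUM(amount) AS total, AVG(discount) AS avg_discount
    FROM   sales_fact
    WHERE  sale_year = 2015
    GROUP  BY region")

  dbDisconnect(conn)

  # Fit an unmodified, standard R model on the prepared data.
  fit <- lm(total ~ avg_discount, data = sales)
  summary(fit)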
o SparkR - Lastly, the SparkR interface, a newer
component of Spark. SparkR is an R package that provides a lightweight
frontend for using Apache Spark from R. This component has been available since Spark 1.4.1 (current version 1.5.2).
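A minimal SparkR sketch along the same lines, using the API shipped with the 1.4/1.5 releases mentioned above (sparkR.init and sparkRSQL.init); the master URL and the sample data are assumptions for illustration only.

  library(SparkR)

  # Initialize Spark from R (API as of Spark 1.4/1.5); master URL is a placeholder.
  sc         <- sparkR.init(master = "local[2]", appName = "SparkR-example")
  sqlContext <- sparkRSQL.init(sc)

  # Turn a local R data.frame into a distributed Spark DataFrame.
  df <- createDataFrame(sqlContext, faithful)

  # Filtering and aggregation execute in Spark, not in the local R session.
  long_waits <- filter(df, df$waiting > 70)
  head(summarize(groupBy(long_waits, long_waits$waiting),
                 count = n(long_waits$waiting)))

  sparkR.stop()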