Created on 02-02-201704:05 PM - edited 08-17-201905:14 AM
SparkR is an R package that provides a lightweight front end for using Apache Spark from R, thus supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment. As of Spark 1.6.2, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. In addition, SparkR supports distributed machine learning through MLlib.
SparkR which was introduced to Spark in version1.4 consists of wrappers over DataFrames and DataFrame-based APIs. In SparkR, the APIs are similar to existing ones in R (or R packages), rather than Python/Java/Scala APIs. The reason is that SparkR is very popular is primarily because it allows users to write spark jobs while staying entirely in the R framework/model.
A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. SparkDataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing local R data frames.
All of the examples on this page use sample data included in R or the Spark distribution and can be run using the ./bin/sparkR shell.
R is very convenient for analytics and users love it. However, scalability is the main issue with R and SparkR is a means to address that.
The key challenge that was addressed in implementing SparkR was having
support for invoking Spark functions on a JVM from R. Spark Driver runs the R Spark Context which passes functions to R-JVM bridge. This is responsible for launching Worker JVMs.
Run SparkR example on HDP
Before you run SparkR, ensure that your cluster meets the following prerequisites:
R must be installed on all nodes.
JAVA_HOME must be set on all nodes.
2. SparkR Example
The following example launches SparkR and then uses R to create a people DataFrame, list part of the DataFrame, and read the DataFrame. (For more information about Spark DataFrames, see "Using the Spark DataFrame API").
$ su spark
$ cd /usr/hdp/126.96.36.199-3485/spark/bin
Output similar to the following displays:
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.2
Spark context is available as sc, SQL context is available as sqlContext
From your R prompt (not the Spark shell), initialize SQLContext, create a DataFrame, and list the first few rows: