SparkR is an R package that provides a lightweight front end for using Apache Spark from R, thus supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment. As of Spark 1.6.2, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. In addition, SparkR supports distributed machine learning through MLlib.
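For a flavor of these capabilities, here is a minimal sketch (assuming a running sparkR shell on Spark 1.6.x, where sc and sqlContext are predefined, and using R's built-in faithful dataset) of selection, filtering, aggregation, and an MLlib-backed GLM:

    # Build a distributed DataFrame from a local R data frame
    df <- createDataFrame(sqlContext, faithful)

    # Selection and filtering
    head(select(filter(df, df$waiting > 70), "eruptions"))

    # Aggregation: count observations per waiting time
    waitCounts <- agg(groupBy(df, df$waiting), count = n(df$waiting))
    head(waitCounts)

    # Distributed machine learning via MLlib: fit a Gaussian GLM
    model <- glm(waiting ~ eruptions, data = df, family = "gaussian")
    summary(model)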

Architecture

SparkR, which was introduced in Spark 1.4, consists of wrappers over DataFrames and DataFrame-based APIs. In SparkR, the APIs are similar to existing R APIs (and those of popular R packages) rather than to the Python/Java/Scala APIs. SparkR is popular primarily because it lets users write Spark jobs while staying entirely within the R framework/model.

A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. SparkDataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing local R data frames.
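The hedged sketch below shows one Spark 1.6 SparkR call per source type; the file path and Hive table name are hypothetical placeholders:

    # From an existing local R data frame
    df1 <- createDataFrame(sqlContext, mtcars)

    # From a structured data file (path is a placeholder)
    df2 <- read.df(sqlContext, "examples/src/main/resources/people.json", "json")

    # From a Hive table (requires a Hive-enabled context; table name is hypothetical)
    hiveContext <- sparkRHive.init(sc)
    df3 <- sql(hiveContext, "SELECT * FROM my_hive_table")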

All of the examples on this page use sample data included in R or the Spark distribution and can be run using the ./bin/sparkR shell.

R is very convenient for analytics, and users love it. However, scalability is R's main limitation, and SparkR is a means to address it.

The key challenge addressed in implementing SparkR was supporting the invocation of Spark functions on a JVM from R. The Spark driver runs the R SparkContext, which passes functions through an R-JVM bridge; the bridge is responsible for launching the worker JVMs.

[Figure: SparkR architecture diagram]
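As a rough illustration of that bridge (this relies on an internal, non-exported SparkR helper, not public API, so treat it as an assumption about implementation detail), the call below invokes a method on the JVM object that sc references, directly from R:

    # sc is an R handle (jobj) pointing at the JVM-side JavaSparkContext;
    # SparkR's internal callJMethod routes the call over the R-JVM bridge.
    SparkR:::callJMethod(sc, "version")  # returns the Spark version string, e.g. "1.6.2"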

Run SparkR example on HDP

1. Prerequisites

Before you run SparkR, ensure that your cluster meets the following prerequisites:

  • R must be installed on all nodes.
  • JAVA_HOME must be set on all nodes.

2. SparkR Example

The following example launches SparkR, then uses R to create a DataFrame from the built-in faithful dataset, list its first few rows, and read a separate people DataFrame from a JSON file. (For more information about Spark DataFrames, see "Using the Spark DataFrame API".) A short SQL query sketch follows the steps.

  1. Launch SparkR:
    $ su spark
    $ cd /usr/hdp/2.5.0.0-3485/spark/bin
    $ ./sparkR

    Output similar to the following displays:

    Welcome to
        ____              __ 
       / __/__  ___ _____/ /__ 
      _\ \/ _ \/ _ `/ __/  '_/ 
     /___/ .__/\_,_/_/ /_/\_\   version  1.6.2
        /_/ 
    
    Spark context is available as sc, SQL context is available as sqlContext
    >
  2. From your R prompt (not the Spark shell), initialize SQLContext, create a DataFrame, and list the first few rows:
    sparkR> sqlContext <- sparkRSQL.init(sc)
    sparkR> df <- createDataFrame(sqlContext, faithful)
    sparkR> head(df)

    Output similar to the following displays:

    ...
     eruptions waiting
    1     3.600      79
    2     1.800      54
    3     3.333      74
    4     2.283      62
    5     4.533      85
    6     2.883      55
  3. Read the people DataFrame from a JSON file:
    sparkR> people <- read.df(sqlContext, "people.json", "json")
  4. List the first few rows of the people DataFrame:
    sparkR> head(people)

    Output similar to the following displays:

     age    name
    1  NA Michael
    2  30    Andy
    3  19  Justin
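As a follow-up, here is a minimal sketch (assuming the people DataFrame from the steps above is still in scope) that registers the DataFrame as a temporary table and queries it with SQL:

    # Register the DataFrame as a temporary table, then query it with SQL
    registerTempTable(people, "people")
    teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
    head(teenagers)   # lists the names of people aged 13-19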
