SparkR is an R package that provides a lightweight front end for using Apache Spark from R, thus supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment. As of Spark 1.6.2, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. In addition, SparkR supports distributed machine learning through MLlib.
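To illustrate the MLlib integration, here is a minimal sketch, assuming a SparkR 1.6 session in which sqlContext is already available (such as the sparkR shell shown later in this article). It fits a Gaussian generalized linear model on a distributed DataFrame built from R's faithful dataset:

# Assumes an active SparkR 1.6 session with sqlContext available
df <- createDataFrame(sqlContext, faithful)
# Fit a Gaussian GLM (linear regression) over the distributed data
model <- glm(waiting ~ eruptions, data = df, family = "gaussian")
# Inspect the fitted coefficients
summary(model)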
SparkR, which was introduced to Spark in version 1.4, consists of wrappers over DataFrames and DataFrame-based APIs. In SparkR, the APIs are similar to existing APIs in R (or R packages), rather than to the Python/Java/Scala APIs. SparkR is popular primarily because it allows users to write Spark jobs while staying entirely within the R framework/model.
A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. SparkDataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing local R data frames.
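As a minimal sketch of one of those sources, the following builds a SparkDataFrame from an existing local R data frame; the column names and values are illustrative only, and an active session with sqlContext is assumed:

# Local R data frame (hypothetical sample values)
localDF <- data.frame(name = c("Michael", "Andy", "Justin"),
                      age  = c(NA, 30, 19))
# Convert it into a distributed SparkDataFrame
df <- createDataFrame(sqlContext, localDF)
printSchema(df)   # shows the schema inferred from the R column types
head(df)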
All of the examples on this page use sample data included in R or the Spark distribution and can be run using the ./bin/sparkR shell.
R is very convenient for analytics, and users love it. However, scalability is R's main limitation, and SparkR is a means to address that.
The key challenge addressed in implementing SparkR was supporting the invocation of Spark functions on a JVM from R. The Spark driver runs the R SparkContext, which passes functions to the R-JVM bridge; the bridge is responsible for launching the worker JVMs.
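For readers who prefer not to use the ./bin/sparkR shell, the following sketch shows how a standalone R session can start that bridge itself: sparkR.init() launches the backend JVM and returns the R SparkContext. The master value "yarn-client" and the application name are assumptions for a typical HDP cluster:

# Assumes the SparkR package from the Spark distribution is on the R library path
library(SparkR)
# Start the backend JVM and the R-JVM bridge; returns the R SparkContext
sc <- sparkR.init(master = "yarn-client", appName = "SparkR-example")
# Create a SQLContext for working with DataFrames
sqlContext <- sparkRSQL.init(sc)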
Before you run SparkR, ensure that your cluster meets the following prerequisites:
JAVA_HOME must be set on all nodes.

The following example launches SparkR and then uses R to create a people DataFrame, list part of the DataFrame, and read the DataFrame. (For more information about Spark DataFrames, see "Using the Spark DataFrame API".)
$ su spark
$ cd /usr/hdp/2.5.0.0-3485/spark/bin
$ ./sparkR
Output similar to the following displays:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Spark context is available as sc, SQL context is available as sqlContext
>
sparkR> sqlContext <- sparkRSQL.init(sc)
sparkR> df <- createDataFrame(sqlContext, faithful)
sparkR> head(df)
Output similar to the following displays:
...
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
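The same df DataFrame can be used to demonstrate the selection, filtering, and aggregation operations mentioned earlier; the following lines are a sketch based on the standard SparkR 1.6 DataFrame functions:

head(select(df, df$eruptions))        # project a single column
head(filter(df, df$waiting < 50))     # keep only rows with waiting < 50
# Count rows per distinct waiting time, then sort by the counts
waitCounts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waitCounts, desc(waitCounts$count)))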
Next, read the people DataFrame from a JSON file and list its first rows:

sparkR> people <- read.df(sqlContext, "people.json", "json")
sparkR> head(people)

Output similar to the following displays:

  age    name
1  NA Michael
2  30    Andy
3  19  Justin
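A SparkDataFrame can also be queried with SQL once it is registered as a temporary table; the following sketch uses the SparkR 1.6 functions registerTempTable() and sql() against the people DataFrame created above:

# Register the DataFrame as a temporary table for SQL queries
registerTempTable(people, "people")
# Run a SQL query through the same sqlContext
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)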