Spark R and Python Libraries

Master Guru

It appears Spark does not leverage the full R and Python libraries. I would like to understand why. Any feedback?

1 ACCEPTED SOLUTION

Master Guru

You mean the R functions you can use on SparkR DataFrames? The problem here is that the R functions used on those DataFrames need to be translated into Spark functions, otherwise they would not run in parallel inside the engine. So only a subset is supported.
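
For illustration, here is a minimal SparkR sketch of that distinction (assuming a Spark 2.x session; R's built-in faithful data set is used only as example data). DataFrame functions such as filter and avg are translated into Spark operations and run in parallel on the cluster, while an arbitrary R function such as lm has no Spark translation and can only run on data collected back to the driver.

library(SparkR)
sparkR.session()                                  # assumes Spark 2.x; Spark 1.x uses sparkR.init()

df <- createDataFrame(faithful)                   # distribute R's built-in 'faithful' data set

# These calls are translated into Spark expressions and run in parallel:
long_waits <- filter(df, df$waiting > 70)
head(agg(long_waits, avg_eruptions = avg(long_waits$eruptions)))

# An arbitrary R function such as lm() has no Spark translation, so it can
# only be applied to data collected back into a local R data frame:
lm(eruptions ~ waiting, data = collect(long_waits))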


4 REPLIES


Spark has a PySpark class that acts as a wrapper around Spark's Scala-based libraries. It also provides a REPL interface for the Python interpreter. If you launch PySpark, you will be able to import whatever Python libraries you have installed locally, i.e. Python imports should work. Specifically (from the docs):

PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython.

By default, PySpark requires python to be available on the system PATH and uses it to run programs; an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable in conf/spark-env.sh (or .cmd on Windows).

All of PySpark’s library dependencies, including Py4J, are bundled with PySpark and automatically imported.

Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script automatically adds the bin/pyspark package to the PYTHONPATH.

Master Guru

@Benjamin Leonhardi Please excuse my lack of expertise in Spark. If only a subset of R functions is available due to the translation into Spark functions, what are the alternatives for running the R functions that do not translate into Spark functions?

Master Guru

@Sunile Manjee

It depends. You can run any R function, but only a subset is supported directly on the DataFrame. R functions are normally not parallelized, so to get true parallel aggregations SparkR needs to translate them into Spark code.

- You can always filter first in Spark and then copy your SparkR DataFrame into a normal local R data frame using as.data.frame (see the sketch after this list).

- Other similar tools support the execution of R code on rows/groups of data inside the cluster (groupApply, TableApply, RowApply in other MapReduce frameworks); however, I do not see a way to do that in Spark. It does not seem to distribute an R library to every node, but I might be wrong; others can correct me.

- You always have the option to directly execute R from Scala and then do the grouping yourself, but that will be a lot of effort:

https://cran.r-project.org/web/packages/rscala/
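
A rough sketch of the first option (assuming Spark 2.x SparkR; the data, column names and the loess call below are invented purely for illustration): do the filtering in Spark so it runs in parallel, then pull the small result into a plain R data frame and apply whatever R function you need on the driver.

library(SparkR)
sparkR.session()                                  # assumes Spark 2.x

# Invented example data; in practice this would be a large distributed table
sdf <- createDataFrame(data.frame(group = rep(c("a", "b"), 50),
                                  x = rnorm(100), y = rnorm(100)))

# Filter in Spark first so the heavy lifting happens in parallel...
small_sdf <- filter(sdf, sdf$group == "a")

# ...then copy the (now small) result into a normal local R data frame
local_df <- as.data.frame(small_sdf)

# Any R function works on local_df, but it runs only on the driver, not on the cluster
loess(y ~ x, data = local_df)

This only helps when the filtered or aggregated result is small enough to fit in the driver's memory.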