Created 02-17-2016 09:26 PM
It appears Spark does not leverage the full R and Python libraries. I would like to understand why. Any feedback?
Created 02-17-2016 10:46 PM
Spark has a PySpark API that acts as a wrapper around Spark's Scala-based libraries. It also provides a REPL interface for the Python interpreter. If you launch pyspark, you will be able to import whatever Python libraries you have installed locally, i.e. Python imports should work. Specifically (from the docs):
PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython.
By default, PySpark requires python to be available on the system PATH and uses it to run programs; an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable in conf/spark-env.sh (or .cmd on Windows).
All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported.
Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script automatically adds the bin/pyspark package to the PYTHONPATH.
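For example, pointing PySpark at a specific interpreter and then using a locally installed library inside a job looks roughly like the sketch below. This is a minimal illustration, not from the docs: the interpreter path is a placeholder, and it assumes numpy is installed on the driver and on every worker node.

# in conf/spark-env.sh (path is just an example):
#   export PYSPARK_PYTHON=/opt/anaconda/bin/python

from pyspark import SparkContext
import numpy as np   # any locally installed module; must also exist on all worker nodes

sc = SparkContext(appName="local-python-libs")   # in the pyspark shell, sc already exists
rdd = sc.parallelize([1.0, 4.0, 9.0, 16.0])

# numpy runs inside each Python worker process; Spark only handles the distribution
print(rdd.map(lambda x: float(np.sqrt(x))).collect())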
Created 02-17-2016 11:11 PM
You mean the R functions you can use on SparkR DataFrames? The problem here is that R functions applied to these DataFrames have to be translated into Spark functions, otherwise they would not run in parallel inside the engine. So only a subset is supported.
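Since the original question covers Python as well, here is a small PySpark sketch of the same distinction (the SparkR situation is analogous): a built-in column function such as avg is translated into the engine and runs in parallel, while an arbitrary user function has to be evaluated in an external interpreter process.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import avg, udf
from pyspark.sql.types import DoubleType

sc = SparkContext(appName="translated-vs-not")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, 2.0), (1, 3.0), (2, 5.0)], ["k", "v"])

# avg() is a built-in expression: it is translated into Spark functions and
# aggregates in parallel inside the engine
df.groupBy("k").agg(avg("v")).show()

# an arbitrary function is not translated; Spark evaluates it row by row in a
# separate Python process, which is why only a subset of functions runs "natively"
times_two = udf(lambda x: x * 2.0, DoubleType())
df.select(times_two(df["v"])).show()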
Created 02-18-2016 06:18 PM
@Benjamin Leonhardi Please excuse my lack of expertise in Spark. If only a subset of R functions is available due to the translation into Spark functions, what are the alternatives for running the R functions that do not translate into Spark functions?
Created 02-19-2016 12:32 PM
It depends. You can run any R function, but only a subset is supported directly on the DataFrame. R functions are normally not parallelized, so to get truly parallel aggregations they need to be translated into Spark code.
- You can always filter first in Spark and then copy your SparkR DataFrame into a normal local R data frame using as.data.frame (a PySpark sketch of this pattern follows this list).
- Other similar tools support executing R code on rows/groups of data inside the cluster (groupApply, tableApply, rowApply in other MapReduce frameworks). However, I do not see a way to do that in Spark; it does not seem to distribute an R library to every node. I might be wrong, though, so others can correct me.
- You always have the option to execute R directly from Scala and do the grouping yourself, but that will be a lot of effort.
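For the Python side of the question, the "filter in Spark first, then work locally" pattern from the first bullet looks like the following minimal sketch, where toPandas() plays the role that as.data.frame plays in SparkR and pandas is assumed to be installed on the driver.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="filter-then-collect")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [("a", 1.0), ("b", 250.0), ("c", 3.5)], ["name", "value"])

# do the heavy filtering/aggregation in parallel inside the engine
small = df.filter(df["value"] < 100.0)

# then pull the (hopefully small) result to the driver; from here on any
# local library can be applied, just like on a normal R data frame
local = small.toPandas()
print(local.describe())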