<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: SPARK PYSPARK SPARKR : Question Versionning in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/SPARK-PYSPARK-SPARKR-Question-Versionning/m-p/104054#M66951</link>
    <description>&lt;P&gt;&lt;STRONG&gt;For Python:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I'd recommend installing &lt;A href="https://www.continuum.io/downloads"&gt;Anaconda (Python 2.7)&lt;/A&gt; on all nodes of your cluster. If your developer would like to manually add Python files/scripts, they can use the &lt;A href="http://spark.apache.org/docs/1.6.2/submitting-applications.html"&gt;--py-files&lt;/A&gt; argument of the spark-submit command. Alternatively, you can reference Python scripts/files from within your PySpark code using &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.html"&gt;addPyFile&lt;/A&gt;, e.g. sc.addPyFile("mymodule.py"). As an FYI, PySpark will run fine with Python 2.6 installed, but you won't be able to use more recent packages.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;For R:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;As &lt;A rel="user" href="https://community.cloudera.com/users/36/lgeorge.html" nodeid="36"&gt;@lgeorge&lt;/A&gt; mentioned, you will want to install R (and all required packages) on each node of your cluster. Also make sure your JAVA_HOME environment variable is set; then you should be able to launch SparkR.&lt;/P&gt;</description>
    <pubDate>Fri, 30 Sep 2016 00:37:38 GMT</pubDate>
    <dc:creator>dzaratsian</dc:creator>
    <dc:date>2016-09-30T00:37:38Z</dc:date>
    <item>
      <title>SPARK PYSPARK SPARKR : Question Versionning</title>
      <link>https://community.cloudera.com/t5/Support-Questions/SPARK-PYSPARK-SPARKR-Question-Versionning/m-p/104052#M66949</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I'm not a developer; I'm an admin for a Hadoop platform.&lt;/P&gt;&lt;P&gt;We have installed HDP 2.4.2, which ships with Spark 1.6.1. My questions concern versioning for Python and R.&lt;/P&gt;&lt;P&gt;All my servers run CentOS 6.8 with Python 2.6.6, so is it possible to use PySpark?&lt;/P&gt;&lt;P&gt;My developer says he wants Python 2.7.x; I don't know why. If I need to install Python 2.7 or 3, does it need to be installed on the whole platform, or just on one datanode or the master?&lt;/P&gt;&lt;P&gt;SparkR needs R installed; it is not shipped with Spark?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2016 21:17:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/SPARK-PYSPARK-SPARKR-Question-Versionning/m-p/104052#M66949</guid>
      <dc:creator>maykiwogno</dc:creator>
      <dc:date>2016-09-29T21:17:17Z</dc:date>
    </item>
    <item>
      <title>Re: SPARK PYSPARK SPARKR : Question Versionning</title>
      <link>https://community.cloudera.com/t5/Support-Questions/SPARK-PYSPARK-SPARKR-Question-Versionning/m-p/104053#M66950</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10363/maykiwogno.html" nodeid="10363"&gt;@mayki wogno&lt;/A&gt;, regarding your last question--I believe you need to install R separately before using it with Spark/SparkR. There is additional info in our HDP 2.5.0 documentation (SparkR is in tech preview until HDP 2.5); see &lt;A href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/ch_spark-r.html" target="_blank"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/ch_spark-r.html&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2016 22:06:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/SPARK-PYSPARK-SPARKR-Question-Versionning/m-p/104053#M66950</guid>
      <dc:creator>lgeorge</dc:creator>
      <dc:date>2016-09-29T22:06:50Z</dc:date>
    </item>
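    <!-- Editor's note: the prerequisites described in the reply above (R installed on each node, JAVA_HOME set) can be spot-checked before launching SparkR. A minimal sketch; the host names below are hypothetical placeholders for your cluster nodes:

    ```shell
    # Check that R is installed and JAVA_HOME is set on each cluster node.
    # Replace the host names with your actual datanode/master hosts.
    for host in datanode1 datanode2 master1; do
      echo "== $host =="
      ssh "$host" 'command -v Rscript || echo "R is NOT installed"'
      ssh "$host" 'echo "JAVA_HOME=$JAVA_HOME"'
    done

    # Once R and JAVA_HOME are in place, launch the SparkR shell
    # from a node where Spark is installed:
    #   sparkR
    ```
    -->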
    <item>
      <title>Re: SPARK PYSPARK SPARKR : Question Versionning</title>
      <link>https://community.cloudera.com/t5/Support-Questions/SPARK-PYSPARK-SPARKR-Question-Versionning/m-p/104054#M66951</link>
      <description>&lt;P&gt;&lt;STRONG&gt;For Python:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I'd recommend installing &lt;A href="https://www.continuum.io/downloads"&gt;Anaconda (Python 2.7)&lt;/A&gt; on all nodes of your cluster. If your developer would like to manually add Python files/scripts, they can use the &lt;A href="http://spark.apache.org/docs/1.6.2/submitting-applications.html"&gt;--py-files&lt;/A&gt; argument of the spark-submit command. Alternatively, you can reference Python scripts/files from within your PySpark code using &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.html"&gt;addPyFile&lt;/A&gt;, e.g. sc.addPyFile("mymodule.py"). As an FYI, PySpark will run fine with Python 2.6 installed, but you won't be able to use more recent packages.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;For R:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;As &lt;A rel="user" href="https://community.cloudera.com/users/36/lgeorge.html" nodeid="36"&gt;@lgeorge&lt;/A&gt; mentioned, you will want to install R (and all required packages) on each node of your cluster. Also make sure your JAVA_HOME environment variable is set; then you should be able to launch SparkR.&lt;/P&gt;</description>
      <pubDate>Fri, 30 Sep 2016 00:37:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/SPARK-PYSPARK-SPARKR-Question-Versionning/m-p/104054#M66951</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-09-30T00:37:38Z</dc:date>
    </item>
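    <!-- Editor's note: the two ways of shipping extra Python code mentioned in the reply above (spark-submit --py-files, and sc.addPyFile from driver code) can be sketched as follows. A minimal sketch; mymodule.py and job.py are hypothetical file names, and the commands assume a Spark 1.6 installation as shipped with HDP 2.4.2:

    ```shell
    # Ship an extra Python module to every executor at submit time:
    spark-submit --master yarn \
      --py-files mymodule.py \
      job.py

    # Or, equivalently, from inside the PySpark driver code:
    #   sc.addPyFile("mymodule.py")  # distributes the file to executors
    #   import mymodule              # now importable in tasks as well
    ```
    -->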
  </channel>
</rss>