Support Questions


Spark / PySpark / SparkR: Question about versioning

Rising Star

Hi all,

I'm not a developer; I'm an admin for a Hadoop platform.

We have installed HDP 2.4.2, which ships with Spark 1.6.1, and my questions concern versioning for Python and R.

All my servers run CentOS 6.8 with Python 2.6.6. Is it possible to use PySpark with that?

My developer says he wants Python 2.7.x, and I don't know why. If I need to install Python 2.7 or 3, does it have to be installed across the whole platform, or just on one datanode or on the master?

SparkR needs R to be installed; it is not shipped with Spark?

Thanks.

1 ACCEPTED SOLUTION


For Python:

I'd recommend installing Anaconda Python 2.7 on all nodes of your cluster. If your developer would like to manually add Python files/scripts, he can use the --py-files argument as part of the spark-submit command. As an alternative, you can also reference Python scripts/files from within your PySpark code using addPyFile, for example sc.addPyFile("mymodule.py"). As an FYI, PySpark will run fine with Python 2.6 installed; you just won't be able to use the more recent packages.
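As a rough sketch of both options (the file names mymodule.py and job.py and the Anaconda path below are placeholders for this example; adjust them for your environment):

```
# Point Spark at the Python interpreter the executors should use
# (placeholder Anaconda path; use wherever Anaconda 2.7 is installed on your nodes).
export PYSPARK_PYTHON=/opt/anaconda2/bin/python

# Option 1: ship extra Python files with the application at submit time.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --py-files mymodule.py \
  job.py

# Option 2: distribute the module from inside the PySpark code instead, e.g. in job.py:
#   sc.addPyFile("mymodule.py")
#   import mymodule
```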

For R:

As @lgeorge mentioned, you will want to install R (and all required packages) on each node of your cluster. Also make sure your JAVA_HOME environment variable is set; then you should be able to launch SparkR.
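A minimal sketch of that setup, assuming CentOS 6.x nodes with the EPEL repository available and the usual HDP client paths (the JAVA_HOME value below is only an example; use the JDK path from your own cluster):

```
# Run on every node of the cluster: install R from EPEL.
sudo yum install -y epel-release   # skip if EPEL is already enabled
sudo yum install -y R

# From a node with the Spark client, make sure JAVA_HOME is set, then launch SparkR.
export JAVA_HOME=/usr/jdk64/jdk1.8.0_60    # example path; match your cluster's JDK
/usr/hdp/current/spark-client/bin/sparkR
```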


2 REPLIES

Super Collaborator

@mayki wogno, regarding your last question: I believe you need to install R separately before using it with Spark/SparkR. There is additional information in our HDP 2.5.0 documentation (SparkR is in tech preview until HDP 2.5); see http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/ch_spark-r.....
