Created 09-29-2016 02:17 PM
Hi all,
I'm not a developer; I'm an admin for a Hadoop platform.
We have installed HDP 2.4.2, which ships with Spark 1.6.1. My questions concern Python and R versioning.
All my servers run CentOS 6.8 with Python 2.6.6, so is it possible to use PySpark?
My developer says he wants Python 2.7.x, though I don't know why. If I need to install Python 2.7 or 3, do I need to install it on every node of the platform, or just on one datanode or the master?
Does SparkR require installing R separately, since it does not ship with Spark?
Thanks.
Created 09-29-2016 05:37 PM
For Python:
I'd recommend installing Anaconda Python 2.7 on all nodes of your cluster. If your developer wants to add Python files/scripts manually, he can use the --py-files argument with spark-submit. As an alternative, you can also reference Python scripts/files from within your PySpark code using addPyFile, e.g. sc.addPyFile("mymodule.py"). Just as an FYI, PySpark will run fine with Python 2.6 installed; you just won't be able to use the more recent packages.
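To illustrate, here is a minimal sketch of both options. The file names (mymodule.py, my_job.py) and the mymodule.transform function are hypothetical placeholders for whatever your developer actually ships:

```python
# Option 1: ship extra Python modules at submit time (hypothetical file names):
#   spark-submit --master yarn-client --py-files mymodule.py my_job.py
#
# Option 2: add the module from inside the PySpark application itself.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("py-files-example")
sc = SparkContext(conf=conf)

# Make mymodule.py available on the executors (path is hypothetical).
sc.addPyFile("mymodule.py")

import mymodule  # import after addPyFile so the workers can resolve it

rdd = sc.parallelize([1, 2, 3, 4])
# mymodule.transform is a placeholder for the developer's own function.
print(rdd.map(mymodule.transform).collect())

sc.stop()
```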
For R:
As @lgeorge mentioned, you will want to install R (and all required packages) on each node of your cluster. Also make sure your JAVA_HOME environment variable is set; then you should be able to launch SparkR.
Created 09-29-2016 03:06 PM
@mayki wogno, regarding your last question: I believe you need to install R separately before using it with Spark/SparkR. There is additional info in our HDP 2.5.0 documentation (SparkR is in tech preview until HDP 2.5); see http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/ch_spark-r.....