Created 09-29-2016 02:17 PM
Hi all,
I'm not a developer; I'm an admin for a Hadoop platform.
We have installed HDP 2.4.2, which ships with Spark 1.6.1. My questions concern Python and R versioning.
All my servers run CentOS 6.8 with Python 2.6.6, so is it possible to use PySpark?
My developer says he wants Python 2.7.x, though I don't know why. If I need to install Python 2.7 or 3, do I need to install it on every node of the platform, or just on one datanode or the master?
Does SparkR require installing R separately, since it does not ship with Spark?
Thanks.
Created 09-29-2016 05:37 PM
For Python:
I'd recommend installing Anaconda Python 2.7 on all nodes of your cluster. If your developer wants to add Python files/scripts manually, he can use the --py-files argument with spark-submit. As an alternative, you can also reference Python scripts/files from within your PySpark code using addPyFile, e.g. sc.addPyFile("mymodule.py"). Just as an FYI, PySpark will run fine with Python 2.6 installed; you just won't be able to use the more recent packages.
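To illustrate, here is a minimal sketch of both options. The file names (mymodule.py, my_job.py) and the mymodule.transform function are hypothetical placeholders for whatever your developer actually ships:

```python
# Option 1: ship extra Python modules at submit time (hypothetical file names):
#   spark-submit --master yarn-client --py-files mymodule.py my_job.py
#
# Option 2: add the module from inside the PySpark application itself.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("py-files-example")
sc = SparkContext(conf=conf)

# Make mymodule.py available on the executors (path is hypothetical).
sc.addPyFile("mymodule.py")

import mymodule  # import after addPyFile so the workers can resolve it

rdd = sc.parallelize([1, 2, 3, 4])
# mymodule.transform is a placeholder for the developer's own function.
print(rdd.map(mymodule.transform).collect())

sc.stop()
```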
For R:
As @lgeorge mentioned, you will want to install R (and all required packages) on each node of your cluster. Also make sure your JAVA_HOME environment variable is set; then you should be able to launch SparkR.
Created 09-29-2016 03:06 PM
@mayki wogno, regarding your last question: I believe you need to install R separately before using it with Spark/SparkR. There is additional info in our HDP 2.5.0 documentation (SparkR is in tech preview until HDP 2.5); see http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_spark-component-guide/content/ch_spark-r.....