Support Questions

Find answers, ask questions, and share your expertise

Multiple Spark versions on the same cluster

New Contributor

Is there a workaround to install multiple Spark versions on the same cluster for different uses?

 

One of the products I want to use has a compatibility issue with Spark 1.5 and is only compatible with 1.3, so I need to install both versions (1.5 and 1.3). Is there a way to achieve this?

1 ACCEPTED SOLUTION

Rising Star

CM supports a single version of Spark on YARN and a single version for the Standalone installation (a single version is the common requirement).

 

To support multiple versions of Spark, install the extra version manually on a single node and copy the YARN and Hive config files into its conf directory. When you invoke that version's spark-submit, it will distribute the Spark binaries to each YARN node to execute your code. You don't need to install Spark on every YARN node.


12 REPLIES

Rising Star

Yes, YARN provides this flexibility. Here is the detailed answer.

 

For CDH there is a "Spark" service, which is meant for YARN, and a "Spark Standalone" service, which runs its daemons standalone on the specified nodes.

 

YARN will do the work for you if you want to test multiple versions simultaneously. Keep your multiple versions on a Gateway host and launch Spark applications from there.
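For illustration, a minimal sketch of that layout, assuming two unpacked Spark distributions under /opt on the gateway host (the paths, version numbers, and application class below are hypothetical):

# Two self-contained Spark distributions unpacked on the gateway host only, e.g.
#   /opt/spark-1.3.1-bin-hadoop2.6
#   /opt/spark-1.5.2-bin-hadoop2.6

# Launch against the same YARN cluster with whichever version a job needs
$ /opt/spark-1.3.1-bin-hadoop2.6/bin/spark-submit --master yarn-client --class com.example.MyApp myapp.jar
$ /opt/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --master yarn-client --class com.example.MyApp myapp.jar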

New Contributor

Thanks Umesh for your answer.

 

I tried installing Spark Standalone as a service on CDH 5.5.2. I am already running Spark on YARN as a service.

 

I used CM and it installed Spark Standalone as a service, but it installed version 1.5, which is the Spark version bundled with CDH 5.5.2. I don't see any option in CM where I can specify a Spark version to choose 1.3 instead of 1.5.

 

If I am running CDH 5.5.2 with Spark on YARN as a service and want to install Spark Standalone 1.3, can it be done using CM or does it have to be a manual installation? If it can be done using CM, how do I specify the Spark version?

 

Thanks

Ahmed

Rising Star

CM supports a single version of Spark on YARN and a single version for the Standalone installation (a single version is the common requirement).

 

To support multiple versions of Spark, install the extra version manually on a single node and copy the YARN and Hive config files into its conf directory. When you invoke that version's spark-submit, it will distribute the Spark binaries to each YARN node to execute your code. You don't need to install Spark on every YARN node.
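As a rough sketch of that approach (the tarball name, paths, and example class below are assumptions, not part of the original answer):

# On a single node (e.g. a gateway host), unpack the extra Spark version
$ tar -xzf spark-1.3.1-bin-hadoop2.6.tgz -C /opt

# Copy the cluster's YARN and Hive client configs into its conf directory
$ cp /etc/hadoop/conf/*.xml /opt/spark-1.3.1-bin-hadoop2.6/conf/
$ cp /etc/hive/conf/hive-site.xml /opt/spark-1.3.1-bin-hadoop2.6/conf/

# Submitting with this version's spark-submit ships its binaries to the YARN containers,
# so nothing needs to be installed on the other YARN nodes
$ /opt/spark-1.3.1-bin-hadoop2.6/bin/spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar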

New Contributor

Hi,

 

Can we, for example, install Spark 2.0 on just the edge node, or do we have to install Spark 2.0 on at least one cluster node (a DataNode)?

 

Thanks for your reply

 

Regards

 

David

Master Collaborator

There's actually no real notion of 'installing Spark on the cluster'. It's a big JAR file that gets run along with some user code in a YARN container.

 

For example, yesterday I took the vanilla upstream Apache Spark 2.0.0 (+ Hadoop 2.7) binary distribution, unpacked it on one cluster node (and set HADOOP_CONF_DIR), and was able to run the Spark 2.0.0 shell on a CDH 5.8 cluster with no further changes. Not everything works out of the box; anything touching the Hive metastore, for instance, would require a little more tweaking and config. But that's about it for 'installation', at heart.

 

Note this is of course not supported, but it's also something you can try without modifying any existing installation, which of course you would never want to do.
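A minimal sketch of what that looks like in practice (the tarball name and the /etc/hadoop/conf path are assumptions based on a default CDH layout):

# Unpack a vanilla Apache Spark build on one node
$ tar -xzf spark-2.0.0-bin-hadoop2.7.tgz
$ cd spark-2.0.0-bin-hadoop2.7

# Point it at the cluster's existing Hadoop/YARN client configuration
$ export HADOOP_CONF_DIR=/etc/hadoop/conf

# Run the Spark 2.0.0 shell against the existing YARN cluster
$ ./bin/spark-shell --master yarn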

New Contributor

I'm trying to follow what you did.

 

In my case, I tried Apache Spark 2.1.0 (+ Hadoop 2.6) and unpacked it on one cluster node. I am using CDH 5.2 and changed HADOOP_CONF_DIR in spark-env.sh, found in the conf folder. However, I can't make it work. Any idea how I can make it work?
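For reference, a hedged sketch of the kind of spark-env.sh change being described (the /etc/hadoop/conf path assumes CDH's default client-config location; your cluster may differ):

# conf/spark-env.sh inside the unpacked Spark 2.1.0 directory
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# then launch against YARN from that directory
$ ./bin/spark-shell --master yarn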

New Contributor

Hi Titus,

 

Did you get it working?

If yes, please share the procedure.

Thanks,

Raghu

Explorer

Hi guys,

 

Thanks to @Deenar Toraskar, CFA for the following guide; I'm sharing it with all of you. Extracted from:

 

https://www.linkedin.com/pulse/running-spark-2xx-cloudera-hadoop-distro-cdh-deenar-toraskar-cfa

 

The Guide:

 

Spark 2.0 has just been released and has many features that make Spark easier, faster, and smarter. The latest Cloudera Hadoop distribution (CDH 5.8.0) currently ships with Spark 1.6, or you may be running an earlier version of CDH. This post shows how to run Spark 2.0 on your CDH cluster.

 

Since Spark can be run as a YARN application, it is possible to run a Spark version other than the one that comes bundled with the Cloudera distribution. This requires no administrator privileges and no changes to the cluster configuration, and can be done by any user who has permission to run a YARN job on the cluster. A YARN application ships all of its dependencies over to the cluster for each invocation, so you can run multiple Spark versions simultaneously on a YARN cluster. Each version of Spark is self-contained in the user workspace on the edge node, and running a new Spark version will not affect any other jobs running on your cluster.

 

  1. Find the version of CDH and Hadoop running on your cluster using
    $ hadoop version
    Hadoop 2.6.0-cdh5.4.8
  2. Download Spark and extract the sources. Pre-built Spark binaries should work out of the box with most CDH versions, unless there are custom fixes in your CDH build, in which case you can use spark-2.0.0-bin-without-hadoop.tgz.
  3. (Optional) You can also build Spark by opening the distribution directory in the shell and running the following command using the CDH and Hadoop version from step 1
    $ ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
    Note: With Spark 2.0 the default build uses Scala 2.11. If you need to stick to Scala 2.10, pass the -Dscala-2.10 property or run
    $ ./dev/change-scala-version.sh 2.10
    Note that -Phadoop-provided enables the profile to build the assembly without including Hadoop-ecosystem dependencies provided by Cloudera.
  4. Extract the tgz file.
    $ tar -xvzf /path/to/spark-2.0.0-bin-hadoop2.6.tgz
  5. cd into the custom Spark distribution and configure it by copying the configuration from your current Spark version
    $ cp -R /etc/spark/conf/* conf/
    $ cp /etc/hive/conf/hive-site.xml conf/
  6. Change SPARK_HOME to point to the folder with the Spark 2.0 distribution
    $ sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh
  7. Change spark.master to yarn from yarn-client in spark-defaults.conf
    $ sed -i 's/spark.master=yarn-client/spark.master=yarn/' conf/spark-defaults.conf
  8. Delete spark.yarn.jar from spark-defaults.conf
    $ sed -i '/spark.yarn.jar/d' conf/spark-defaults.conf
  9. Finally test your new Spark installation:
    $ ./bin/run-example SparkPi 10 --master yarn
    $ ./bin/spark-shell --master yarn
    $ ./bin/pyspark
  10. Update log4j.properties to suppress annoying warnings. Add the following to conf/log4j.properties:

    $ echo "log4j.logger.org.spark_project.jetty=ERROR" >> conf/log4j.properties

[Optional but highly recommended: set either spark.yarn.archive or spark.yarn.jars.] Demise of assemblies: Spark 2.0 is moving away from using the huge assembly file to a directory full of jars to distribute its dependencies. See SPARK-11157 and https://issues.apache.org/jira/secure/attachment/12767129/no-assemblies.pdf for more information. As a result, spark.yarn.jar is now superseded by spark.yarn.jars or spark.yarn.archive.

 

  • Tell YARN which Spark jars to use: make a Spark YARN archive or copy the Spark jars to HDFS

$ cd $SPARK_HOME
$ zip spark-archive.zip jars/*
$ hadoop fs -copyFromLocal spark-archive.zip
$ echo "spark.yarn.archive=hdfs://nameservice1/user/<yourusername>/spark-archive.zip" >> conf/spark-defaults.conf

OR

  • Set "spark.yarn.jars"

$ cd $SPARK_HOME
$ hadoop fs -mkdir spark-2.0.0-bin-hadoop
$ hadoop fs -copyFromLocal jars/* spark-2.0.0-bin-hadoop
$ echo "spark.yarn.jars=hdfs://nameservice1/user/<yourusername>/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf

 

If you do have access to the local directories of all the nodes in your cluster, you can copy the archive or the Spark jars to a local directory on each of the data nodes using rsync or scp. Just update the URLs from hdfs:// to local:.
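A hedged sketch of that variant, assuming a hypothetical /opt/spark-2.0.0-jars target directory and a nodes.txt file listing the worker hostnames:

# Copy the Spark jars to the same local path on every node
$ for host in $(cat nodes.txt); do rsync -a "$SPARK_HOME/jars/" "$host:/opt/spark-2.0.0-jars/"; done

# Then reference the local copies instead of HDFS
$ echo "spark.yarn.jars=local:/opt/spark-2.0.0-jars/*" >> conf/spark-defaults.conf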

Master Collaborator

No, you shouldn't do this. Spark 2 has been GA for CDH for a while. Use the official Spark 2 CSD.