04-18-2016 02:29 PM
Is there a workaround to install multiple Spark versions on the same cluster for different uses?
One of the products I want to use has a compatibility issue with Spark 1.5 and is only compatible with 1.3, so I need to install both 1.5 and 1.3. Is there a way to achieve this?
04-19-2016 04:31 AM
Yes, YARN provides this flexibility. Here you can find the detailed answer.
For CDH there is a "Spark" service, which is meant for YARN, and a separate "Spark Standalone" service, which runs its daemons standalone on the specified nodes.
YARN will do the work for you if you want to test multiple versions simultaneously. Keep the multiple versions on a gateway host and launch your Spark applications from there.
04-19-2016 06:15 AM
Thanks Umesh for your answer.
I tried installing Spark Standalone as a service on CDH 5.5.2. I am already running Spark on YARN as a service.
I used CM, and it installed Spark Standalone as a service, but it installed version 1.5, which is the Spark version that ships with CDH 5.5.2. I don't see any option in CM to specify a Spark version and choose 1.3 instead of 1.5.
If I am running CDH 5.5.2 with Spark on YARN as a service and I want to install Spark Standalone version 1.3, can it be done using CM, or does it have to be a manual installation? If it can be done using CM, how do I specify the Spark version?
04-20-2016 03:29 AM
CM supports a single version of Spark on YARN and a single version for a Standalone installation (a single version is the common requirement).
To support multiple versions of Spark, you need to install the additional version manually on a single node and copy the config files for YARN and Hive into its conf directory. When you invoke that version's spark-submit, it will distribute the Spark core binaries to each YARN node to execute your code. You don't need to install Spark on each YARN node.
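A minimal sketch of the config-copy step described above, assuming the client configs live in /etc/hadoop/conf and /etc/hive/conf (common on CDH gateway hosts, but check yours) and that the Spark tarball is already unpacked. The function name `copy_cluster_conf` and every path here are illustrative, not something CM or Spark provides.

```shell
# Copy the YARN/HDFS and Hive client configs into the conf directory of a
# manually unpacked Spark. All paths are assumptions; adjust to your host.
copy_cluster_conf() {
    spark_home="$1"                        # e.g. /opt/spark-1.3.1-bin-hadoop2.6 (hypothetical)
    hadoop_conf="${2:-/etc/hadoop/conf}"   # assumed location of core-site.xml, yarn-site.xml, ...
    hive_conf="${3:-/etc/hive/conf}"       # assumed location of hive-site.xml

    mkdir -p "$spark_home/conf" || return 1
    # Copy only the files that actually exist on this host.
    for f in core-site.xml hdfs-site.xml yarn-site.xml; do
        if [ -f "$hadoop_conf/$f" ]; then
            cp "$hadoop_conf/$f" "$spark_home/conf/" || return 1
        fi
    done
    if [ -f "$hive_conf/hive-site.xml" ]; then
        cp "$hive_conf/hive-site.xml" "$spark_home/conf/" || return 1
    fi
    return 0
}
```

After `copy_cluster_conf /opt/spark-1.3.1-bin-hadoop2.6`, jobs would be launched with that directory's own bin/spark-submit (for Spark 1.x, typically with `--master yarn-client` or `--master yarn-cluster`). In YARN mode spark-submit generally also expects HADOOP_CONF_DIR or YARN_CONF_DIR to be set, so exporting one of them is likely still needed.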
08-31-2016 12:32 AM
Can we, for example, install Spark 2.0 on just the edge node? Or do we have to install Spark 2.0 on at least one cluster node (a datanode?)
Thanks for your reply
08-31-2016 01:22 AM
There's actually not a notion of 'installing Spark on the cluster', really. It's a big JAR file that gets run along with some user code in a YARN container.
For example, yesterday I took the vanilla upstream Apache Spark 2.0.0 (+ Hadoop 2.7) binary distribution, unpacked it on one cluster node, set HADOOP_CONF_DIR, and was able to run the Spark 2.0.0 shell on a CDH 5.8 cluster with no further changes. Not everything works out of the box; anything touching the Hive metastore, for instance, would require a little more tweaking and config. But that's about it for 'installation', at heart.
Note this is of course not supported, but it's also something you can try without modifying any existing installation, which of course you would never want to do.
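Roughly, the steps described in this post could look like the sketch below. It assumes the tarball was unpacked under the user's home directory and that the cluster's client configuration is in /etc/hadoop/conf; both paths and the helper name `use_unpacked_spark` are assumptions for illustration.

```shell
# Point an unpacked upstream Spark at the cluster's Hadoop/YARN client config.
# The SPARK_HOME and HADOOP_CONF_DIR values are assumptions for illustration.
use_unpacked_spark() {
    SPARK_HOME="$1"                            # e.g. "$HOME/spark-2.0.0-bin-hadoop2.7"
    HADOOP_CONF_DIR="${2:-/etc/hadoop/conf}"   # assumed client config location
    export SPARK_HOME HADOOP_CONF_DIR
}

use_unpacked_spark "$HOME/spark-2.0.0-bin-hadoop2.7"

# Launch the shell on YARN only if the distribution is actually present.
if [ -x "$SPARK_HOME/bin/spark-shell" ]; then
    "$SPARK_HOME/bin/spark-shell" --master yarn
else
    echo "spark-shell not found under $SPARK_HOME; unpack the tarball first" >&2
fi
```

Nothing here touches the CDH-managed Spark; the exported variables only affect the current shell session.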
09-15-2016 09:18 AM
I'm trying to follow what you did.
In my case, I tried Apache Spark 2.1.0 (+ Hadoop 2.6) and unpacked it on one cluster node. I am using CDH 5.2 and set HADOOP_CONF_DIR in spark-env.sh, found in the conf folder. However, I can't make it work. Any idea how I can make it work?
02-16-2017 06:20 AM
Thanks to @Deenar Toraskar, CFA for the following guide, which I'm sharing with all of you. Extracted from:
Spark 2.0 has just been released and has many features that make Spark easier, faster, and smarter. The latest Cloudera Hadoop distribution (CDH 5.8.0) currently ships with Spark 1.6, or you may be running an earlier version of CDH. This post shows how to run Spark 2.0 on your CDH cluster.
Since Spark can be run as a YARN application, it is possible to run a Spark version other than the one that comes bundled with the Cloudera distribution. This requires no administrator privileges and no changes to the cluster configuration, and can be done by any user who has permission to run a YARN job on the cluster. A YARN application ships all its dependencies over to the cluster for each invocation. You can run multiple Spark versions simultaneously on a YARN cluster. Each version of Spark is self-contained in the user workspace on the edge node. Running a new Spark version will not affect any other jobs running on your cluster.
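As a rough illustration of keeping several self-contained versions side by side in a user workspace, here is a small wrapper sketch. The directory layout ($HOME/spark/spark-<version>), the SPARK_BASE variable, and the function names are hypothetical conventions, not part of Spark or CDH.

```shell
# Resolve the install directory for a given Spark version.
# Layout is a hypothetical convention: $SPARK_BASE/spark-<version>.
spark_home_for() {
    echo "${SPARK_BASE:-$HOME/spark}/spark-$1"
}

# Run spark-submit from the chosen version; remaining arguments pass through.
submit_with() {
    version="$1"; shift
    sh_dir="$(spark_home_for "$version")"
    if [ ! -x "$sh_dir/bin/spark-submit" ]; then
        echo "no Spark $version install at $sh_dir" >&2
        return 1
    fi
    # SPARK_HOME is set only for this one invocation.
    SPARK_HOME="$sh_dir" "$sh_dir/bin/spark-submit" "$@"
}
```

With this in place, `submit_with 1.3.1 --master yarn-client app.jar` and `submit_with 2.0.0 --master yarn app.jar` (where app.jar is a placeholder) would each use their own binaries without interfering with one another.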
echo "log4j.logger.org.spark_project.jetty=ERROR" >> conf/log4j.properties
[Optional but highly recommended steps - set either spark.yarn.archive or spark.yarn.jars]. Demise of assemblies. Spark 2.0 is moving away from using the huge assembly file to a directory full of jars to distribute its dependencies. See SPARK-11157 and https://issues.apache.org/jira/secure/attachment/12767129/no-assemblies.pdf for more information. As a result, spark.yarn.jar is now superseded by spark.yarn.jars or spark.yarn.archive.
$ cd $SPARK_HOME
# -j stores the jars at the root of the archive, which spark.yarn.archive expects
$ zip -j spark-archive.zip jars/*
$ hadoop fs -copyFromLocal spark-archive.zip
$ echo "spark.yarn.archive=hdfs://nameservice1/user/<yourusername>/spark-archive.zip" >> conf/spark-defaults.conf
$ cd $SPARK_HOME
$ hadoop fs -mkdir spark-2.0.0-bin-hadoop
$ hadoop fs -copyFromLocal jars/* spark-2.0.0-bin-hadoop
$ echo "spark.yarn.jars=hdfs://nameservice1/user/<yourusername>/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf
If you do have access to the local directories of all the nodes in your cluster you can copy the archive or spark jars to the local directory of each of the data nodes using rsync or scp. Just update the URLs from hdfs:/ to local:
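One possible way to make that switch is sketched below. It assumes the jars have already been copied to the same local directory on every node; the function name and the example directory are hypothetical. As far as I know, Spark rejects a local: URI for spark.yarn.archive, so this sketch only rewrites spark.yarn.jars.

```shell
# Rewrite spark.yarn.jars in spark-defaults.conf from an hdfs: URL to a local: path.
# Assumes the jars already exist in the same local directory on every node.
use_local_jars() {
    conf_file="$1"    # e.g. "$SPARK_HOME/conf/spark-defaults.conf"
    local_dir="$2"    # e.g. /opt/spark-2.0.0-jars (hypothetical path on every node)
    tmp_file="$conf_file.tmp"
    # Replace only the spark.yarn.jars line; other settings are copied through unchanged.
    sed "s|^spark\.yarn\.jars=hdfs:.*|spark.yarn.jars=local:$local_dir/*|" "$conf_file" > "$tmp_file" &&
        mv "$tmp_file" "$conf_file"
}
```

For example, `use_local_jars "$SPARK_HOME/conf/spark-defaults.conf" /opt/spark-2.0.0-jars` would leave a line reading `spark.yarn.jars=local:/opt/spark-2.0.0-jars/*`.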