
Running Spark 1.5.0 app on YARN


It was suggested that if I compile my job against Spark 1.5.0 and provide the Spark jars, then it should theoretically work. Has anybody tried to do this?

If you have, could you share your steps and experiences? I am not using Hive, so that should not cause issues.


Also, I am using sbt to compile my program, in case that helps.


Thank You


Re: Running Spark 1.5.0 app on YARN

Master Collaborator
So, what I had in mind from a while ago is just downloading a new assembly and then running:

SPARK_JAR=... spark-shell --master yarn-client ...

or instead:

spark-shell --conf spark.yarn.jar=... --master yarn-client ...

Trying it again just now, I don't think that obviously works. Let me look.

Re: Running Spark 1.5.0 app on YARN

New Contributor

I was facing the same issue attempting to run Spark 1.5.1 on a slightly older version of CDH; in my case I had to make it run on CDH 5.4.2. I followed the steps below and got it running, at least to an acceptable state. The method can be adapted to other versions of CDH as well. Unfortunately, even the latest version of CDH (CDH 5.4.7) still ships Spark 1.3.0. Here are the steps I took:

1. Download the Spark source first.
2. export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
3. Build a Spark distribution with these options: -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.4.2 -DskipTests -Phive -Phive-thriftserver
4. Stop the Spark service on your cluster. You now need to do some jar file surgery.
5. Copy the following three jar files to all machines in your cluster. I used the /tmp folder for this.

scp these three files to /tmp on all machines:

cp dist/lib/spark-1.5.2-SNAPSHOT-yarn-shuffle.jar /tmp
cp dist/lib/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar /tmp
cp dist/lib/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar /tmp
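The copy-to-all-hosts part can be scripted. A minimal sketch, assuming passwordless ssh and a space-separated HOSTS list (both HOSTS and COPY are made-up names, not part of CDH; the copy command is parameterized so you can dry-run it with COPY=echo):

```shell
# distribute_jars FILE... : copy each given file to /tmp on every host
# listed in $HOSTS. COPY defaults to scp; set COPY=echo for a dry run.
distribute_jars() {
    copy_cmd="${COPY:-scp}"
    for host in $HOSTS; do
        for jar in "$@"; do
            $copy_cmd "$jar" "$host:/tmp/"
        done
    done
}

# Usage on the build machine (commented out so nothing is copied here):
# HOSTS="node1 node2 node3" distribute_jars dist/lib/spark-*1.5.2-SNAPSHOT*.jar
```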

6. Repeat the following on all machines in your cluster.

cd /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/jars
sudo cp /tmp/spark-1.5.2-SNAPSHOT-yarn-shuffle.jar .
sudo cp /tmp/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
sudo cp /tmp/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
cd /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/lib
sudo rm spark-assembly.jar
sudo rm spark-assembly-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
sudo ln -s ../../../jars/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
sudo ln -s ../../../jars/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar spark-assembly.jar
sudo rm spark-examples-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
sudo rm spark-examples.jar
sudo ln -s ../../../jars/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
sudo ln -s ../../../jars/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar spark-examples.jar
cd /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-yarn/lib
sudo rm spark-1.3.0-cdh5.4.2-yarn-shuffle.jar
sudo ln -s ../../../jars/spark-1.5.2-SNAPSHOT-yarn-shuffle.jar .
cd /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/examples/lib
sudo rm spark-examples-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
sudo ln -s ../../lib/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
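The jar surgery in step 6 boils down to one idea: drop the new jars into the parcel's jars/ directory and repoint both the versioned and the generic symlinks at them. A minimal sketch that replays just the assembly-jar swap in a scratch directory, so it can be inspected without touching /opt/cloudera (paths are illustrative):

```shell
# Recreate the parcel layout in a scratch directory and perform the
# same symlink swap as step 6, safely outside the real parcel.
PARCEL="$(mktemp -d)/CDH-5.4.2"
ASSEMBLY=spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar
mkdir -p "$PARCEL/jars" "$PARCEL/lib/spark/lib"
touch "$PARCEL/jars/$ASSEMBLY"          # stands in for the real jar
cd "$PARCEL/lib/spark/lib"
# Point both the versioned name and the generic spark-assembly.jar
# at the new jar, exactly as the real steps do.
ln -s "../../../jars/$ASSEMBLY" .
ln -s "../../../jars/$ASSEMBLY" spark-assembly.jar
ls -l                                   # both links resolve to 1.5.2
```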

7. Now start the Spark service. It should run.
8. However, the Spark shell commands have not been replaced yet, so spark-submit, spark-sql, etc. will still fail.
You could replace those from the dist/bin folder, but I decided to leave them as they are and instead changed the
$PATH variable in my ~/.bashrc to put dist/bin before the rest, so that those are picked up first.

Mine looks like:
export PATH=.:$HOME/Developer/spark/dist/bin:$HOME/Applications/apache-maven-3.3.3/bin:$PATH

9. Finally add in .bashrc:
export HADOOP_CONF_DIR=/etc/hadoop/conf

Exit the shell and login again.

spark-submit and the other shell commands will now work on the machines where you made the .bashrc change.
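The PATH trick in step 8 works because the shell resolves commands left to right, so whichever bin directory comes first wins. A quick way to see it with two stub spark-submit scripts (the directories and version strings are made up for the demo):

```shell
# Two fake installs: one standing in for dist/bin (1.5.2) and one for
# the CDH-provided scripts (1.3.0). Each stub just echoes a version.
SCRATCH="$(mktemp -d)"
mkdir -p "$SCRATCH/dist/bin" "$SCRATCH/cdh/bin"
printf '#!/bin/sh\necho 1.5.2\n' > "$SCRATCH/dist/bin/spark-submit"
printf '#!/bin/sh\necho 1.3.0\n' > "$SCRATCH/cdh/bin/spark-submit"
chmod +x "$SCRATCH"/*/bin/spark-submit
# dist/bin is listed first, as in the ~/.bashrc change, so its
# spark-submit shadows the CDH one.
PATH="$SCRATCH/dist/bin:$SCRATCH/cdh/bin:$PATH"
spark-submit    # prints 1.5.2
```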

Re: Running Spark 1.5.0 app on YARN

Master Collaborator
I imagine that works, and this is helpful indeed, but you're
definitely messing with the cluster installation manually this way.
Fine for a test but not OK for prod.

I expect this can be done without all this surgery, since in theory
the YARN app just needs a different assembly, but I also haven't
gotten that working yet myself.