Explorer

Running Spark 1.5.0 app on YARN

It was suggested that if I compile my job with Spark 1.5.0 and provide the Spark jars, it should theoretically work. Has anybody tried to do this?

If so, could you share your steps and experiences? I am not using Hive, so that should not cause issues.

Also, I am using sbt to compile my program, if that helps.

Thank you

Cloudera Employee

Re: Running Spark 1.5.0 app on YARN

So, what I had in mind from a while ago is just downloading a new assembly from, say,
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-assembly_2.10/1.5....
and then running SPARK_JAR=... spark-shell --master yarn-client ...
Or instead: spark-shell --conf spark.yarn.jar=... --master yarn-client ...
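For concreteness, a sketch of what those two invocations would look like (the jar path below is a placeholder for wherever the downloaded assembly lives, not a path from the post):

# option 1: point the SPARK_JAR environment variable at the new assembly
export SPARK_JAR=/path/to/spark-assembly_2.10-1.5.0.jar
spark-shell --master yarn-client

# option 2: the same thing via the spark.yarn.jar config property
spark-shell --master yarn-client --conf spark.yarn.jar=/path/to/spark-assembly_2.10-1.5.0.jar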


Trying it again just now, I don't think it obviously works. Let me look.


New Contributor

Re: Running Spark 1.5.0 app on YARN

I was facing the same issue while attempting to run Spark 1.5.1 on a slightly older version of CDH. In my case I had to
make it run on CDH 5.4.2. I followed the steps below and got it to run, at least to an acceptable state. The method
can be adapted to suit other versions of CDH as well. Unfortunately, even the latest version of CDH
(CDH 5.4.7) still ships Spark 1.3.0. Here are the steps I took:

1. Download the Spark source first.
2. export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
3. make-distribution.sh -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.4.2 -DskipTests -Phive -Phive-thriftserver
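If the build succeeds, the three jars used in the steps below should end up under dist/lib; a quick check (this assumes the default make-distribution.sh output layout, which the paths later in this post also use):

ls dist/lib/spark-*-yarn-shuffle.jar dist/lib/spark-assembly-*.jar dist/lib/spark-examples-*.jar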
4. Stop the Spark service on your cluster. You now need to do some jar file surgery.
5. Copy the following three jar files from the build output (dist/lib) to all machines in your cluster. I used the /tmp folder for this:
first stage them in /tmp locally with the cp commands below, then scp them to /tmp on every machine (see the sketch after the cp commands).
cp dist/lib/spark-1.5.2-SNAPSHOT-yarn-shuffle.jar /tmp
cp dist/lib/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar /tmp
cp dist/lib/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar /tmp
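
A sketch of that distribution step, assuming passwordless ssh and placeholder hostnames (replace node1 node2 node3 with your own machines):

for host in node1 node2 node3; do
  scp /tmp/spark-1.5.2-SNAPSHOT-yarn-shuffle.jar \
      /tmp/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar \
      /tmp/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar \
      $host:/tmp/
done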

6. Repeat the following on all machines in your cluster.

# drop the new jars into the parcel's shared jars directory
cd /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/jars
sudo cp /tmp/spark-1.5.2-SNAPSHOT-yarn-shuffle.jar .
sudo cp /tmp/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
sudo cp /tmp/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .

# re-point the Spark assembly and examples links at the 1.5.2 jars
cd /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/lib
sudo rm spark-assembly.jar
sudo rm spark-assembly-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
sudo ln -s ../../../jars/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
sudo ln -s ../../../jars/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar spark-assembly.jar
sudo rm spark-examples-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
sudo rm spark-examples.jar
sudo ln -s ../../../jars/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
sudo ln -s ../../../jars/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar spark-examples.jar

# swap the YARN shuffle service jar used by the NodeManagers
cd /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-yarn/lib
sudo rm spark-1.3.0-cdh5.4.2-yarn-shuffle.jar
sudo ln -s ../../../jars/spark-1.5.2-SNAPSHOT-yarn-shuffle.jar .

# fix the examples link under the Spark examples directory
cd /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/examples/lib
sudo rm spark-examples-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
sudo ln -s ../../lib/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar .
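
On each machine, a quick way to confirm the links now point at the 1.5.2 jars (a sanity check, not part of the original steps):

ls -l /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/lib
ls -l /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-yarn/lib | grep yarn-shuffle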

7. Now start the Spark service. It should run.
8. However, the Spark shell commands have not been replaced yet, so spark-submit, spark-sql, etc. will still fail.
You could replace them with the ones from the dist/bin folder, but I decided to leave them as they are and instead
change my $PATH in ~/.bashrc to put dist/bin before the rest, so that the new scripts are picked up first.

Mine looks like:
export PATH=.:$HOME/Developer/spark/dist/bin:$HOME/Applications/apache-maven-3.3.3/bin:$PATH

9. Finally, add this to ~/.bashrc:
export HADOOP_CONF_DIR=/etc/hadoop/conf

Exit the shell and log in again.

spark-submit and the other shell commands will now work on the machines where you made the .bashrc change.
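
As a quick smoke test, something like the following should exercise the new assembly on YARN (a sketch; SparkPi is the standard Spark example class, and the jar path follows the symlinks set up in step 6):

spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi \
  /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/lib/spark-examples.jar 10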

Cloudera Employee

Re: Running Spark 1.5.0 app on YARN

I imagine that works, and this is helpful indeed, but you're
definitely messing with the cluster installation manually this way.
Fine for a test but not OK for prod.

I expect this can be done without all this surgery, since in theory
the YARN app just needs a different assembly, but I also haven't
gotten that working yet myself.
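
Roughly what that surgery-free approach would look like (a sketch only; the HDFS location and username are placeholders, and as noted above this has not been verified to work here):

# upload the freshly built 1.5.x assembly somewhere the cluster can read it
hdfs dfs -put dist/lib/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar /user/myuser/
# then point each application at it instead of the cluster's 1.3.0 assembly
spark-submit --master yarn-client \
  --conf spark.yarn.jar=hdfs:///user/myuser/spark-assembly-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar \
  --class org.apache.spark.examples.SparkPi \
  dist/lib/spark-examples-1.5.2-SNAPSHOT-hadoop2.6.0-cdh5.4.2.jar 10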