
Spark cluster managed by YARN throws java.lang.ClassNotFoundException

I am new to YARN but have some experience with Spark standalone mode. I recently installed a YARN + Spark cluster using Ambari.

I have a Spark program compiled to a jar (program.jar) which relies on another jar (infra.jar) to work.

I set the following configurations in Ambari:

spark.executor.extraClassPath=/root/infra.jar
spark.driver.extraClassPath=/root/infra.jar
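
If I understand the standard spark-submit syntax correctly, the equivalent command-line form would be the following (just for reference; I actually set the values through Ambari so they land in spark-defaults.conf):

$SPARK_HOME/bin/spark-submit --conf spark.executor.extraClassPath=/root/infra.jar --conf spark.driver.extraClassPath=/root/infra.jar --master yarn --class com.MyMainClass /root/program.jar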

I verified that the file exists on all nodes and that the configuration was pushed to $SPARK_HOME/conf/spark-defaults.conf on all nodes.

In fact, I have also copied it to $SPARK_HOME/jars on all nodes.

I run the job using:

$SPARK_HOME/bin/spark-submit --master yarn --class com.MyMainClass /root/program.jar
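
As far as I know, spark-submit also accepts a --verbose flag that prints the effective configuration, which should show whether the extraClassPath values are actually being picked up:

$SPARK_HOME/bin/spark-submit --verbose --master yarn --class com.MyMainClass /root/program.jar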

I have the following environment variables set:

export HADOOP_CONF_DIR=/usr/hdp/2.6.3.0-235/hadoop/conf
export HADOOP_HOME=/usr/hdp/2.6.3.0-235/hadoop

I am getting an error:

Caused by: java.lang.ClassNotFoundException: com.MyPartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The MyPartition class is located in infra.jar, but for some reason it is not found.
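
To check whether the executors can see the class at all, here is a small diagnostic job I could run (just a sketch; the class name com.MyPartition is taken from the stack trace, and ClassCheck and the app name are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

object ClassCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("class-check"))
    // Run a trivial task across several partitions and try to load the
    // suspect class inside each task; "missing" would mean infra.jar is
    // not on the executor classpath.
    val results = sc.parallelize(1 to 100, 10).map { _ =>
      try { Class.forName("com.MyPartition"); "found" }
      catch { case _: ClassNotFoundException => "missing" }
    }.distinct().collect()
    println(results.mkString(", "))
    sc.stop()
  }
}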

Judging by the stack trace, the failure happens in the executor; some of the code does run successfully, which I assume is the driver-side code.

P.S. I also tried adding the jar manually, either via the --jars flag or via addJars; it still fails with a ClassNotFoundException (although, for some weird reason, on a different class from infra.jar).
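
The --jars variant was along these lines (standard spark-submit syntax; --jars is supposed to ship the listed jars to both the driver and the executors):

$SPARK_HOME/bin/spark-submit --master yarn --jars /root/infra.jar --class com.MyMainClass /root/program.jar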