What dependencies are needed to submit Spark jobs programmatically (not via spark-submit)?

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf();
conf.setAppName("Test");
conf.setMaster("yarn-cluster"); // or with "yarn-client"
JavaSparkContext context = new JavaSparkContext(conf);

I am trying to run the code above against CDH 5.2 and CDH 5.3 and receive the errors below (I have tried both "yarn-client" and "yarn-cluster"). I noticed that the Cloudera Maven repo doesn't seem to have a stable (non-SNAPSHOT) spark-yarn [1] or yarn-parent [2] artifact for CDH 5.2 and 5.3. Is this supported? Any tips?

 

[1] https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-yarn_2.10/

[2] https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/yarn-parent_2.10/

 

// with yarn-client

java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/conf/YarnConfiguration
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:171)
	at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:102)
	at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:101)
	at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
	at com.bretthoerner.TestSpark.testSpark(TestSpark.java:111)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runners.Suite.runChild(Suite.java:127)
	at org.junit.runners.Suite.runChild(Suite.java:26)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.conf.YarnConfiguration
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	... 44 more

 

// with yarn-cluster

org.apache.spark.SparkException: YARN mode not available ?
	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1561)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
	at com.bretthoerner.TestSpark.testSpark(TestSpark.java:116)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runners.Suite.runChild(Suite.java:127)
	at org.junit.runners.Suite.runChild(Suite.java:26)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.cluster.YarnClusterScheduler
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:171)
	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1554)
	... 39 more

 

 

 


Master Collaborator

In general, you do not run Spark applications directly as Java programs. You have to run them with spark-submit, which sets up the classpath for you. Otherwise you have to set it up yourself, and that's the problem here: you didn't put all of the many YARN / Hadoop / Spark jars on your classpath.

 

spark-yarn and yarn-parent were discontinued in 1.2.0, but then brought back very recently for 1.2.1. You can see it doesn't exist upstream for 1.2.0: http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.spark%22%20AND%20a%3A%22yarn-parent_2....

 

CDH 5.3 was based on 1.2.0, so that's why.

 

That said, that is not the artifact you are missing here. You don't even have YARN code on your classpath.
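
A quick way to confirm this is to probe the classpath for the two classes named in the stack traces above. This is only a minimal sketch: `YarnClasspathCheck` and `isOnClasspath` are hypothetical names, not part of Spark or Hadoop, and finding both classes does not guarantee that every other required jar is present.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark or Hadoop): reports whether the
// classes named in the two stack traces can be loaded from the classpath.
final class YarnClasspathCheck {

    // Class names copied from the ClassNotFoundException messages in the question.
    static final List<String> REQUIRED = Arrays.asList(
            "org.apache.hadoop.yarn.conf.YarnConfiguration",
            "org.apache.spark.scheduler.cluster.YarnClusterScheduler");

    // Returns true if the class can be located; the class is not initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, YarnClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : REQUIRED) {
            System.out.println((isOnClasspath(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Running this with the same classpath as the failing test should show which of the two is missing.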

avatar

Thanks,

 

I guess my follow-up is... am I insane for wanting to do it that way? Do other people with existing JVM-based apps that need to submit jobs actually use a ProcessBuilder to run spark-submit?

Master Collaborator

Nah it's not crazy, just means you have to do some of the work that spark-submit does. Almost all of that is dealing with the classpath. If you're just trying to get a simple app running I think spark-submit is the way to go. But if you're building a more complex product or service you might have to embed Spark and deal with this. 

 

For example, I had to do just this recently, and here's what I came up with: https://github.com/OryxProject/oryx/tree/master/bin

 

In future versions (like 1.4+) there's going to be a more proper programmatic submission API. I know Marcelo here has been working on that.
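
For the simpler route, calling spark-submit from inside an existing JVM app, a minimal sketch with the JDK's ProcessBuilder might look like the following. The class name `SparkSubmitRunner`, the main class `com.example.MyJob` and the jar path are placeholders, and it assumes `spark-submit` is on the PATH of the machine running the app.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the spark-submit script; not a Spark API.
final class SparkSubmitRunner {

    // Builds the argument list: spark-submit --master <m> --class <c> <jar> [app args].
    static List<String> buildCommand(String sparkSubmit, String master,
                                     String mainClass, String appJar,
                                     List<String> appArgs) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(sparkSubmit);
        cmd.add("--master");
        cmd.add(master);
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add(appJar);
        cmd.addAll(appArgs);
        return cmd;
    }

    // Starts the process, forwards its output to this JVM's console,
    // and returns the exit code once it finishes.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("spark-submit", "yarn-cluster",
                "com.example.MyJob", "/path/to/my-job.jar", new ArrayList<String>());
        System.out.println(cmd);
        // int exitCode = run(cmd); // enable where spark-submit is actually installed
    }
}
```

The trade-off is that spark-submit handles the classpath, but the app only sees an exit code and whatever it parses from the process output.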

Explorer

Hi

 

I am trying to do the same thing, but the code you linked to has moved. Can you point me to the location within the Oryx code base that I should look at?

 

Thanks in advance

 

 

New Contributor

@sowen how do you manage conflicts between Oryx's dependencies and those of Spark? Which Spark jars should one include to launch a Spark job programmatically from Java code? I have been running into a few conflicts, the latest one being

 

java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder

Master Collaborator

The Spark, Hadoop and Kafka dependencies are 'provided' by the cluster at runtime and not included in the app. Other dependencies you must bundle with your app. In the case that they conflict with dependencies that leak from Spark you can usually use the user-classpath-first properties to work around them.
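
As a rough illustration of that workaround: as far as I know, the property names in Spark 1.3 and later are `spark.driver.userClassPathFirst` and `spark.executor.userClassPathFirst`, both marked experimental, so check the configuration docs for your version. The helper class below is hypothetical and only formats them as `--conf` arguments for spark-submit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns the user-classpath-first settings into
// "--conf key=value" arguments that can be passed to spark-submit.
final class UserClasspathFirstArgs {

    // Property names are an assumption based on the Spark 1.3+ configuration docs.
    static Map<String, String> settings() {
        Map<String, String> conf = new LinkedHashMap<String, String>();
        conf.put("spark.driver.userClassPathFirst", "true");
        conf.put("spark.executor.userClassPathFirst", "true");
        return conf;
    }

    // Emits one "--conf" flag followed by "key=value" for each setting, in order.
    static List<String> asSubmitArgs(Map<String, String> conf) {
        List<String> args = new ArrayList<String>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            args.add("--conf");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println(asSubmitArgs(settings()));
    }
}
```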

Explorer

Out of curiosity, is there a JIRA task for any work relating to the programmatic API to submit Spark jobs?

Master Collaborator

It's already complete and in 1.4. This was the initial JIRA: https://issues.apache.org/jira/browse/SPARK-4924