What dependencies to submit Spark jobs programmatically (not via spark-submit)?


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf();
conf.setAppName("Test");
conf.setMaster("yarn-cluster"); // or with "yarn-client"
JavaSparkContext context = new JavaSparkContext(conf);

I am trying to run the above code against CDH 5.2 and CDH 5.3. I receive the errors below (note that I tried both "yarn-client" and "yarn-cluster"). I noticed that in the Cloudera Maven repo there doesn't seem to be a stable (non-SNAPSHOT) spark-yarn [1] or yarn-parent [2] for CDH 5.2 and 5.3. Is this supported? Any tips?

 

[1] https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-yarn_2.10/

[2] https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/yarn-parent_2.10/

 

// with yarn-client

java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/conf/YarnConfiguration
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:171)
	at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:102)
	at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:101)
	at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
	at com.bretthoerner.TestSpark.testSpark(TestSpark.java:111)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runners.Suite.runChild(Suite.java:127)
	at org.junit.runners.Suite.runChild(Suite.java:26)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.conf.YarnConfiguration
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	... 44 more

 

// with yarn-cluster

org.apache.spark.SparkException: YARN mode not available ?
	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1561)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
	at com.bretthoerner.TestSpark.testSpark(TestSpark.java:116)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runners.Suite.runChild(Suite.java:127)
	at org.junit.runners.Suite.runChild(Suite.java:26)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.cluster.YarnClusterScheduler
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:171)
	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1554)
	... 39 more

 

 

 

ACCEPTED SOLUTION

Re: What dependencies to submit Spark jobs programmatically (not via spark-submit)?

Master Collaborator

Nah it's not crazy, just means you have to do some of the work that spark-submit does. Almost all of that is dealing with the classpath. If you're just trying to get a simple app running I think spark-submit is the way to go. But if you're building a more complex product or service you might have to embed Spark and deal with this. 

 

Example, I had to do just this recently, and here's what I came up with: https://github.com/OryxProject/oryx/tree/master/bin

 

In future versions (like 1.4+) there's going to be a more proper programmatic submission API. I know Marcelo here has been working on that.

17 REPLIES

Re: What dependencies to submit Spark jobs programmatically (not via spark-submit)?

Master Collaborator

In general, you do not run Spark applications directly as Java programs. You have to run them with spark-submit, which sets up the classpath for you. Otherwise you have to set it up yourself, and that's the problem here: you didn't put all of the many YARN / Hadoop / Spark jars on your classpath.

 

spark-yarn and yarn-parent were discontinued in 1.2.0, but then brought back very recently for 1.2.1. You can see it doesn't exist upstream for 1.2.0: http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.spark%22%20AND%20a%3A%22yarn-parent_2....

 

CDH 5.3 was based on 1.2.0, so that's why.

 

That said, that is not the artifact you are missing here. You don't even have YARN code on your classpath.
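For anyone assembling the classpath by hand in Maven, a rough sketch of the kind of dependencies involved follows. The versions here are illustrative examples only (the thread notes spark-yarn_2.10 exists upstream for 1.2.1 but not 1.2.0), not a tested combination; match them to your cluster.

```xml
<!-- Illustrative only: pick versions matching your CDH / Spark release. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-yarn_2.10</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.5.0</version>
</dependency>
```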


Re: What dependencies to submit Spark jobs programmatically (not via spark-submit)?

Thanks,

 

I guess my follow-up is... am I insane for wanting to do it that way? Do other people with existing JVM-based apps that need to submit jobs actually use a ProcessBuilder to run spark-submit?
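The ProcessBuilder approach asked about here can be sketched roughly as follows. Everything in this sketch is hypothetical (class name, main class, jar path), and it assumes spark-submit is on the PATH; it only illustrates the shape of the idea, not a recommended implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: shelling out to spark-submit from an existing JVM app.
public class SparkSubmitLauncher {

    // Build the spark-submit command line; assumes spark-submit is on PATH.
    static List<String> buildCommand(String master, String mainClass, String appJar) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--master");
        cmd.add(master);
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add(appJar);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("yarn-cluster", "com.example.MyJob", "/path/to/app.jar");
        System.out.println(String.join(" ", cmd));
        // To actually run it (omitted so the sketch has no side effects):
        // int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```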


Re: What dependencies to submit Spark jobs programmatically (not via spark-submit)?

Explorer

Hi

 

I am trying to do the same thing, but the code link you posted has moved. Can you point me to the location within the Oryx code base that I should look at?

 

Thanks in advance

 

 


Re: What dependencies to submit Spark jobs programmatically (not via spark-submit)?

New Contributor

@sowen how do you manage Oryx dependency conflicts with those of Spark? What Spark jars should one include for launching a Spark job programmatically within Java code? I have been running into a few conflicts, the latest one being:

 

java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder

Re: What dependencies to submit Spark jobs programmatically (not via spark-submit)?

Master Collaborator

The Spark, Hadoop and Kafka dependencies are 'provided' by the cluster at runtime and not included in the app. Other dependencies you must bundle with your app. In cases where they conflict with dependencies that leak from Spark, you can usually use the user-classpath-first properties to work around them.
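For reference, a sketch of those user-classpath-first properties in spark-defaults.conf form. The property names below assume Spark 1.3+ (where they are marked experimental); earlier releases exposed a similar spark.files.userClassPathFirst setting instead.

```
spark.driver.userClassPathFirst    true
spark.executor.userClassPathFirst  true
```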

Re: What dependencies to submit Spark jobs programmatically (not via spark-submit)?

Explorer

Out of curiosity, is there a JIRA task for any work relating to the programmatic API to submit spark jobs?

Re: What dependencies to submit Spark jobs programmatically (not via spark-submit)?

Master Collaborator

It's already complete and shipped in 1.4. This was the initial JIRA: https://issues.apache.org/jira/browse/SPARK-4924
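A minimal sketch of that API, org.apache.spark.launcher.SparkLauncher, is below. It requires the spark-launcher artifact (Spark 1.4+) on the classpath, and the Spark home, jar path, and main class shown are placeholders, not values from this thread.

```java
import org.apache.spark.launcher.SparkLauncher;

public class LauncherExample {
    public static void main(String[] args) throws Exception {
        // spark-submit is still invoked under the hood, but the launcher
        // builds the command and manages the child process for you.
        Process spark = new SparkLauncher()
                .setSparkHome("/opt/spark")            // placeholder path
                .setAppResource("/path/to/app.jar")    // placeholder path
                .setMainClass("com.example.MyJob")     // placeholder class
                .setMaster("yarn-cluster")
                .launch();
        int exitCode = spark.waitFor();
        System.out.println("spark-submit exited with " + exitCode);
    }
}
```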
