Using shellaction to run spark-submit. status tracking/Yarn

New Contributor

Hi, I have been tasked with integrating Oozie as a manager for our Spark queries. I've been doing research for the last couple of days but keep running into a wall. The examples I find only briefly touch on the subject.

Setup:

The Spark queries run over HBase and do some calculations. Currently I use spark-submit to run the Spark job in local mode on our server. (Production will not be a cluster, but it will still have YARN on it to manage the stack.)

It's the standard invocation: spark-submit --class com.mypackage.MyClass path/to/my.jar

Now I need to add Oozie to schedule Spark to run on intervals (and to add more complex procedures later). I've uploaded the jar to HDFS and have a workflow.xml there as well. Unfortunately I cannot use SparkAction, as I am pegged at a lower Oozie version, so I am using a shell action instead:

<workflow-app name="frequent_location" xmlns="uri:oozie:workflow:0.4">
  <start to="frequent_location"/>
  <action name="frequent_location">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>spark-submit --class com.mypackage.MyClass /user/hue/oozie/workspaces/_hue_-oozie-1-1457990185.41/my.jar</exec>
    </shell>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

and a job.properties (also very standard):

nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8050
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozie.wf.application.path=${nameNode}/user/hue/oozie/workspaces/_hue_-oozie-1-1457990185.41/workflow.xml 

Then I do sudo su oozie (switch into the oozie user) and run:

oozie job -oozie http://localhost:11000/oozie -config job.properties -run

Now in the Oozie logs I see these not-so-great-sounding messages:

B[0000020-160314153256895-oozie-oozi-W] ACTION[0000020-160314153256895-oozie-oozi-W@frequent_location] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1] 2016-03-16 17:45:08,945 INFO

ActionEndXCommand:543 - SERVER[sandbox.hortonworks.com] USER[oozie] GROUP[-] TOKEN[] APP[frequent_location] JOB[0000020-160314153256895-oozie-oozi-W] ACTION[0000020-160314153256895-oozie-oozi-W@frequent_location] ERROR is considered as FAILED for SLA

The Hadoop job history server at http://localhost:19888/jobhistory says the job SUCCEEDED. I am doubtful.

Questions:

1) What is the correct way to run Spark jobs via Oozie? I also have YARN on this system. Should I be running Spark in YARN cluster mode, even though I will only have this one server with the entire stack on it?

2) How does Oozie track success/failure of shell commands? Do I need to return something from my Scala Spark procedure to correctly set the status? What happens if it throws an exception? How does spark-submit then propagate that to Oozie? This is somewhat critical if we have to add more complex procedures (like chained Spark procedures).

8 REPLIES 8

Re: Using shellaction to run spark-submit. status tracking/Yarn

I can only comment on question #1, but I would say you definitely want to run in one of the YARN modes (yarn-client or yarn-cluster), if only so Oozie can get information on container usage and job completion. You can do this by using the --master command-line argument:

	spark-submit --master yarn-client --class com.mypackage.MyClass /user/hue/oozie/workspaces/_hue_-oozie-1-1457990185.41/my.jar

Re: Using shellaction to run spark-submit. status tracking/Yarn

New Contributor

I kinda figured that if I have YARN on the system I should be running in one of the YARN modes. I'll give it a try and see if it works better.

Re: Using shellaction to run spark-submit. status tracking/Yarn

1. The best way to run a Spark job through Oozie is to use the Oozie Spark action directly.

https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html

2. I think that by default Oozie uses the exit status of the shell command run by the shell action to determine whether the job succeeded or failed.
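A minimal sketch of that mechanism, using a hypothetical stand-in function (run_job is not part of Oozie; it just simulates a spark-submit call that exits non-zero):

```shell
# Sketch: Oozie's shell action reads the exit status of the wrapped command.
# run_job is a hypothetical stand-in for a failing spark-submit invocation.
run_job() { return 1; }

run_job
status=$?
if [ "$status" -ne 0 ]; then
  # This is the case Oozie would report as FAILED.
  echo "command exited with code $status"
fi
```

This is why an application that swallows exceptions and exits 0 can look "SUCCEEDED" to the scheduler even though it did no useful work.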

Re: Using shellaction to run spark-submit. status tracking/Yarn

Expert Contributor

1. Although the best way to run a Spark job in Oozie is the Spark action, it is not supported by Hortonworks yet (see also this). There are also tweaks and workarounds you need to make it work.

2. The exit status is correct for SparkAction as far as I have experienced, but it may not be correct for other actions (especially for older Spark versions). See, for example: https://issues.apache.org/jira/browse/SPARK-7736

Apart from these, your setup doesn't seem to be correct (one thing I notice is the job tracker port). See this thread for the correct setup.

Re: Using shellaction to run spark-submit. status tracking/Yarn

New Contributor

Hi, as mentioned in the post, I am pegged at a lower Oozie version that does not support SparkAction. I am limited to the shell or java action. I've noticed a lot of posts on this forum where people say that port 8032 is correct for the tracker. However, yarn.resourcemanager.address is explicitly set to 8050 in the YARN config.

[Screenshot (2016-03-17): YARN configuration showing yarn.resourcemanager.address]

Re: Using shellaction to run spark-submit. status tracking/Yarn

Expert Contributor

@Alex C: yes, you need to change yarn.resourcemanager.address to 8032, together with the job tracker port in the Oozie job definition. This is the only port number that works at the moment.
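The corresponding change in the job.properties from the original post would look like this (a sketch, assuming the same sandbox hostname):

```properties
# job.properties fragment (assumes the sandbox host from the original post)
jobTracker=sandbox.hortonworks.com:8032
```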

Re: Using shellaction to run spark-submit. status tracking/Yarn

New Contributor

As explicitly stated in the post, I am pegged at a lower Oozie version:

[lhserver@sandbox ~]$ oozie version
Oozie client build version: 4.1.0.2.2.4.2-2

Re: Using shellaction to run spark-submit. status tracking/Yarn

Expert Contributor

Pay attention to the format of shell action arguments; it should be like:

<exec>java</exec>
<argument>-classpath</argument>
<argument>$CLASSPATH</argument>
<argument>Hello</argument>

instead of a single command.
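Applied to the workflow from the question, the spark-submit call would be split up like this (a sketch, reusing the class and jar path from the original post and the yarn-client mode suggested earlier in the thread):

```xml
<exec>spark-submit</exec>
<argument>--master</argument>
<argument>yarn-client</argument>
<argument>--class</argument>
<argument>com.mypackage.MyClass</argument>
<argument>/user/hue/oozie/workspaces/_hue_-oozie-1-1457990185.41/my.jar</argument>
```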

Also be aware that the shell command is executed on an arbitrary node of the cluster, so all tools you're using have to be preinstalled on all the nodes. That's not your case for now, because you're using a single-node sandbox, but it might be a problem in production.

Regards