Created 07-10-2017 11:21 AM
Hi,
I am scheduling an Oozie workflow with Falcon. The workflow executes a shell script that runs spark-submit. Sometimes the jobs finish successfully, but most of the time they get killed. The Oozie error logs contain a few warnings like:
2017-07-06 14:10:01,907 WARN ParameterVerifier:523 - SERVER[<host>] USER[ambari-qa] GROUP[-] TOKEN[] APP[FALCON_PROCESS_DEFAULT_estimatePi7] JOB[0000040-170706133706258-oozie-oozi-W] ACTION[0000040-170706133706258-oozie-oozi-W@user-action] The application does not define formal parameters in its XML definition
2017-07-06 14:10:01,952 WARN LiteWorkflowAppService:523 - SERVER[<host>] USER[ambari-qa] GROUP[-] TOKEN[] APP[FALCON_PROCESS_DEFAULT_estimatePi7] JOB[0000040-170706133706258-oozie-oozi-W] ACTION[0000040-170706133706258-oozie-oozi-W@user-action] libpath [hdfs://<host>:8020/user/oozie/shell/lib] does not exist
2017-07-06 14:10:02,202 WARN CompletedActionXCommand:523 - SERVER[<host>] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0000040-170706133706258-oozie-oozi-W] ACTION[0000040-170706133706258-oozie-oozi-W@user-action] Received early callback for action still in PREP state; will wait [10,000]ms and requeue up to [5] more times
2017-07-07 07:43:10,658 WARN ShellActionExecutor:523 - SERVER[<host>] USER[ambari-qa] GROUP[-] TOKEN[] APP[ShellAction] JOB[0000007-170707072402346-oozie-oozi-W] ACTION[0000007-170707072402346-oozie-oozi-W@shellAction] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
Since they are warnings, I don't think they're the reason why the jobs are failing. Nevertheless I've tried to get rid of them and failed...
For the 'Received early callback for action still in PREP state' warning, I've added a 30-second sleep to my shell script, but the warning still occurs occasionally.
For the 'libpath does not exist' warning, I've added the following property to oozie-site.xml:
oozie.libpath=${nameNode}/user/oozie/share/lib
I've also added this to my job.properties file and to the Falcon process. The warning still states that the libs are missing from /user/oozie/shell/lib. Are these even related? It seems like Oozie is searching for lib in the directory I specified for my workflow.xml file.
I have no idea what to do about the 'Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]' warning. I can't find anything related to it.
Does anybody have any idea what might cause the jobs to fail?
Below are the configuration files.
workflow.xml
<workflow-app name="ShellAction" xmlns="uri:oozie:workflow:0.4">
    <start to="shellAction"/>
    <action name="shellAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>script.sh</exec>
            <file>/user/oozie/shell/job.properties#job.properties</file>
            <file>/user/oozie/shell/script.sh#script.sh</file>
            <file>/user/oozie/shell/PiEstimation.jar#PiEstimation.jar</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="killAction"/>
    </action>
    <kill name="killAction">
        <message>"Killed job due to error"</message>
    </kill>
    <end name="end"/>
</workflow-app>
job.properties
nameNode=hdfs://[<host>]:8020
jobTracker=[<host>]:8050
queueName=default
oozie.wf.application.path=${nameNode}/user/oozie/shell
oozie.libpath=${nameNode}/user/${user.name}/share/lib
oozie.use.system.libpath=true
script.sh
sleep 30
/usr/hdp/current/spark-client/bin/spark-submit \
    --class org.apache.falcon.example.spark.SparkPI \
    --conf spark.ui.port=4050 \
    --driver-memory 2g \
    --executor-memory 1g \
    /apps/spark/PiEstimation.jar 100 >> /apps/spark/PiEstimationOut.log
falcon process
<process xmlns='uri:falcon:process:0.1' name='estimatePi7'>
    <clusters>
        <cluster name='primaryCluster'>
            <validity start='2017-07-07T07:10Z' end='2017-07-07T07:45Z'/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>LIFO</order>
    <frequency>minutes(5)</frequency>
    <timezone>UTC</timezone>
    <properties>
        <property name="oozie.libpath" value="${nameNode}/user/oozie/share/lib" />
    </properties>
    <workflow name='ShellAction' engine='oozie' path='/user/oozie/shell/'/>
    <retry policy='periodic' delay='minutes(1)' attempts='3'/>
    <ACL owner='ambari-qa' group='users' permission='0755'/>
</process>
I'm attaching the YARN logs.
Created 07-28-2017 08:06 AM
Hi Rafal,
Can you please try the following?
Change the order of the properties so that oozie.use.system.libpath=true comes right after queueName=default. Execute once and check.
Change the oozie.libpath value to ${nameNode}/user/oozie/share/lib/lib_<timestamp> (<timestamp> is the timestamp number you see under this folder). Execute once and check.
Copy all the jars from ${nameNode}/user/oozie/share/lib/*/*/* to a custom location such as ${nameNode}/user/${user.name}/share/lib. Execute once and check.
One of them should work out.
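For the second option, the lib_<timestamp> folder name can be looked up in HDFS. The snippet below is only a rough sketch: latest_sharelib is a hypothetical helper name (it is not part of Oozie or Hadoop), and it assumes the usual hdfs dfs -ls output where the path is the last column, and that the folder names end in a fixed-width numeric timestamp so that a plain text sort puts the newest one last.

```shell
# Hypothetical helper: reads `hdfs dfs -ls` style output on stdin and
# prints the path of the newest lib_<timestamp> directory.
latest_sharelib() {
  # Keep only the last column (the path), keep only lib_<digits> entries,
  # then sort as text and take the last one (assumes fixed-width timestamps).
  awk '{print $NF}' | grep -E '/lib_[0-9]+$' | sort | tail -n 1
}

# Intended usage on a cluster node (not run here, needs an HDFS client):
#   hdfs dfs -ls /user/oozie/share/lib | latest_sharelib
```

The printed path, prefixed with ${nameNode}, would then be the value to try for oozie.libpath in job.properties.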