Support Questions


Oozie - Create a Spark workflow

Explorer

Hi,

 

I am trying to create a workflow in Oozie for a Spark job. I read the documentation about the two files, job.properties and workflow.xml, but I have a problem:

 

My Spark job uses local files, so I don't want to use HDFS to execute it.

 

export HADOOP_USER_NAME=hdfs;spark-submit --master yarn-cluster --class=com.budgetbox.decisionnel.aggregator.launcher.batch.AggregatorDataFlowBatch --conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote -XX:+UseG1GC " --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties  -XX:+UseG1GC -XX:+UseCompressedOops" --num-executors 3 --executor-cores 4 --executor-memory 30GB  --driver-memory 8g /root/scripts/aggregator-dataflow-assembly-3.0.7-SNAPSHOT.jar --action FILE_CONSOLIDATOR --pathAction /data/bbox/realtime/action --pathSessionActivity /data/bbox/realtime/session-activity --pathFlags /data/bbox/flags --consolidateSuffix -consolidate --writeRepartition 48

 

We've created a bash script and placed it in crontab, but we think Oozie would be a better solution.

 

It's hard to understand how to fill in the job.properties and workflow.xml files because of the local files used by my Spark job.

 

Do I need to create a new Spark workflow, or can I just create a new shell action workflow and execute the script?

 

Can you help me with this?

1 ACCEPTED SOLUTION

Explorer

Hi GeKas,

 

"Regarding the rest. First of all you don't have to be a scala developer to schedule a script in oozie 🙂"

Right, but as a System / Big Data administrator it's usually not in my scope. Still, it's good to know 😛

 

So, it now works with this XML syntax (workflow.xml). I found the correct approach with a shell action two days ago, and I've implemented several jobs with this workflow, since the variables make it generic:

 

<workflow-app name="${wfname}" xmlns="uri:oozie:workflow:0.5">
    <start to="shellAction"/>
    <action name="shellAction">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${command}</exec>
            <file>${scriptfile}</file>
            <file>${jarfile}</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="killAction"/>
    </action>
    <kill name="killAction">
        <message>"Killed job due to error"</message>
    </kill>
    <end name="end"/>
</workflow-app>

I have my job.properties with my variables locally, I put the workflow.xml in HDFS in the specified directory, and the jar into a subdirectory named "lib". It executes the Spark job without error; it's perfect!
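For reference, a minimal job.properties matching the variables used in the workflow above might look like the following sketch. The host names and HDFS paths are placeholders (only the jar name is taken from the spark-submit command earlier in the thread), so adapt them to your cluster:

```properties
# job.properties -- kept locally and passed at submission time.
# Host names and paths below are examples, not real cluster values.
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
queueName=default

wfname=aggregator-consolidator
command=./runConsolidator.sh
scriptfile=${oozie.wf.application.path}/runConsolidator.sh
jarfile=${oozie.wf.application.path}/lib/aggregator-dataflow-assembly-3.0.7-SNAPSHOT.jar

# HDFS directory containing workflow.xml (and the lib/ subdirectory)
oozie.wf.application.path=${nameNode}/user/hdfs/workflows/aggregator
```

The workflow is then submitted with this file, e.g. `oozie job -config job.properties -run`.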

 

Thank you for your help with the workflow; it's not easy to understand at the beginning.

 

I still don't understand all of the syntax. For example, I need to add an action to my workflow to send emails. I found the email action, but it doesn't work. (I don't have an SMTP server for the moment, so it can fail on the actual sending; I just want to get the syntax accepted by Oozie first.)

 

Now I'm trying to schedule the job with a coordinator, but it doesn't work: I have an issue with the timezone. I put "Europe/Paris" in the timezone field, but the execution time is not what I expect; there is always a difference between the time I want and the time displayed.

 

I've already configured the timezone in the Hue configuration, but there is still a difference: Hue and Oozie do not display the same time (I think Oozie uses UTC, and Hue probably my timezone).
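That guess matches Oozie's documented behavior: by default Oozie processes all times in UTC, coordinator start/end instants must be written in UTC (with a trailing "Z"), and the timezone attribute is only used for daylight-saving adjustments of the frequency, while Hue may render times in the browser's local timezone. A sketch of a daily coordinator under that assumption (names and dates are illustrative):

```xml
<!-- coordinator.xml: start/end are UTC instants. For a 02:00
     Europe/Paris run in summer (UTC+2), the UTC start is 00:00Z. -->
<coordinator-app name="${wfname}-coord" frequency="${coord:days(1)}"
                 start="2018-07-01T00:00Z" end="2019-01-01T00:00Z"
                 timezone="Europe/Paris" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```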

 

 

And my last question: is there a way to load my submitted job into Hue? For example, I've written the workflow myself and I want to submit it from Hue (easier for the customer).


15 REPLIES

Super Collaborator
"As you said, when it's executed, we don't know on which YARN node the command will be executed. So, with this storage bay mounted on each node, it doesn't matter on which node it's executed (I think :p)"

Correct.

 

Regarding the rest. First of all, you don't have to be a Scala developer to schedule a script in Oozie 🙂
Your command should be "./runConsolidator.sh". What is important is that your script has execute permissions and that you define it in "Files".
How it works: this shell action runs as a YARN job, so YARN will create a temp folder, e.g. "/yar/nm/some_id/another_id". All files defined in "Files" of this action will automatically be downloaded into this directory. This directory will be your working directory, so you should run your command with "./" in front, since by default "." is not on the PATH.

 

NOTE: If your script uses jar files etc., then you should define all of them in "Files" so they will be copied to the working directory.
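The localization behavior described above can be simulated locally. This is only an illustration (a temp directory stands in for YARN's container working directory, and the script body is a stand-in), not an actual Oozie run:

```shell
#!/bin/sh
# Simulate a shell action's working directory: files listed in
# "Files" are localized into a fresh directory, and the command
# must be invoked as ./name because "." is not on the PATH.
workdir=$(mktemp -d)

# Stand-in for the localized script (name taken from the reply above).
cat > "$workdir/runConsolidator.sh" <<'EOF'
#!/bin/sh
echo "running from $(pwd)"
EOF

chmod +x "$workdir/runConsolidator.sh"   # execute permission is required

cd "$workdir" && ./runConsolidator.sh
```

Running it prints the temp directory path, which mirrors why the `<exec>` command in the workflow is defined relative to the container's working directory.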


I suggest you proceed with this approach. Writing the XML by hand can get messy, and you need some experience to avoid mistakes. Once you create a working job from HUE, you can export the XML and start experimenting.


Super Collaborator
You mean how the user can submit the job from HUE? If you save the file in HDFS as "workflow.xml", go to the File Browser in HUE. You will notice that if you select the checkbox of this file, a "Submit" action button will appear. The user can just hit it.

Explorer

Cool, so easy 🙂

 

With my workflow, do you have any idea about the email sending? I know Oozie has an email action, but it doesn't work; maybe my syntax is bad. I think we must edit the "kill" and "end" sections?

Super Collaborator

Try to send a mail from the console first; if that doesn't work, then the e-mail action will probably not work either:

mail -s test_subject user@mail.address << EOF
This is a test e-mail
EOF

There are various scenarios for sending an e-mail. If you need an e-mail when an error is encountered, or when an action takes too long, then you have to enable SLAs on that action and define the recipient.

If you need an e-mail confirming that the workflow executed successfully, then add a mail action just before the end of the workflow. If any previous action fails, the e-mail action will not be executed, unless of course you have modified the error transition of one of your actions and pointed it at this e-mail action.

You have multiple options to cover multiple scenarios.
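As a sketch of the success-notification scenario, an Oozie email action looks roughly like this; the recipient is a placeholder, the action and transition names assume the workflow posted earlier in the thread, and an SMTP server still has to be configured on the Oozie server (oozie.email.smtp.host etc. in oozie-site.xml) for the mail to actually go out:

```xml
<action name="emailOnSuccess">
    <email xmlns="uri:oozie:email-action:0.2">
        <to>user@mail.address</to>
        <subject>Workflow ${wf:id()} finished successfully</subject>
        <body>The workflow ${wf:name()} completed without errors.</body>
    </email>
    <ok to="end"/>
    <error to="killAction"/>
</action>
```

To wire it in, the shell action's `<ok to="end"/>` would point at "emailOnSuccess" instead of "end".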

Explorer

The developer (customer side) who works with me on the cluster has been trying Apache Airflow, and after one week he can do everything we need (workflows, emailing/alerting, re-runs, ...) without having to load files into HDFS. Apache Airflow runs in standalone mode, and its web UI is better than the Oozie UI.

 

It seems like a better solution than Oozie; what do you think about it?

 

As it is an incubating project, I don't know if it's a good idea, but the web UI is good and it looks easy to manage. I didn't know this new project before, but I think Oozie is outdated compared to Airflow.

 

For the moment Oozie is on stand-by; they will make a choice between Oozie and Airflow, but I must admit that Airflow looks like the better solution.