I'm a bit new to working with Oozie, so please bear with me if I'm missing something basic.
I'm creating an Oozie workflow that executes the com.github.ehiggs.spark.terasort libraries (terasort.jar) on a Cloudera cluster.
After a fair bit of struggling, I was able to get it working, but I wasn't satisfied with the process.
I had originally created three Hadoop fs actions to delete the output directories produced by previous TeraGen/TeraSort/TeraValidate executions, and then I used the Hue Oozie editor to run them in parallel. After the parallel steps, I added steps to execute the Spark programs. (The online documentation on getting the Spark steps to work correctly is a bit incomplete right now.) Once everything was working, I looked at ways to optimize the process.
First, I found that the file system steps can be performed as part of the <prepare> portion of the TeraGen step. This would let me eliminate the parallel steps and also make the status bar give a more accurate reading of how far the workflow had progressed.
Once I had added the directory deletion steps to the <prepare> section, I deleted the parallel actions that had performed the deletions. This is where I ran into problems. I started getting an error:
E0701: XML schema error, cvc-complex-type.2.4.b: The content of element 'fork' is not complete. One of '{"uri:oozie:workflow:0.5":path}' is expected.
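As far as I can tell from that message, the 0.5 workflow schema expects every fork to contain at least one path child. For comparison, a complete fork/join pair would look roughly like the sketch below. The action names fs-aaaa and fs-bbbb are placeholders I made up for illustration; they are not names from my workflow.

```xml
<!-- Sketch only: fs-aaaa and fs-bbbb are hypothetical action names -->
<fork name="fork-736f">
    <!-- the schema requires one or more path elements here -->
    <path start="fs-aaaa"/>
    <path start="fs-bbbb"/>
</fork>
<!-- each forked action's ok transition would point at this join -->
<join name="join-bd18" to="End"/>
```

An empty fork, like the one left behind in my file, has no path children at all, which seems to be exactly what the validator is complaining about.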
When I look at the Oozie XML that's generated by the Hue editor, I can see that the <fork> and <join> nodes weren't deleted when I deleted the parallel steps, and I haven't figured out how to delete them. I can edit the file by hand, but the next time I open the graphical editor, it overwrites my edits and re-adds the incomplete nodes. (See the snippet from the workflow.xml file and the graphical workflow below.)
Is there a good way to fix this? I've started a separate Oozie workflow, but I'm struggling again to get the Spark actions to work correctly.
Thanks in advance.
David Webb
<workflow-app name="TeraGen_-_TeraSort_-_TeraValidate_-_1GB-" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-e177"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <fork name="fork-736f">
    </fork>
    <join name="join-bd18" to="End"/>
    <action name="spark-e177">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/davidw/teravalidate-benchmark.out"/>
                <delete path="${nameNode}/user/davidw/terasort-benchmark.out"/>
                <delete path="${nameNode}/user/davidw/terasort-benchmark.in"/>
            </prepare>
            ...
[Screenshot of the graphical workflow in the Hue editor: "TeraGen - TeraSort - TeraValidate - 1GB-copy", with the description "Execute TeraGen, TeraSort, and TeraValidate with a 1GB dataset"]