Unable to remove fork/join tags from Hue/Oozie editor after removing parallel steps.

I'm a bit new to working with Oozie, so please bear with me if I'm missing something basic.

I'm creating an Oozie workflow that runs the com.github.ehiggs.spark.terasort classes from spark-terasort.jar on a Cloudera cluster.

After a fair bit of struggling, I was able to get it working, but I wasn't satisfied with the process.

I originally created three hadoop fs actions to delete the output directories left by previous TeraGen/TeraSort/TeraValidate runs, and used the Hue Oozie editor to make them run in parallel. After the parallel steps, I added steps to execute the Spark programs. (The online documentation on getting Spark steps to work correctly is a bit incomplete right now.) Once everything was working, I looked at ways to optimize the process.

First, I found that the file system steps can be performed in the <prepare> section of the TeraGen step. That would let me eliminate the parallel steps and also make the status bar give a more accurate reading of how far the workflow has progressed.

Once I had added the directory deletions to the <prepare> section, I deleted the parallel actions that had been performing them. This is where I ran into problems. I started getting this error:

E0701: XML schema error, cvc-complex-type.2.4.b: The content of element 'fork' is not complete. One of '{"uri:oozie:workflow:0.5":path}' is expected.
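For context, the message says the validator found a <fork> with no <path> children, which matches the empty fork-736f element in the generated file. A minimal sketch of a fork/join pair that the workflow schema would accept is shown here for comparison. The node names (fork-example, delete-a, delete-b, join-example, next-step) are hypothetical and are not taken from this workflow.

```xml
<!-- Sketch only: all node names here are hypothetical -->
<fork name="fork-example">
  <!-- each path starts one parallel branch; a fork with no path fails validation -->
  <path start="delete-a"/>
  <path start="delete-b"/>
</fork>
<!-- delete-a and delete-b would each transition to the join on success -->
<join name="join-example" to="next-step"/>
```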

 

When I look at the Oozie XML generated by the Hue editor, I can see that the <fork> and <join> nodes weren't removed when I deleted the parallel steps, and I haven't figured out how to delete them. I can edit the workflow.xml file by hand, but the next time I open the graphical editor, it overwrites my edits and re-adds the incomplete nodes (see the snippet from the workflow.xml file and the graphical workflow below).

Is there a good way to fix this? I've started a separate Oozie workflow, but I'm again struggling to get the Spark actions to work correctly.

Thanks in advance.

David Webb

<workflow-app name="TeraGen_-_TeraSort_-_TeraValidate_-_1GB-" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-e177"/>
  <kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <fork name="fork-736f">
  </fork>
  <join name="join-bd18" to="End"/>
  <action name="spark-e177">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}/user/davidw/teravalidate-benchmark.out"/>
        <delete path="${nameNode}/user/davidw/terasort-benchmark.out"/>
        <delete path="${nameNode}/user/davidw/terasort-benchmark.in"/>
      </prepare>

...
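For comparison, a hand-edited version of the snippet above with the dangling nodes dropped might look roughly like the sketch below. Only the fork-736f and join-bd18 elements are removed. The body of the spark action stays elided as in the original, and the remaining actions and the End node are assumed to be unchanged from the generated file. Since the graphical editor overwrites manual edits on the next save, a file like this would only be useful if submitted outside the editor.

```xml
<workflow-app name="TeraGen_-_TeraSort_-_TeraValidate_-_1GB-" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-e177"/>
  <kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <!-- fork-736f and join-bd18 removed by hand -->
  <action name="spark-e177">
    <!-- spark action body as generated, elided here as in the original snippet -->
    ...
  </action>
  <!-- remaining actions and the End node, assumed unchanged from the generated file -->
  ...
</workflow-app>
```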

 

[Graphical workflow in the Hue editor]
TeraGen - TeraSort - TeraValidate - 1GB-copy
Execute TeraGen, TeraSort, and TeraValidate with a 1GB dataset

Three Spark steps, each shown as "Spark - Execute TeraGen - 1GB" (TeraGen), using /user/hue/oozie/workspaces/hue-oozie-1449671502.72/lib/spark-terasort.jar with these classes:
com.github.ehiggs.spark.terasort.TeraGen
com.github.ehiggs.spark.terasort.TeraSort
com.github.ehiggs.spark.terasort.TeraValidate

Re: Unable to remove fork/join tags from Hue/Oozie editor after removing parallel steps.

Just a quick update: I also found that if I set the transition on the first step to "End" and then delete the first step, the saved workflow begins with <start to="End"/>.