Created 02-02-2016 02:44 PM
I am porting an ETL process, comprising about 30 queries, from a relational data warehouse into Hive. The query migration is complete, and the job takes about an hour when the queries run serially. Some of the queries depend on others, while the rest do not. I'd like to interleave the queries that have no dependencies so that the job finishes faster. Is there a way to coordinate concurrent queries like this through Oozie, Falcon, or some other tool?
Created 02-02-2016 02:55 PM
Yes, you can use Oozie. Let's concentrate on that, because to run queries in parallel efficiently you will most likely need an Oozie workflow anyway, regardless of whether you use Falcon or a classic Oozie coordinator to schedule it (Falcon kicks off Oozie workflows under the covers).
What Oozie workflows provide is the ability to define an execution graph in which each action can transition to any other node (cycles are forbidden). Parallel execution is expressed with forks and joins: a fork starts two or more actions in parallel, and a join waits for all the paths it joins to finish before continuing. So you can build pretty much any dependency structure you want.
The example below is very simple, but you can also nest a fork inside a fork, and so on; a sketch of that follows the example. There are surely other ways to do this, but Oozie is most likely the canonical one.
For example:
<!-- Workflow fragment: the surrounding <workflow-app> element and the "check-files", "kill", and "end" nodes referenced below are omitted for brevity. -->
<start to="check-files"/>

<fork name="parallel-load">
    <path start="load1"/>
    <path start="load2"/>
</fork>

<!-- load1 and load2 run concurrently; both transition to the same join. -->
<action name="load1">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <jdbc-url>jdbc:hive2://hiveserver:10000/default</jdbc-url>
        <password>${hivepassword}</password>
        <script>/data/sql/load1.sql</script>
    </hive2>
    <ok to="join-node"/>
    <error to="kill"/>
</action>

<action name="load2">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <jdbc-url>jdbc:hive2://hiveserver:10000/default</jdbc-url>
        <password>${hivepassword}</password>
        <script>/data/sql/load2.sql</script>
    </hive2>
    <ok to="join-node"/>
    <error to="kill"/>
</action>

<join name="join-node" to="end"/>
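To illustrate the nested-fork idea mentioned above, here is a rough sketch of the control nodes only (the load1/load2/load3 names are made up, and their action definitions, which would look like the hive2 actions above, are omitted):

<fork name="outer-fork">
    <path start="inner-fork"/>
    <path start="load3"/>
</fork>

<fork name="inner-fork">
    <path start="load1"/>
    <path start="load2"/>
</fork>

<!-- load1 and load2 would each end with <ok to="inner-join"/> -->
<join name="inner-join" to="outer-join"/>

<!-- load3 would end with <ok to="outer-join"/> -->
<join name="outer-join" to="end"/>

In practice it is often simpler to keep one flat fork with as many paths as you have independent queries, and only nest forks where a group of queries shares a common prerequisite.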
Created 02-02-2016 02:54 PM
The Oozie coordinator supports a very flexible data-dependency-based triggering framework. It is important to note that data-availability-based scheduling is a little more involved than plain time-based triggering: a coordinator action is only released to run once the dataset instances it declares as input actually exist.
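For illustration only (the dataset name, paths, and dates here are made up), a coordinator that runs the ETL workflow once per day, but only after that day's input directory has been written with a _SUCCESS flag, could look roughly like this:

<coordinator-app name="etl-daily" frequency="${coord:days(1)}"
                 start="2016-02-03T00:00Z" end="2017-02-03T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- One dataset instance is expected per day under this URI template. -->
        <dataset name="raw-extract" frequency="${coord:days(1)}"
                 initial-instance="2016-02-03T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/raw/${YEAR}/${MONTH}/${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <!-- The workflow is not started until the current day's instance is available. -->
        <data-in name="todays-extract" dataset="raw-extract">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/apps/etl/workflow.xml</app-path>
        </workflow>
    </action>
</coordinator-app>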
Use an Oozie bundle, which is a collection of Oozie coordinator applications together with a directive on when to kick off those coordinators. Bundles can be started, stopped, suspended, and managed as a single entity instead of managing each individual coordinator they are composed of. This is a very useful level of abstraction in many large enterprises: these data pipelines can get rather large and complicated, and the ability to manage them as a single entity instead of meddling with the individual parts brings a lot of operational benefits.
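As a rough sketch (the coordinator names and paths are made up), a bundle that groups two coordinators and starts them together could look like this:

<bundle-app name="etl-bundle" xmlns="uri:oozie:bundle:0.2">
    <controls>
        <!-- When the coordinators in this bundle should be kicked off. -->
        <kick-off-time>2016-02-03T00:00Z</kick-off-time>
    </controls>
    <coordinator name="dimension-loads">
        <app-path>${nameNode}/apps/etl/coord-dimensions.xml</app-path>
    </coordinator>
    <coordinator name="fact-loads">
        <app-path>${nameNode}/apps/etl/coord-facts.xml</app-path>
    </coordinator>
</bundle-app>

The whole bundle can then be started, suspended, or killed as one unit with the oozie job command against the bundle ID.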