
Oozie and data triggers

Hello everyone!

 

I have two coordinators: the first schedules a workflow with several Sqoop import actions, and the second schedules a workflow with transformation actions (such as Spark and Hive) on the data produced by the first.

 

I need to make the second coordinator data-dependent on the first, so that it starts only once the first coordinator has finished importing all the tables.

 

I am trying to understand how coordinator triggers work, but I have some doubts about <output-events> and the <done-flag>.

 

I built a simple example to see how things work. It imports two MySQL tables in sequence and should produce dataset instances, using the following two files:

 

  coordinator.xml

<coordinator-app
	xmlns="uri:oozie:coordinator:0.2" frequency="${coord:minutes(30)}" name="COORDINATOR_MYSQL" start="${startDate}" end="${endDate}" timezone="UTC">
	<controls>
		<execution>FIFO</execution>
	</controls>
	<datasets>
		<dataset name="dataset1" frequency="${coord:minutes(12)}" initial-instance="2018-01-01T00:00Z" timezone="UTC">
			<uri-template>
				${nameNode}/user/cloudera/oozie/trigger-app/datasets/${YEAR}-${MONTH}-${DAY}__${HOUR}-${MINUTE}
			</uri-template>
			<done-flag>_TRIGGER</done-flag>
		</dataset>
	</datasets>
	<output-events>
		<data-out name="dataset1_output_event" dataset="dataset1">
			<instance>${coord:current(0)}</instance>
		</data-out>
	</output-events>
	<action>
		<workflow>
			<app-path>${nameNode}/user/cloudera/oozie/trigger-app/workflow.xml</app-path>
			<configuration>
				<property>
					<name>jobTracker</name>
					<value>${jobTracker}</value>
				</property>
				<property>
					<name>nameNode</name>
					<value>${nameNode}</value>
				</property>
				<property>
					<name>outputDir</name>
					<value>${coord:dataOut('dataset1_output_event')}</value>
				</property>
			</configuration>
		</workflow>
	</action>
</coordinator-app>

  workflow.xml

<workflow-app name="OOZIE_MYSQL_SQOOP_WF" xmlns="uri:oozie:workflow:0.4">

	<start to="sqoop_action_1" />

	<action name="sqoop_action_1">
		<sqoop
			xmlns="uri:oozie:sqoop-action:0.2">
			<job-tracker>${jobTracker}</job-tracker>
			<name-node>${nameNode}</name-node>
			<prepare>
				<delete path="${outputDir}/categories"/>
			</prepare>
			<arg>import</arg>
			<arg>--connect</arg>
			<arg>jdbc:mysql://localhost/retail_db</arg>
			<arg>--username</arg>
			<arg>root</arg>
			<arg>--password</arg>
			<arg>cloudera</arg>
			<arg>--table</arg>
			<arg>categories</arg>
			<arg>--split-by</arg>
			<arg>category_id</arg>
			<arg>--warehouse-dir</arg>
			<arg>${outputDir}</arg>
			<arg>--num-mappers</arg>
			<arg>1</arg>
		</sqoop>
		<ok to="sqoop_action_2"/>
		<error to="fail"/>
	</action>

	<action name="sqoop_action_2">
		<sqoop
			xmlns="uri:oozie:sqoop-action:0.2">
			<job-tracker>${jobTracker}</job-tracker>
			<name-node>${nameNode}</name-node>
			<prepare>
				<delete path="${outputDir}/products"/>
			</prepare>
			<arg>import</arg>
			<arg>--connect</arg>
			<arg>jdbc:mysql://localhost/retail_db</arg>
			<arg>--username</arg>
			<arg>root</arg>
			<arg>--password</arg>
			<arg>cloudera</arg>
			<arg>--table</arg>
			<arg>products</arg>
			<arg>--split-by</arg>
			<arg>product_id</arg>
			<arg>--warehouse-dir</arg>
			<arg>${outputDir}</arg>
			<arg>--num-mappers</arg>
			<arg>1</arg>
		</sqoop>
		<ok to="success"/>
		<error to="fail"/>
	</action>

	<kill name="fail">
		<message>JOB FAILED!</message>
	</kill>
	<end name="success"/>
</workflow-app>

 

   

 

 

When I run the coordinator, after the first workflow instance has completed, I see the following in HDFS:

 

  hdfs dfs -ls -R oozie/trigger-app/datasets

oozie/trigger-app/datasets
oozie/trigger-app/datasets/2018-03-12__22-00
oozie/trigger-app/datasets/2018-03-12__22-00/categories
oozie/trigger-app/datasets/2018-03-12__22-00/categories/_SUCCESS
oozie/trigger-app/datasets/2018-03-12__22-00/categories/part-m-00000
oozie/trigger-app/datasets/2018-03-12__22-00/products
oozie/trigger-app/datasets/2018-03-12__22-00/products/_SUCCESS
oozie/trigger-app/datasets/2018-03-12__22-00/products/part-m-00000

I expected Oozie to create the specified <done-flag> (_TRIGGER in this case) once both Sqoop actions had completed, but that did not happen. The _TRIGGER flag is missing; instead, there is a _SUCCESS flag inside each of the two table directories created under --warehouse-dir.

 

I probably do not understand how the <done-flag> works, or why it is missing here.

 

What I would like to do is generate a single flag (_TRIGGER or whatever) only after all the actions of the first workflow have completed, and make the second coordinator dependent on it.
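
To make the intended dependency concrete, here is a rough sketch of how the second coordinator might declare it. It is only an illustration under assumptions: the application name COORDINATOR_TRANSFORM, the transform-app path, the event name dataset1_input_event and the inputDir property are all hypothetical, and it presumes that a _TRIGGER file ends up inside each dataset instance directory. The dataset definition is copied from the first coordinator so that both resolve to the same directory.

```xml
<!-- Hypothetical consumer coordinator: waits for the _TRIGGER flag of dataset1 -->
<coordinator-app
	xmlns="uri:oozie:coordinator:0.2" frequency="${coord:minutes(30)}" name="COORDINATOR_TRANSFORM" start="${startDate}" end="${endDate}" timezone="UTC">
	<datasets>
		<!-- Same definition as in the first coordinator -->
		<dataset name="dataset1" frequency="${coord:minutes(12)}" initial-instance="2018-01-01T00:00Z" timezone="UTC">
			<uri-template>${nameNode}/user/cloudera/oozie/trigger-app/datasets/${YEAR}-${MONTH}-${DAY}__${HOUR}-${MINUTE}</uri-template>
			<done-flag>_TRIGGER</done-flag>
		</dataset>
	</datasets>
	<input-events>
		<!-- The action is held back until <instance dir>/_TRIGGER exists -->
		<data-in name="dataset1_input_event" dataset="dataset1">
			<instance>${coord:current(0)}</instance>
		</data-in>
	</input-events>
	<action>
		<workflow>
			<!-- Assumed location of the transformation workflow -->
			<app-path>${nameNode}/user/cloudera/oozie/transform-app/workflow.xml</app-path>
			<configuration>
				<property>
					<name>inputDir</name>
					<value>${coord:dataIn('dataset1_input_event')}</value>
				</property>
			</configuration>
		</workflow>
	</action>
</coordinator-app>
```

With this shape, each materialized action of the second coordinator would stay in a waiting state until the flag file for the matching dataset instance appears.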

How can I do that?
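
One possible direction, offered as a sketch rather than a confirmed solution: as far as I understand, <output-events> only resolves the output URI passed to the workflow and Oozie does not write the <done-flag> file itself, so the workflow would have to create it. A final fs action with touchz could do that. The action name create_trigger_flag is hypothetical, and the example assumes that sqoop_action_2 is changed to use <ok to="create_trigger_flag"/> instead of going straight to success.

```xml
<!-- Hypothetical last action of workflow.xml: runs only after both Sqoop imports succeeded -->
<action name="create_trigger_flag">
	<fs>
		<!-- Creates an empty file named like the dataset's done-flag in the instance directory -->
		<touchz path="${outputDir}/_TRIGGER"/>
	</fs>
	<ok to="success"/>
	<error to="fail"/>
</action>
```

Because the flag is written by the last action on the success path, a failed import would leave the directory without _TRIGGER, and a coordinator waiting on that flag would not start.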

 

 

 

  

 
