Oozie and data triggers


Hello everyone!

 

I have two coordinators: the first schedules a workflow with several Sqoop import actions, and the second schedules a workflow with transformation actions (Spark and Hive) on the data produced by the first.

 

I need to make the second coordinator data-dependent on the first: it should start only after the first coordinator has finished importing all the data tables.

 

I am trying to understand how coordinator triggers work, but I have some doubts about <output-events> and the <done-flag>.
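From my reading of the docs, my understanding is that a &lt;done-flag&gt; is something Oozie *checks* when a dataset is used as an input dependency, not something it *writes* for output-events. So I imagine the second coordinator would wait on it with something like this (a sketch reusing the dataset from my example below; the event name is just made up):

```xml
<!-- Sketch of the second coordinator's dependency, as I understand the docs.
     Oozie should poll the resolved dataset URI for the _TRIGGER file before
     materializing each action. -->
<datasets>
	<dataset name="dataset1" frequency="${coord:minutes(12)}" initial-instance="2018-01-01T00:00Z" timezone="UTC">
		<uri-template>${nameNode}/user/cloudera/oozie/trigger-app/datasets/${YEAR}-${MONTH}-${DAY}__${HOUR}-${MINUTE}</uri-template>
		<!-- If <done-flag> is omitted, Oozie defaults to _SUCCESS;
		     if it is present but empty, only the directory must exist. -->
		<done-flag>_TRIGGER</done-flag>
	</dataset>
</datasets>
<input-events>
	<data-in name="dataset1_input_event" dataset="dataset1">
		<instance>${coord:current(0)}</instance>
	</data-in>
</input-events>
```

Is that right, i.e. the flag file has to be created by my own workflow rather than by Oozie?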

 

I have built a simple example to see how things work. Basically, it imports two MySQL tables in sequence and should produce dataset instances, using the following files:

 

  coordinator.xml

<coordinator-app
	xmlns="uri:oozie:coordinator:0.2" frequency="${coord:minutes(30)}" name="COORDINATOR_MYSQL" start="${startDate}" end="${endDate}" timezone="UTC">
	<controls>
		<execution>FIFO</execution>
	</controls>
	<datasets>
		<dataset name="dataset1" frequency="${coord:minutes(12)}" initial-instance="2018-01-01T00:00Z" timezone="UTC">
			<uri-template>
				${nameNode}/user/cloudera/oozie/trigger-app/datasets/${YEAR}-${MONTH}-${DAY}__${HOUR}-${MINUTE}
			</uri-template>
			<done-flag>_TRIGGER</done-flag>
		</dataset>
	</datasets>
	<output-events>
		<data-out name="dataset1_output_event" dataset="dataset1">
			<instance>${coord:current(0)}</instance>
		</data-out>
	</output-events>
	<action>
		<workflow>
			<app-path>${nameNode}/user/cloudera/oozie/trigger-app/workflow.xml</app-path>
			<configuration>
				<property>
					<name>jobTracker</name>
					<value>${jobTracker}</value>
				</property>
				<property>
					<name>nameNode</name>
					<value>${nameNode}</value>
				</property>
				<property>
					<name>outputDir</name>
					<value>${coord:dataOut('dataset1_output_event')}</value>
				</property>
			</configuration>
		</workflow>
	</action>
</coordinator-app>

  workflow.xml

<workflow-app name="OOZIE_MYSQL_SQOOP_WF" xmlns="uri:oozie:workflow:0.4">

	<start to="sqoop_action_1" />

	<action name="sqoop_action_1">
		<sqoop
			xmlns="uri:oozie:sqoop-action:0.2">
			<job-tracker>${jobTracker}</job-tracker>
			<name-node>${nameNode}</name-node>
			<prepare>
				<delete path="${outputDir}/categories"/>
			</prepare>
			<arg>import</arg>
			<arg>--connect</arg>
			<arg>jdbc:mysql://localhost/retail_db</arg>
			<arg>--username</arg>
			<arg>root</arg>
			<arg>--password</arg>
			<arg>cloudera</arg>
			<arg>--table</arg>
			<arg>categories</arg>
			<arg>--split-by</arg>
			<arg>category_id</arg>
			<arg>--warehouse-dir</arg>
			<arg>${outputDir}</arg>
			<arg>--num-mappers</arg>
			<arg>1</arg>
		</sqoop>
		<ok to="sqoop_action_2"/>
		<error to="fail"/>
	</action>

	<action name="sqoop_action_2">
		<sqoop
			xmlns="uri:oozie:sqoop-action:0.2">
			<job-tracker>${jobTracker}</job-tracker>
			<name-node>${nameNode}</name-node>
			<prepare>
				<delete path="${outputDir}/products"/>
			</prepare>
			<arg>import</arg>
			<arg>--connect</arg>
			<arg>jdbc:mysql://localhost/retail_db</arg>
			<arg>--username</arg>
			<arg>root</arg>
			<arg>--password</arg>
			<arg>cloudera</arg>
			<arg>--table</arg>
			<arg>products</arg>
			<arg>--split-by</arg>
			<arg>product_id</arg>
			<arg>--warehouse-dir</arg>
			<arg>${outputDir}</arg>
			<arg>--num-mappers</arg>
			<arg>1</arg>
		</sqoop>
		<ok to="success"/>
		<error to="fail"/>
	</action>

	<kill name="fail">
		<message>JOB FAILED!</message>
	</kill>
	<end name="success"/>
</workflow-app>

When I run the coordinator, after the first workflow instance is done, I can see the following results in HDFS:

 

  hdfs dfs -ls -R oozie/trigger-app/datasets

oozie/trigger-app/datasets
oozie/trigger-app/datasets/2018-03-12__22-00
oozie/trigger-app/datasets/2018-03-12__22-00/categories
oozie/trigger-app/datasets/2018-03-12__22-00/categories/_SUCCESS
oozie/trigger-app/datasets/2018-03-12__22-00/categories/part-m-00000
oozie/trigger-app/datasets/2018-03-12__22-00/products
oozie/trigger-app/datasets/2018-03-12__22-00/products/_SUCCESS
oozie/trigger-app/datasets/2018-03-12__22-00/products/part-m-00000

I expected Oozie to produce the specified <done-flag> (_TRIGGER in this case) after both Sqoop actions had completed, but this didn't happen. The _TRIGGER flag is missing; instead there is a _SUCCESS flag inside each of the --warehouse-dir directories.

 

Probably it's not clear to me how the <done-flag> works (and why it is missing here).

 

What I would like to do is generate a single flag (like _TRIGGER, or whatever) only after all the actions of the first workflow have completed, and make the second coordinator dependent on it.

How can I do that?
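One idea I had (I'm not sure it is the intended approach) is to create the flag file myself with a final fs action in the first workflow, something like the following hypothetical action (the action name is made up; touchz should be available in the fs action of recent workflow schemas):

```xml
<!-- Hypothetical final action for the first workflow: after both Sqoop
     imports succeed, create the _TRIGGER flag in the dataset directory.
     sqoop_action_2's <ok> would point here instead of to "success". -->
<action name="create_trigger_flag">
	<fs>
		<touchz path="${outputDir}/_TRIGGER"/>
	</fs>
	<ok to="success"/>
	<error to="fail"/>
</action>
```

Alternatively a shell action running `hdfs dfs -touchz` should have the same effect. Is manually creating the flag the way <done-flag> is meant to be used, or is there a built-in mechanism I'm missing?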