Created on 12-29-2016 02:05 AM
Objective
Loop over a list of primary crime types ("THEFT", "STALKING", "GAMBLING", "DOMESTIC VIOLENCE"). For each crime type, query hive, get the sum of that primary crime type from the crime table, insert the sum into a new table (crimespertype).
This demo uses the oozieloop project written by Jeremy Beard in order to emulate looping in Oozie via sub-workflows. Please see the oozieloop project home page for a detailed explanation of how it works: https://github.com/jeremybeard/oozieloop
IMPORTANT!: This demo requires the crime
database table existing in Hive as created in the first exercise: Hive-2-Action-in-a-Kerberized-cluster
On an edge node containing the oozie client, install git if it doesn't already exist
[root@jyoung-hdp234-1 ~]# yum install git-all
Change users to the oozie user
[root@jyoung-hdp234-1 ~]# su - oozie
Authenticate to the KDC using the oozie service account kerberos keytab
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM
Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie
every time we run the oozie command.
[oozie@jyoung-hdp234-1 ooziedemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie
Git-clone the oozieloop repository
[oozie@jyoung-hdp234-1 ooziedemo]$ git clone https://github.com/jeremybeard/oozieloop.git
Create a local directory to hold app workflow files
[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p sumcrimetypes
Copy the oozieloop xml files to your workflow directory
[oozie@jyoung-hdp234-1 ooziedemo]$ cp oozieloop/*.xml sumcrimetypes/
Create a job properties file. Include a special key-value "loop_list" which will contain the list of values to loop over.
[oozie@jyoung-hdp234-1 ooziedemo]$ cd sumcrimetypes/ [oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sumcrimetypes.properties nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020 jobTracker=jyoung-hdp234-2.openstacklocal:8050 wfDir=${nameNode}/user/${user.name}/ooziedemo/sumcrimetypes oozie.wf.application.path=${wfDir}/loop_sumcrimetypes.xml oozie.use.system.libpath=true loopWorkflowPath=${wfDir}/loop_crime_types.xml loop_parallel=false loop_type=list loop_list=THEFT,STALKING,GAMBLING,DOMESTIC VIOLENCE # Hive2 action sumcrimetypesHiveScript=${wfDir}/sum_crime_types.hql outputHiveDatabase=default jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default jdbcPrincipal=hive/_HOST@EXAMPLE.COM EOF
Replace instances of '/your/path/to/' found in the oozieloop xml files with the loop workflow path variable '${wfDir}'
[oozie@jyoung-hdp234-1 sumcrimetypes]$ find ./ -name "*.xml" -type f -exec sed -i -e 's|/your/path/to/|\${wfDir}/|g' {} \;
Create the workflow that will use oozieloops loops.xml as a sub-workflow to loop over the loop_list property and execute sumcrimetypes.xml workflow for each crime type in the loop_list
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > loop_sumcrimetypes.xml <workflow-app name="loop_sumcrimetypes" xmlns="uri:oozie:workflow:0.4"> <start to="loop"/> <action name="loop"> <sub-workflow> <app-path>${wfDir}/loop.xml</app-path> <propagate-configuration/> <configuration> <property> <name>loop_action</name> <value>${wfDir}/sumcrimetypes.xml</value> </property> <property> <name>loop_name</name> <value>sum_crime_types</value> </property> </configuration> </sub-workflow> <ok to="end"/> <error to="error"/> </action> <kill name="error"> <message>An error occurred whle executing the loop / sum_crime_types sub-workflow! Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> EOF
Create the workflow sumcrimetypes.xml that will be called in the loop. This workflow will execute a hive HQL script on HiveServer2 passing in the primaryCrimeType as a parameter to the hive query
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sumcrimetypes.xml <workflow-app name="sumcrimetypes_${loop_value}" xmlns="uri:oozie:workflow:0.4"> <global> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> </global> <credentials> <credential name="hs2-creds" type="hive2"> <property> <name>hive2.server.principal</name> <value>${jdbcPrincipal}</value> </property> <property> <name>hive2.jdbc.url</name> <value>${jdbcURL}</value> </property> </credential> </credentials> <start to="sumcrimetypes"/> <action name="sumcrimetypes" cred="hs2-creds"> <hive2 xmlns="uri:oozie:hive2-action:0.1"> <jdbc-url>${jdbcURL}</jdbc-url> <script>${sumcrimetypesHiveScript}</script> <param>primaryCrimeType=${loop_value}</param> </hive2> <ok to="end"/> <error to="error"/> </action> <kill name="error"> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> EOF
Create the HQL file sum_crime_types.hql which will contain our parameterized Hive queries for summing the number of records per specified crime type and inserting the result into a new table - crimespertype
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sum_crime_types.hql CREATE TABLE IF NOT EXISTS crimespertype(primary_type STRING, number_of_crimes INT); INSERT INTO crimespertype SELECT '${primaryCrimeType}' AS primary_type, count(*) AS number_of_crimes FROM crime WHERE Primary_Type='${primaryCrimeType}'; EOF
Create an empty lib dir
[oozie@jyoung-hdp234-1 sumcrimetypes]$ mkdir -p lib
Copy the oozie workflow directory to HDFS
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cd ../ [oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal sumcrimetypes /user/oozie/ooziedemo/ [oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -ls -R /user/oozie/ooziedemo/sumcrimetypes drwxr-xr-x - oozie hdfs 0 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/lib -rw-r--r-- 3 oozie hdfs 1861 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop.xml -rw-r--r-- 3 oozie hdfs 4853 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_list_step.xml -rw-r--r-- 3 oozie hdfs 3912 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_range_step.xml -rw-r--r-- 3 oozie hdfs 952 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_sumcrimetypes.xml -rw-r--r-- 3 oozie hdfs 240 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sum_crime_types.hql -rw-r--r-- 3 oozie hdfs 580 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sumcrimetypes.properties -rw-r--r-- 3 oozie hdfs 1026 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sumcrimetypes.xml
Run the oozie job
[oozie@jyoung-hdp234-1 ooziedemo]$ oozie job -run -config sumcrimetypes/sumcrimetypes.properties -verbose -debug -auth kerberos job: 0000059-161213015814745-oozie-oozi-W
Watch the job info and progress
[oozie@jyoung-hdp234-1 ooziedemo]$ watch -d "oozie job -info 0000059-161213015814745-oozie-oozi-W"
Verification
Check the results in HiveServer 2 via beeline
[hive@jyoung-hdp234-2 hive]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select * from crimespertype;" +-----------------------------+---------------------------------+--+ | crimespertype.primary_type | crimespertype.number_of_crimes | +-----------------------------+---------------------------------+--+ | THEFT | 1292228 | | STALKING | 2983 | | GAMBLING | 14035 | | DOMESTIC VIOLENCE | 1 | +-----------------------------+---------------------------------+--+ 4 rows selected (0.258 seconds)