Community Articles

Find and share helpful community-sourced technical articles.
Labels (2)
avatar
Contributor

Objective

Loop over a list of primary crime types ("THEFT", "STALKING", "GAMBLING", "DOMESTIC VIOLENCE"). For each crime type, query hive, get the sum of that primary crime type from the crime table, insert the sum into a new table (crimespertype).

This demo uses the oozieloop project written by Jeremy Beard in order to emulate looping in Oozie via sub-workflows. Please see the oozieloop project home page for a detailed explanation of how it works: https://github.com/jeremybeard/oozieloop

IMPORTANT!: This demo requires the crime database table existing in Hive as created in the first exercise: Hive-2-Action-in-a-Kerberized-cluster

Procedure

On an edge node containing the oozie client, install git if it doesn't already exist

[root@jyoung-hdp234-1 ~]# yum install git-all

Change users to the oozie user

[root@jyoung-hdp234-1 ~]# su - oozie

Authenticate to the KDC using the oozie service account kerberos keytab

[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM

Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.

[oozie@jyoung-hdp234-1 ooziedemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie

Git-clone the oozieloop repository

[oozie@jyoung-hdp234-1 ooziedemo]$ git clone https://github.com/jeremybeard/oozieloop.git

Create a local directory to hold app workflow files

[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p sumcrimetypes

Copy the oozieloop xml files to your workflow directory

[oozie@jyoung-hdp234-1 ooziedemo]$ cp oozieloop/*.xml sumcrimetypes/

Create a job properties file. Include a special key-value "loop_list" which will contain the list of values to loop over.

[oozie@jyoung-hdp234-1 ooziedemo]$ cd sumcrimetypes/
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sumcrimetypes.properties
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
wfDir=${nameNode}/user/${user.name}/ooziedemo/sumcrimetypes
oozie.wf.application.path=${wfDir}/loop_sumcrimetypes.xml
oozie.use.system.libpath=true
loopWorkflowPath=${wfDir}/loop_crime_types.xml
loop_parallel=false
loop_type=list
loop_list=THEFT,STALKING,GAMBLING,DOMESTIC VIOLENCE
# Hive2 action
sumcrimetypesHiveScript=${wfDir}/sum_crime_types.hql
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
EOF

Replace instances of '/your/path/to/' found in the oozieloop xml files with the loop workflow path variable '${wfDir}'

[oozie@jyoung-hdp234-1 sumcrimetypes]$ find ./ -name "*.xml" -type f  -exec sed -i -e 's|/your/path/to/|\${wfDir}/|g' {} \;

Create the workflow that will use oozieloops loops.xml as a sub-workflow to loop over the loop_list property and execute sumcrimetypes.xml workflow for each crime type in the loop_list

[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > loop_sumcrimetypes.xml
<workflow-app name="loop_sumcrimetypes" xmlns="uri:oozie:workflow:0.4">
    <start to="loop"/>

    <action name="loop">
        <sub-workflow>
            <app-path>${wfDir}/loop.xml</app-path>
            <propagate-configuration/>
            <configuration>
                <property>
                    <name>loop_action</name>
                    <value>${wfDir}/sumcrimetypes.xml</value>
                </property>
                <property>
                    <name>loop_name</name>
                    <value>sum_crime_types</value>
                </property>
            </configuration>
        </sub-workflow>
        <ok to="end"/>
        <error to="error"/>
    </action>

    <kill name="error">
        <message>An error occurred whle executing the loop / sum_crime_types sub-workflow! Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>
</workflow-app>
EOF

Create the workflow sumcrimetypes.xml that will be called in the loop. This workflow will execute a hive HQL script on HiveServer2 passing in the primaryCrimeType as a parameter to the hive query

[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sumcrimetypes.xml
<workflow-app name="sumcrimetypes_${loop_value}" xmlns="uri:oozie:workflow:0.4">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>
  <credentials>
    <credential name="hs2-creds" type="hive2">
      <property>
        <name>hive2.server.principal</name>
          <value>${jdbcPrincipal}</value>
      </property>
      <property>
       <name>hive2.jdbc.url</name>
         <value>${jdbcURL}</value>
      </property>
    </credential>
  </credentials>
  <start to="sumcrimetypes"/>
    <action name="sumcrimetypes" cred="hs2-creds">
      <hive2 xmlns="uri:oozie:hive2-action:0.1">
        <jdbc-url>${jdbcURL}</jdbc-url>
        <script>${sumcrimetypesHiveScript}</script>
        <param>primaryCrimeType=${loop_value}</param>
      </hive2>
      <ok to="end"/>
      <error to="error"/>
    </action>
  <kill name="error">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF

Create the HQL file sum_crime_types.hql which will contain our parameterized Hive queries for summing the number of records per specified crime type and inserting the result into a new table - crimespertype

[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sum_crime_types.hql
CREATE TABLE IF NOT EXISTS crimespertype(primary_type STRING, number_of_crimes INT);
INSERT INTO crimespertype SELECT '${primaryCrimeType}' AS primary_type, count(*) AS number_of_crimes FROM crime WHERE Primary_Type='${primaryCrimeType}';
EOF

Create an empty lib dir

[oozie@jyoung-hdp234-1 sumcrimetypes]$ mkdir -p lib

Copy the oozie workflow directory to HDFS

[oozie@jyoung-hdp234-1 sumcrimetypes]$ cd ../
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal sumcrimetypes /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -ls -R /user/oozie/ooziedemo/sumcrimetypes
drwxr-xr-x   - oozie hdfs          0 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/lib
-rw-r--r--   3 oozie hdfs       1861 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop.xml
-rw-r--r--   3 oozie hdfs       4853 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_list_step.xml
-rw-r--r--   3 oozie hdfs       3912 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_range_step.xml
-rw-r--r--   3 oozie hdfs        952 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_sumcrimetypes.xml
-rw-r--r--   3 oozie hdfs        240 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sum_crime_types.hql
-rw-r--r--   3 oozie hdfs        580 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sumcrimetypes.properties
-rw-r--r--   3 oozie hdfs       1026 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sumcrimetypes.xml

Run the oozie job

[oozie@jyoung-hdp234-1 ooziedemo]$ oozie job -run -config sumcrimetypes/sumcrimetypes.properties -verbose -debug -auth kerberos
job: 0000059-161213015814745-oozie-oozi-W

Watch the job info and progress

[oozie@jyoung-hdp234-1 ooziedemo]$ watch -d "oozie job -info 0000059-161213015814745-oozie-oozi-W"

Verification

Check the results in HiveServer 2 via beeline

[hive@jyoung-hdp234-2 hive]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select * from crimespertype;"
+-----------------------------+---------------------------------+--+
| crimespertype.primary_type  | crimespertype.number_of_crimes  |
+-----------------------------+---------------------------------+--+
| THEFT                       | 1292228                         |
| STALKING                    | 2983                            |
| GAMBLING                    | 14035                           |
| DOMESTIC VIOLENCE           | 1                               |
+-----------------------------+---------------------------------+--+
4 rows selected (0.258 seconds)
7,334 Views