Objective

Accept a parameter -DfileType=[csv|tsv] from the Oozie command line and use Oozie's decision node to simulate an if-then-else conditional. If the value of the fileType variable equals tsv, execute a Hive 2 action that runs load_policestationstsv.ddl, which in turn loads the tab-separated-value file policestations.tsv into a Hive table named policestationstsv. Else, if the value of the fileType variable equals csv, execute a Hive 2 action that runs load_policestationscsv.ddl, which in turn loads the comma-separated-value file policestations.csv into a Hive table named policestationscsv. As a preparatory step, these Hive 2 actions drop any pre-existing policestationstsv or policestationscsv tables from Hive each time the workflow runs.
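The branching described above can be sketched in plain shell before any Oozie XML is involved. This is only an illustration of the intended if-then-else logic, not how Oozie evaluates it. The function name choose_ddl is hypothetical, and the fallback to the CSV script for unrecognized values is an assumption borrowed from the default transition in the workflow.xml created later in this article.

```shell
#!/bin/sh
# Hypothetical helper: print the DDL script the workflow would run
# for a given fileType value.
choose_ddl() {
    case "$1" in
        tsv) echo "load_policestationstsv.ddl" ;;
        csv) echo "load_policestationscsv.ddl" ;;
        # Assumption: anything else falls through to the CSV path,
        # like <default to="load-csv"/> in workflow.xml.
        *)   echo "load_policestationscsv.ddl" ;;
    esac
}

choose_ddl tsv   # prints load_policestationstsv.ddl
choose_ddl csv   # prints load_policestationscsv.ddl
```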

Procedure

On an edge node that has the Oozie client installed, switch to the oozie user

[root@jyoung-hdp234-1 ~]# su - oozie

Authenticate to the KDC using the oozie service account's Kerberos keytab

[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM

Create a local directory to hold app workflow files, properties files, Hive DDLs and TSV/CSV data files

[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p decisiondemo
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p {policestationstsv,policestationscsv}

Download the City of Chicago Police Stations data in TSV form.

[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationstsv/
[oozie@jyoung-hdp234-1 policestationstsv]$ curl -L -o policestations.tsv "https://data.cityofchicago.org/api/views/z8bn-74gv/rows.tsv?accessType=DOWNLOAD"
[oozie@jyoung-hdp234-1 policestationstsv]$ head -n 5 policestations.tsv 
DISTRICT  DISTRICT NAME ADDRESS CITY  STATE ZIP WEBSITE PHONE FAX TTY X COORDINATE  Y COORDINATE  LATITUDE  LONGITUDE LOCATION
1 Central 1718 S State St Chicago IL  60616 http://home.chicagopolice.org/community/districts/1st-district-central/ 312-745-4290  312-745-3694  312-745-3693  1176569.052 1891771.704 41.85837259 -87.62735617  (41.8583725929, -87.627356171)
2 Wentworth 5101 S Wentworth Ave  Chicago IL  60609 http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ 312-747-8366  312-747-5396  312-747-6656  1175864.837 1871153.753 41.80181109 -87.63056018  (41.8018110912, -87.6305601801)
3 Grand Crossing  7040 S Cottage Grove Ave  Chicago IL  60637 http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/  312-747-8201  312-747-5479  312-747-9168  1182739.183 1858317.732 41.76643089 -87.60574786  (41.7664308925, -87.6057478606)
4 South Chicago 2255 E 103rd St Chicago IL  60617 http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ 312-747-7581  312-747-5276  312-747-9169  1193131.299 1837090.265 41.70793329 -87.56834912  (41.7079332906, -87.5683491228)

Download the City of Chicago Police Stations data in CSV form.

[oozie@jyoung-hdp234-1 policestationstsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationscsv/
[oozie@jyoung-hdp234-1 policestationscsv]$ curl -L -o policestations.csv "https://data.cityofchicago.org/api/views/z8bn-74gv/rows.csv?accessType=DOWNLOAD"
[oozie@jyoung-hdp234-1 policestationscsv]$ head -n 5 policestations.csv 
DISTRICT,DISTRICT NAME,ADDRESS,CITY,STATE,ZIP,WEBSITE,PHONE,FAX,TTY,X COORDINATE,Y COORDINATE,LATITUDE,LONGITUDE,LOCATION
1,Central,1718 S State St,Chicago,IL,60616,http://home.chicagopolice.org/community/districts/1st-district-central/,312-745-4290,312-745-3694,312-745-3693,1176569.052,1891771.704,41.85837259,-87.62735617,"(41.8583725929, -87.627356171)"
2,Wentworth,5101 S Wentworth Ave,Chicago,IL,60609,http://home.chicagopolice.org/community/districts/2nd-district-wentworth/,312-747-8366,312-747-5396,312-747-6656,1175864.837,1871153.753,41.80181109,-87.63056018,"(41.8018110912, -87.6305601801)"
3,Grand Crossing,7040 S Cottage Grove Ave,Chicago,IL,60637,http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/,312-747-8201,312-747-5479,312-747-9168,1182739.183,1858317.732,41.76643089,-87.60574786,"(41.7664308925, -87.6057478606)"
4,South Chicago,2255 E 103rd St,Chicago,IL,60617,http://home.chicagopolice.org/community/districts/4th-district-south-chicago/,312-747-7581,312-747-5276,312-747-9169,1193131.299,1837090.265,41.70793329,-87.56834912,"(41.7079332906, -87.5683491228)"

Create the SQL DDL script that will create the schema of the policestationstsv Hive table as an external table based on the policestations.tsv TSV file located in HDFS.

[oozie@jyoung-hdp234-1 policestationscsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationstsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationstsv(
    DISTRICT INT,
    DISTRICT_NAME STRING,
    ADDRESS STRING,
    CITY STRING,
    STATE STRING,
    ZIP STRING,
    WEBSITE STRING,
    PHONE STRING,
    FAX STRING,
    TTY STRING,
    X_COORDINATE DOUBLE,
    Y_COORDINATE DOUBLE,
    LATITUDE DOUBLE,
    LONGITUDE DOUBLE,
    LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationstsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF

Create the SQL DDL script that will create the schema of the policestationscsv Hive table as an external table based on the policestations.csv CSV file located in HDFS.

[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationscsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationscsv(
    DISTRICT INT,
    DISTRICT_NAME STRING,
    ADDRESS STRING,
    CITY STRING,
    STATE STRING,
    ZIP STRING,
    WEBSITE STRING,
    PHONE STRING,
    FAX STRING,
    TTY STRING,
    X_COORDINATE DOUBLE,
    Y_COORDINATE DOUBLE,
    LATITUDE DOUBLE,
    LONGITUDE DOUBLE,
    LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationscsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
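
The CSV table uses OpenCSVSerde rather than a plain comma delimiter. One likely reason is visible in the sample rows printed earlier: the LOCATION value is quoted and itself contains a comma. The sketch below, which uses a hypothetical helper named count_fields, shows that naive comma splitting yields one field too many for the first data row.

```shell
#!/bin/sh
# Hypothetical helper: count the fields of one line when splitting
# naively on a single-character delimiter (quotes are NOT honoured).
count_fields() {
    # $1 = delimiter, $2 = the line to split
    printf '%s\n' "$2" | awk -F"$1" '{ print NF }'
}

# First data row of policestations.csv, as printed by head earlier.
row='1,Central,1718 S State St,Chicago,IL,60616,http://home.chicagopolice.org/community/districts/1st-district-central/,312-745-4290,312-745-3694,312-745-3693,1176569.052,1891771.704,41.85837259,-87.62735617,"(41.8583725929, -87.627356171)"'

count_fields , "$row"   # prints 16, although the table has 15 columns
```

A quote-aware parser keeps the quoted LOCATION value together as a single column.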

Create the job.properties file that will contain the configuration properties and variables for the workflow

[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > job.properties
# Job.properties file
# Workflow to run
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
wfDir=${nameNode}/user/${user.name}/ooziedemo/decisiondemo
oozie.wf.application.path=${wfDir}/workflow.xml
oozie.use.system.libpath=true
fileType=csv
# Hive2 action
loadTSVHiveScript=${wfDir}/load_policestationstsv.ddl
loadCSVHiveScript=${wfDir}/load_policestationscsv.ddl
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
EOF
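
job.properties composes oozie.wf.application.path from nameNode and user.name. As a rough check, the same substitution can be reproduced in shell to confirm that the resulting path points at the HDFS directory the files are copied to. This is a sketch only: wf_app_path is a hypothetical helper, and it assumes the job is submitted as the oozie user, as in this walkthrough.

```shell
#!/bin/sh
# Hypothetical helper: expand the workflow path the way job.properties
# composes it from ${nameNode} and ${user.name}.
wf_app_path() {
    # $1 = nameNode, $2 = submitting user (user.name)
    wfDir="$1/user/$2/ooziedemo/decisiondemo"
    printf '%s\n' "$wfDir/workflow.xml"
}

wf_app_path "hdfs://jyoung-hdp234-1.openstacklocal:8020" "oozie"
# prints hdfs://jyoung-hdp234-1.openstacklocal:8020/user/oozie/ooziedemo/decisiondemo/workflow.xml
```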

Create the workflow.xml which will use the decision node to execute a particular Hive DDL script on Hive Server 2 based on whether the fileType variable equals tsv or csv. We're running Hive in a Kerberized environment so we include a credentials section at the top to ensure Oozie's delegation token is issued and used by Hive.

[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > workflow.xml
<workflow-app name="decisionexample" xmlns="uri:oozie:workflow:0.4">
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
    </global>
    <credentials>
        <credential name="hs2-creds" type="hive2">
            <property>
                <name>hive2.server.principal</name>
                <value>${jdbcPrincipal}</value>
            </property>
            <property>
                <name>hive2.jdbc.url</name>
                <value>${jdbcURL}</value>
            </property>
        </credential>
    </credentials>
    <start to="if-filetype"/>
    <decision name="if-filetype">
        <switch>
            <case to="load-tsv">${fileType eq "tsv"}</case>
            <case to="load-csv">${fileType eq "csv"}</case>
            <default to="load-csv"/>
        </switch>
    </decision>
    <action name="load-tsv" cred="hs2-creds">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <jdbc-url>${jdbcURL}</jdbc-url>
            <script>${loadTSVHiveScript}</script>
        </hive2>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <action name="load-csv" cred="hs2-creds">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <jdbc-url>${jdbcURL}</jdbc-url>
            <script>${loadCSVHiveScript}</script>
        </hive2>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="End"/>
</workflow-app>
EOF
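
Because the decision node routes by name, a mistyped transition target only surfaces when the workflow is submitted or run. The following is a minimal local sanity check, not a substitute for schema validation: it lists every to="..." target that has no node with a matching name="..." attribute. The helper name undefined_targets is hypothetical, and the text matching assumes attributes are written with double quotes as in the file above.

```shell
#!/bin/sh
# Hypothetical helper: print transition targets (to="...") that have no
# element with a matching name="..." attribute in the given workflow file.
# Prints nothing when every target resolves.
undefined_targets() {
    wf="$1"
    grep -o ' to="[^"]*"' "$wf" | sed 's/^ to="//; s/"$//' | sort -u |
    while IFS= read -r target; do
        # -F: treat the node name as a fixed string, not a regex.
        grep -qF " name=\"$target\"" "$wf" || printf '%s\n' "$target"
    done
}

# Usage, from the directory that holds the file:
#   undefined_targets workflow.xml
```

For the workflow.xml above, the targets if-filetype, load-tsv, load-csv, End and Kill are all defined, so the check should print nothing.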

Create an empty lib folder to avoid "lib does not exist" errors.

[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p lib

Copy the decisiondemo folder to HDFS

[oozie@jyoung-hdp234-1 decisiondemo]$ cd ../
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal decisiondemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/

Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.

[oozie@jyoung-hdp234-1 decisiondemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie

Run the oozie job passing in -DfileType=tsv to set the value of the fileType property equal to tsv. Afterwards, run the oozie job again passing in -DfileType=csv instead to test out the CSV decision path.

[oozie@jyoung-hdp234-1 decisiondemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos -DfileType=tsv
...
job: 0000101-161213015814745-oozie-oozi-W
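
The run above relies on a -D value overriding the fileType=csv entry in job.properties. That precedence can be sketched in shell as follows. The helper resolve_prop is hypothetical and only mimics the effect for simple key=value lines; it is not how the Oozie client is implemented.

```shell
#!/bin/sh
# Hypothetical helper: resolve one property so that an explicit override
# (standing in for -Dkey=value) wins over the value in the properties file.
resolve_prop() {
    # $1 = key, $2 = properties file, $3 = optional override value
    key="$1"; file="$2"; override="${3-}"
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
    else
        # Skip comment lines and take the last line that assigns the key.
        grep -v '^#' "$file" | sed -n "s/^${key}=//p" | tail -n 1
    fi
}

props=$(mktemp)
printf '# Job.properties file\nfileType=csv\n' > "$props"
resolve_prop fileType "$props"        # prints csv
resolve_prop fileType "$props" tsv    # prints tsv
rm -f "$props"
```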

Watch the job info and progress

[oozie@jyoung-hdp234-1 decisiondemo]$ watch -d "oozie job -info 0000101-161213015814745-oozie-oozi-W"
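
For scripted runs, watching the screen can be replaced by extracting the workflow status from the -info output. The sketch below assumes that output contains a line of the form "Status : <STATE>" and that SUCCEEDED, KILLED and FAILED are the terminal workflow states. Both helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `oozie job -info` text on stdin and print the
# value of the first line that starts with "Status".
wf_status() {
    sed -n 's/^Status[[:space:]]*:[[:space:]]*//p' | sed 's/[[:space:]]*$//' | head -n 1
}

# Hypothetical helper: succeed when the given state is a terminal one.
is_finished() {
    case "$1" in
        SUCCEEDED|KILLED|FAILED) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (requires the oozie client and OOZIE_URL):
#   state=$(oozie job -info 0000101-161213015814745-oozie-oozi-W | wf_status)
#   is_finished "$state" && echo "workflow ended: $state"
```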

Verification

Before running the Oozie job with the -DfileType=tsv command-line argument

[root@jyoung-hdp234-2 ~]# su - hive
[hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------+--+
|   tab_name    |
+---------------+--+
| crime         |
| crimenumbers  |
+---------------+--+
2 rows selected (0.163 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM

After running the Oozie job with the -DfileType=tsv command-line argument

[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationstsv limit 5;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------------+--+
|      tab_name      |
+--------------------+--+
| crime              |
| crimenumbers       |
| policestationstsv  |
+--------------------+--+
3 rows selected (0.138 seconds)
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| policestationstsv.district  | policestationstsv.district_name  | policestationstsv.address  | policestationstsv.city  | policestationstsv.state  | policestationstsv.zip  |                            policestationstsv.website                            | policestationstsv.phone  | policestationstsv.fax  | policestationstsv.tty  | policestationstsv.x_coordinate  | policestationstsv.y_coordinate  | policestationstsv.latitude  | policestationstsv.longitude  |    policestationstsv.location    |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| 1                           | Central                          | 1718 S State St            | Chicago                 | IL                       | 60616                  | http://home.chicagopolice.org/community/districts/1st-district-central/         | 312-745-4290             | 312-745-3694           | 312-745-3693           | 1176569.052                     | 1891771.704                     | 41.85837259                 | -87.62735617                 | (41.8583725929, -87.627356171)   |
| 2                           | Wentworth                        | 5101 S Wentworth Ave       | Chicago                 | IL                       | 60609                  | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/       | 312-747-8366             | 312-747-5396           | 312-747-6656           | 1175864.837                     | 1871153.753                     | 41.80181109                 | -87.63056018                 | (41.8018110912, -87.6305601801)  |
| 3                           | Grand Crossing                   | 7040 S Cottage Grove Ave   | Chicago                 | IL                       | 60637                  | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/  | 312-747-8201             | 312-747-5479           | 312-747-9168           | 1182739.183                     | 1858317.732                     | 41.76643089                 | -87.60574786                 | (41.7664308925, -87.6057478606)  |
| 4                           | South Chicago                    | 2255 E 103rd St            | Chicago                 | IL                       | 60617                  | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/   | 312-747-7581             | 312-747-5276           | 312-747-9169           | 1193131.299                     | 1837090.265                     | 41.70793329                 | -87.56834912                 | (41.7079332906, -87.5683491228)  |
| 5                           | Calumet                          | 727 E 111th St             | Chicago                 | IL                       | 60628                  | http://home.chicagopolice.org/community/districts/5th-district-calumet/         | 312-747-8210             | 312-747-5935           | 312-747-9170           | 1183305.427                     | 1831462.313                     | 41.69272336                 | -87.60450587                 | (41.6927233639, -87.6045058667)  |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
5 rows selected (0.364 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM

After running the Oozie job with the -DfileType=csv command-line argument

[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationscsv limit 5;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------------+--+
|      tab_name      |
+--------------------+--+
| crime              |
| crimenumbers       |
| policestationscsv  |
+--------------------+--+
3 rows selected (0.131 seconds)
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| policestationscsv.district  | policestationscsv.district_name  | policestationscsv.address  | policestationscsv.city  | policestationscsv.state  | policestationscsv.zip  |                            policestationscsv.website                            | policestationscsv.phone  | policestationscsv.fax  | policestationscsv.tty  | policestationscsv.x_coordinate  | policestationscsv.y_coordinate  | policestationscsv.latitude  | policestationscsv.longitude  |    policestationscsv.location    |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| 1                           | Central                          | 1718 S State St            | Chicago                 | IL                       | 60616                  | http://home.chicagopolice.org/community/districts/1st-district-central/         | 312-745-4290             | 312-745-3694           | 312-745-3693           | 1176569.052                     | 1891771.704                     | 41.85837259                 | -87.62735617                 | (41.8583725929, -87.627356171)   |
| 2                           | Wentworth                        | 5101 S Wentworth Ave       | Chicago                 | IL                       | 60609                  | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/       | 312-747-8366             | 312-747-5396           | 312-747-6656           | 1175864.837                     | 1871153.753                     | 41.80181109                 | -87.63056018                 | (41.8018110912, -87.6305601801)  |
| 3                           | Grand Crossing                   | 7040 S Cottage Grove Ave   | Chicago                 | IL                       | 60637                  | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/  | 312-747-8201             | 312-747-5479           | 312-747-9168           | 1182739.183                     | 1858317.732                     | 41.76643089                 | -87.60574786                 | (41.7664308925, -87.6057478606)  |
| 4                           | South Chicago                    | 2255 E 103rd St            | Chicago                 | IL                       | 60617                  | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/   | 312-747-7581             | 312-747-5276           | 312-747-9169           | 1193131.299                     | 1837090.265                     | 41.70793329                 | -87.56834912                 | (41.7079332906, -87.5683491228)  |
| 5                           | Calumet                          | 727 E 111th St             | Chicago                 | IL                       | 60628                  | http://home.chicagopolice.org/community/districts/5th-district-calumet/         | 312-747-8210             | 312-747-5935           | 312-747-9170           | 1183305.427                     | 1831462.313                     | 41.69272336                 | -87.60450587                 | (41.6927233639, -87.6045058667)  |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
5 rows selected (0.116 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM