Created on 12-29-2016 02:10 AM
Accept a parameter -DfileType=[csv|tsv] from the Oozie command line. Use Oozie's decision node functionality to simulate an if-then-else conditional operation. If the value of the fileType variable equals tsv, execute a Hive 2 action that runs the load_policestationstsv.ddl script, which in turn loads the tab-separated-value file policestations.tsv into a Hive table named policestationstsv. Otherwise, if the value of the fileType variable equals csv, execute a Hive 2 action that runs the load_policestationscsv.ddl script, which in turn loads the comma-separated-value file policestations.csv into a Hive table named policestationscsv. As a preparatory step, each Hive 2 action drops any pre-existing policestationstsv and policestationscsv tables from Hive every time this workflow is run.
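The if-then-else behaviour described above can be sketched in plain shell before looking at any Oozie XML. This is only an illustration: choose_action is a hypothetical helper, and the names load-tsv and load-csv are the action names used in the workflow.xml created later in this procedure. The fallback branch mirrors that workflow's default transition, which also routes to load-csv.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mirrors the workflow's decision node in plain shell.
# Cases are checked in order and the first match wins, as in an Oozie <switch>.
choose_action() {
  local file_type="$1"
  case "$file_type" in
    tsv) echo "load-tsv" ;;   # ${fileType eq "tsv"}
    csv) echo "load-csv" ;;   # ${fileType eq "csv"}
    *)   echo "load-csv" ;;   # <default to="load-csv"/>
  esac
}

choose_action tsv
choose_action csv
choose_action anything-else
```

Because the default branch routes to load-csv, any value other than tsv, including a misspelled one, ends up loading the CSV table.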
Procedure
On an edge node containing the oozie client, change users to the oozie user
[root@jyoung-hdp234-1 ~]# su - oozie
Authenticate to the KDC using the oozie service account kerberos keytab
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM
Create a local directory to hold app workflow files, properties files, Hive DDLs and TSV/CSV data files
[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p decisiondemo
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p {policestationstsv,policestationscsv}
Download the City of Chicago Police Stations data in TSV form.
[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationstsv/
[oozie@jyoung-hdp234-1 policestationstsv]$ curl -L -o policestations.tsv https://data.cityofchicago.org/api/views/z8bn-74gv/rows.tsv?accessType=DOWNLOAD
[oozie@jyoung-hdp234-1 policestationstsv]$ head -n 5 policestations.tsv
DISTRICT DISTRICT NAME ADDRESS CITY STATE ZIP WEBSITE PHONE FAX TTY X COORDINATE Y COORDINATE LATITUDE LONGITUDE LOCATION
1 Central 1718 S State St Chicago IL 60616 http://home.chicagopolice.org/community/districts/1st-district-central/ 312-745-4290 312-745-3694 312-745-3693 1176569.052 1891771.704 41.85837259 -87.62735617 (41.8583725929, -87.627356171)
2 Wentworth 5101 S Wentworth Ave Chicago IL 60609 http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ 312-747-8366 312-747-5396 312-747-6656 1175864.837 1871153.753 41.80181109 -87.63056018 (41.8018110912, -87.6305601801)
3 Grand Crossing 7040 S Cottage Grove Ave Chicago IL 60637 http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ 312-747-8201 312-747-5479 312-747-9168 1182739.183 1858317.732 41.76643089 -87.60574786 (41.7664308925, -87.6057478606)
4 South Chicago 2255 E 103rd St Chicago IL 60617 http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ 312-747-7581 312-747-5276 312-747-9169 1193131.299 1837090.265 41.70793329 -87.56834912 (41.7079332906, -87.5683491228)
Download the City of Chicago Police Stations data in CSV form.
[oozie@jyoung-hdp234-1 policestationstsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationscsv/
[oozie@jyoung-hdp234-1 policestationscsv]$ curl -L -o policestations.csv https://data.cityofchicago.org/api/views/z8bn-74gv/rows.csv?accessType=DOWNLOAD
[oozie@jyoung-hdp234-1 policestationscsv]$ head -n 5 policestations.csv
DISTRICT,DISTRICT NAME,ADDRESS,CITY,STATE,ZIP,WEBSITE,PHONE,FAX,TTY,X COORDINATE,Y COORDINATE,LATITUDE,LONGITUDE,LOCATION
1,Central,1718 S State St,Chicago,IL,60616,http://home.chicagopolice.org/community/districts/1st-district-central/,312-745-4290,312-745-3694,312-745-3693,1176569.052,1891771.704,41.85837259,-87.62735617,"(41.8583725929, -87.627356171)"
2,Wentworth,5101 S Wentworth Ave,Chicago,IL,60609,http://home.chicagopolice.org/community/districts/2nd-district-wentworth/,312-747-8366,312-747-5396,312-747-6656,1175864.837,1871153.753,41.80181109,-87.63056018,"(41.8018110912, -87.6305601801)"
3,Grand Crossing,7040 S Cottage Grove Ave,Chicago,IL,60637,http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/,312-747-8201,312-747-5479,312-747-9168,1182739.183,1858317.732,41.76643089,-87.60574786,"(41.7664308925, -87.6057478606)"
4,South Chicago,2255 E 103rd St,Chicago,IL,60617,http://home.chicagopolice.org/community/districts/4th-district-south-chicago/,312-747-7581,312-747-5276,312-747-9169,1193131.299,1837090.265,41.70793329,-87.56834912,"(41.7079332906, -87.5683491228)"
Create the SQL DDL script that will create the schema of the policestationstsv Hive table as an external table based on the policestations.tsv TSV file located in HDFS.
[oozie@jyoung-hdp234-1 policestationscsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationstsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationstsv(
DISTRICT INT,
DISTRICT_NAME STRING,
ADDRESS STRING,
CITY STRING,
STATE STRING,
ZIP STRING,
WEBSITE STRING,
PHONE STRING,
FAX STRING,
TTY STRING,
X_COORDINATE DOUBLE,
Y_COORDINATE DOUBLE,
LATITUDE DOUBLE,
LONGITUDE DOUBLE,
LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationstsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
Create the SQL DDL script that will create the schema of the policestationscsv Hive table as an external table based on the policestations.csv CSV file located in HDFS.
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationscsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationscsv(
DISTRICT INT,
DISTRICT_NAME STRING,
ADDRESS STRING,
CITY STRING,
STATE STRING,
ZIP STRING,
WEBSITE STRING,
PHONE STRING,
FAX STRING,
TTY STRING,
X_COORDINATE DOUBLE,
Y_COORDINATE DOUBLE,
LATITUDE DOUBLE,
LONGITUDE DOUBLE,
LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationscsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
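The CSV table uses OpenCSVSerde where the TSV table uses a plain field delimiter. One plausible reason (an inference from the data, not something stated in the procedure) is that the LOCATION column is quoted and contains a comma. The sketch below takes the header and first data row from the head -n 5 policestations.csv output and counts fields with a naive split on every comma; count_naive is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Header and first data row, copied from the head output of policestations.csv.
header='DISTRICT,DISTRICT NAME,ADDRESS,CITY,STATE,ZIP,WEBSITE,PHONE,FAX,TTY,X COORDINATE,Y COORDINATE,LATITUDE,LONGITUDE,LOCATION'
row='1,Central,1718 S State St,Chicago,IL,60616,http://home.chicagopolice.org/community/districts/1st-district-central/,312-745-4290,312-745-3694,312-745-3693,1176569.052,1891771.704,41.85837259,-87.62735617,"(41.8583725929, -87.627356171)"'

# Hypothetical helper: count fields by splitting on every comma, ignoring quotes.
count_naive() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

header_fields=$(count_naive "$header")
row_fields=$(count_naive "$row")

echo "header: $header_fields fields"            # prints "header: 15 fields"
echo "row, naive split: $row_fields fields"     # prints "row, naive split: 16 fields"
```

The data row yields one field more than the header because the quoted LOCATION value is split in two, which is the case a quote-aware SerDe handles. Note also that the Hive documentation describes OpenCSVSerde as treating every column as a string, so the INT and DOUBLE declarations in the CSV DDL may not take effect as typed columns.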
Create the job.properties file that will contain the configuration properties and variables for the workflow
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > job.properties
# Job.properties file
# Workflow to run
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
wfDir=${nameNode}/user/${user.name}/ooziedemo/decisiondemo
oozie.wf.application.path=${wfDir}/workflow.xml
oozie.use.system.libpath=true
fileType=csv
# Hive2 action
loadTSVHiveScript=${wfDir}/load_policestationstsv.ddl
loadCSVHiveScript=${wfDir}/load_policestationscsv.ddl
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
EOF
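The job.properties file sets fileType=csv, yet the workflow is later run with -DfileType=tsv. This works because a -D property given on the oozie command line overrides the value from the properties file. The sketch below only illustrates that precedence in plain shell; effective_prop is a hypothetical helper, and the properties file is a trimmed stand-in written to a temporary location.

```shell
#!/usr/bin/env bash
# Trimmed stand-in for job.properties, written to a temporary file.
props=$(mktemp)
trap 'rm -f "$props"' EXIT
cat > "$props" <<'EOF'
# Job.properties file
fileType=csv
outputHiveDatabase=default
EOF

# effective_prop FILE KEY [OVERRIDE] (hypothetical helper)
# Prints OVERRIDE when given, otherwise the last value assigned to KEY in FILE.
effective_prop() {
  local file="$1" key="$2" override="${3-}"
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
    return 0
  fi
  # Comment lines never match because their first field is not the bare key.
  awk -F'=' -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); v = $0 } END { if (v != "") print v }' "$file"
}

echo "without -D:          $(effective_prop "$props" fileType)"
echo "with -DfileType=tsv: $(effective_prop "$props" fileType tsv)"
```

Keeping a default in job.properties means the workflow still has a defined fileType when no -D argument is passed.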
Create the workflow.xml, which uses the decision node to execute a particular Hive DDL script on Hive Server 2 depending on whether the fileType variable equals tsv or csv. Because we're running Hive in a Kerberized environment, we include a credentials section at the top so that Oozie obtains a Hive Server 2 delegation token for the Hive 2 actions to use.
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > workflow.xml
<workflow-app name="decisionexample" xmlns="uri:oozie:workflow:0.4">
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
</global>
<credentials>
<credential name="hs2-creds" type="hive2">
<property>
<name>hive2.server.principal</name>
<value>${jdbcPrincipal}</value>
</property>
<property>
<name>hive2.jdbc.url</name>
<value>${jdbcURL}</value>
</property>
</credential>
</credentials>
<start to="if-filetype"/>
<decision name="if-filetype">
<switch>
<case to="load-tsv">${fileType eq "tsv"}</case>
<case to="load-csv">${fileType eq "csv"}</case>
<default to="load-csv"/>
</switch>
</decision>
<action name="load-tsv" cred="hs2-creds">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<jdbc-url>${jdbcURL}</jdbc-url>
<script>${loadTSVHiveScript}</script>
</hive2>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="load-csv" cred="hs2-creds">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<jdbc-url>${jdbcURL}</jdbc-url>
<script>${loadCSVHiveScript}</script>
</hive2>
<ok to="End"/>
<error to="Kill"/>
</action>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="End"/>
</workflow-app>
EOF
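A typo in a transition target (for example to="end" instead of to="End") only surfaces when the job is submitted. As a rough pre-flight check, the sketch below confirms that every to="..." value in a workflow file matches some name="..." value. It is a text-level check under simplifying assumptions, not a replacement for schema validation; check_transitions is a hypothetical helper, and the sample file is a trimmed copy of the workflow without the credentials section.

```shell
#!/usr/bin/env bash
# Trimmed copy of the workflow, written to a temporary file for the demo.
wf=$(mktemp)
trap 'rm -f "$wf"' EXIT
cat > "$wf" <<'EOF'
<workflow-app name="decisionexample" xmlns="uri:oozie:workflow:0.4">
<start to="if-filetype"/>
<decision name="if-filetype">
<switch>
<case to="load-tsv">${fileType eq "tsv"}</case>
<case to="load-csv">${fileType eq "csv"}</case>
<default to="load-csv"/>
</switch>
</decision>
<action name="load-tsv">
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="load-csv">
<ok to="End"/>
<error to="Kill"/>
</action>
<kill name="Kill">
<message>Action failed</message>
</kill>
<end name="End"/>
</workflow-app>
EOF

# check_transitions FILE (hypothetical helper)
# Returns 0 when every to="..." value also appears as a name="..." value.
check_transitions() {
  local wf_file="$1" targets names t missing=0
  # Every transition target: start, case, default, ok and error.
  targets=$(grep -o 'to="[^"]*"' "$wf_file" | sed 's/^to="//; s/"$//' | sort -u)
  # Every name attribute; a superset of the node names.
  names=$(grep -o 'name="[^"]*"' "$wf_file" | sed 's/^name="//; s/"$//' | sort -u)
  for t in $targets; do
    if ! printf '%s\n' "$names" | grep -qxF -- "$t"; then
      echo "undefined transition target: $t"
      missing=1
    fi
  done
  return "$missing"
}

check_transitions "$wf" && echo "all transitions resolve"
```

The check is case-sensitive on purpose, since node names such as End and Kill must match their references exactly.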
Include an empty lib folder to avoid "lib does not exist" errors.
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p lib
Copy the decisiondemo folder to HDFS
[oozie@jyoung-hdp234-1 decisiondemo]$ cd ../
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal decisiondemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/
Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 decisiondemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie
Run the oozie job passing in -DfileType=tsv to set the value of the fileType property equal to tsv. Afterwards, run the oozie job again passing in -DfileType=csv instead to test out the CSV decision path.
[oozie@jyoung-hdp234-1 decisiondemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos -DfileType=tsv
...
job: 0000101-161213015814745-oozie-oozi-W
Watch the job info and progress
[oozie@jyoung-hdp234-1 decisiondemo]$ watch -d "oozie job -info 0000101-161213015814745-oozie-oozi-W"
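For scripted runs, where nobody is looking at the watch screen, the workflow status can be pulled out of the job info text instead. The sketch below assumes the info output contains a line of the form "Status : VALUE"; the sample text is an illustrative stand-in, not output captured from this cluster, and job_status and is_terminal are hypothetical helpers.

```shell
#!/usr/bin/env bash
# job_status (hypothetical helper): reads job info text on stdin and prints
# the value of the first line that starts with "Status" followed by a colon.
job_status() {
  awk -F':' '/^Status[[:space:]]*:/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# is_terminal (hypothetical helper): succeeds for workflow states that will
# not change any further.
is_terminal() {
  case "$1" in
    SUCCEEDED|KILLED|FAILED) return 0 ;;
    *) return 1 ;;
  esac
}

# Illustrative stand-in for part of the job info text (assumed layout).
sample='Workflow Name : decisionexample
Status        : SUCCEEDED'

status=$(printf '%s\n' "$sample" | job_status)
echo "status=$status"
if is_terminal "$status"; then
  echo "job finished"
fi
```

On a real cluster the same function could be fed from the command shown in this step, for example oozie job -info 0000101-161213015814745-oozie-oozi-W | job_status, inside a loop that sleeps until is_terminal succeeds.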
List the Hive tables before running the workflow with the -DfileType=tsv command line argument
[root@jyoung-hdp234-2 ~]# su - hive
[hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------+--+
| tab_name |
+---------------+--+
| crime |
| crimenumbers |
+---------------+--+
2 rows selected (0.163 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
List the Hive tables and query policestationstsv after running the workflow with the -DfileType=tsv command line argument
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationstsv limit 5;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------------+--+
| tab_name |
+--------------------+--+
| crime |
| crimenumbers |
| policestationstsv |
+--------------------+--+
3 rows selected (0.138 seconds)
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| policestationstsv.district | policestationstsv.district_name | policestationstsv.address | policestationstsv.city | policestationstsv.state | policestationstsv.zip | policestationstsv.website | policestationstsv.phone | policestationstsv.fax | policestationstsv.tty | policestationstsv.x_coordinate | policestationstsv.y_coordinate | policestationstsv.latitude | policestationstsv.longitude | policestationstsv.location |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| 1 | Central | 1718 S State St | Chicago | IL | 60616 | http://home.chicagopolice.org/community/districts/1st-district-central/ | 312-745-4290 | 312-745-3694 | 312-745-3693 | 1176569.052 | 1891771.704 | 41.85837259 | -87.62735617 | (41.8583725929, -87.627356171) |
| 2 | Wentworth | 5101 S Wentworth Ave | Chicago | IL | 60609 | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ | 312-747-8366 | 312-747-5396 | 312-747-6656 | 1175864.837 | 1871153.753 | 41.80181109 | -87.63056018 | (41.8018110912, -87.6305601801) |
| 3 | Grand Crossing | 7040 S Cottage Grove Ave | Chicago | IL | 60637 | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ | 312-747-8201 | 312-747-5479 | 312-747-9168 | 1182739.183 | 1858317.732 | 41.76643089 | -87.60574786 | (41.7664308925, -87.6057478606) |
| 4 | South Chicago | 2255 E 103rd St | Chicago | IL | 60617 | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ | 312-747-7581 | 312-747-5276 | 312-747-9169 | 1193131.299 | 1837090.265 | 41.70793329 | -87.56834912 | (41.7079332906, -87.5683491228) |
| 5 | Calumet | 727 E 111th St | Chicago | IL | 60628 | http://home.chicagopolice.org/community/districts/5th-district-calumet/ | 312-747-8210 | 312-747-5935 | 312-747-9170 | 1183305.427 | 1831462.313 | 41.69272336 | -87.60450587 | (41.6927233639, -87.6045058667) |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
5 rows selected (0.364 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
List the Hive tables and query policestationscsv after running the workflow with the -DfileType=csv command line argument
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationscsv limit 5;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------------+--+
| tab_name |
+--------------------+--+
| crime |
| crimenumbers |
| policestationscsv |
+--------------------+--+
3 rows selected (0.131 seconds)
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| policestationscsv.district | policestationscsv.district_name | policestationscsv.address | policestationscsv.city | policestationscsv.state | policestationscsv.zip | policestationscsv.website | policestationscsv.phone | policestationscsv.fax | policestationscsv.tty | policestationscsv.x_coordinate | policestationscsv.y_coordinate | policestationscsv.latitude | policestationscsv.longitude | policestationscsv.location |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| 1 | Central | 1718 S State St | Chicago | IL | 60616 | http://home.chicagopolice.org/community/districts/1st-district-central/ | 312-745-4290 | 312-745-3694 | 312-745-3693 | 1176569.052 | 1891771.704 | 41.85837259 | -87.62735617 | (41.8583725929, -87.627356171) |
| 2 | Wentworth | 5101 S Wentworth Ave | Chicago | IL | 60609 | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ | 312-747-8366 | 312-747-5396 | 312-747-6656 | 1175864.837 | 1871153.753 | 41.80181109 | -87.63056018 | (41.8018110912, -87.6305601801) |
| 3 | Grand Crossing | 7040 S Cottage Grove Ave | Chicago | IL | 60637 | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ | 312-747-8201 | 312-747-5479 | 312-747-9168 | 1182739.183 | 1858317.732 | 41.76643089 | -87.60574786 | (41.7664308925, -87.6057478606) |
| 4 | South Chicago | 2255 E 103rd St | Chicago | IL | 60617 | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ | 312-747-7581 | 312-747-5276 | 312-747-9169 | 1193131.299 | 1837090.265 | 41.70793329 | -87.56834912 | (41.7079332906, -87.5683491228) |
| 5 | Calumet | 727 E 111th St | Chicago | IL | 60628 | http://home.chicagopolice.org/community/districts/5th-district-calumet/ | 312-747-8210 | 312-747-5935 | 312-747-9170 | 1183305.427 | 1831462.313 | 41.69272336 | -87.60450587 | (41.6927233639, -87.6045058667) |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
5 rows selected (0.116 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM