Created on 12-29-2016 02:10 AM
Accept a parameter -DfileType=[csv|tsv] from the Oozie command line. Use Oozie's decision node functionality to simulate an if-then-else conditional operation. If the value of the fileType variable equals tsv, execute a Hive 2 action that runs load_policestationstsv.ddl, which in turn loads the tab-separated-value file policestations.tsv into a Hive table named policestationstsv. Else, if the value of the fileType variable equals csv, execute a Hive 2 action that runs load_policestationscsv.ddl, which in turn loads the comma-separated-value file policestations.csv into a Hive table named policestationscsv. These Hive 2 actions drop any pre-existing policestationstsv or policestationscsv tables from Hive as a preparatory step each time the workflow runs.
Procedure
On an edge node with the Oozie client installed, switch to the oozie user
[root@jyoung-hdp234-1 ~]# su - oozie
Authenticate to the KDC using the oozie service account's Kerberos keytab
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM
Create a local directory to hold app workflow files, properties files, Hive DDLs and TSV/CSV data files
[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p decisiondemo
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p {policestationstsv,policestationscsv}
Download the City of Chicago Police Stations data in TSV form.
[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationstsv/
[oozie@jyoung-hdp234-1 policestationstsv]$ curl -L -o policestations.tsv "https://data.cityofchicago.org/api/views/z8bn-74gv/rows.tsv?accessType=DOWNLOAD"
[oozie@jyoung-hdp234-1 policestationstsv]$ head -n 5 policestations.tsv
DISTRICT	DISTRICT NAME	ADDRESS	CITY	STATE	ZIP	WEBSITE	PHONE	FAX	TTY	X COORDINATE	Y COORDINATE	LATITUDE	LONGITUDE	LOCATION
1	Central	1718 S State St	Chicago	IL	60616	http://home.chicagopolice.org/community/districts/1st-district-central/	312-745-4290	312-745-3694	312-745-3693	1176569.052	1891771.704	41.85837259	-87.62735617	(41.8583725929, -87.627356171)
2	Wentworth	5101 S Wentworth Ave	Chicago	IL	60609	http://home.chicagopolice.org/community/districts/2nd-district-wentworth/	312-747-8366	312-747-5396	312-747-6656	1175864.837	1871153.753	41.80181109	-87.63056018	(41.8018110912, -87.6305601801)
3	Grand Crossing	7040 S Cottage Grove Ave	Chicago	IL	60637	http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/	312-747-8201	312-747-5479	312-747-9168	1182739.183	1858317.732	41.76643089	-87.60574786	(41.7664308925, -87.6057478606)
4	South Chicago	2255 E 103rd St	Chicago	IL	60617	http://home.chicagopolice.org/community/districts/4th-district-south-chicago/	312-747-7581	312-747-5276	312-747-9169	1193131.299	1837090.265	41.70793329	-87.56834912	(41.7079332906, -87.5683491228)
Download the City of Chicago Police Stations data in CSV form.
[oozie@jyoung-hdp234-1 policestationstsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationscsv/
[oozie@jyoung-hdp234-1 policestationscsv]$ curl -L -o policestations.csv "https://data.cityofchicago.org/api/views/z8bn-74gv/rows.csv?accessType=DOWNLOAD"
[oozie@jyoung-hdp234-1 policestationscsv]$ head -n 5 policestations.csv
DISTRICT,DISTRICT NAME,ADDRESS,CITY,STATE,ZIP,WEBSITE,PHONE,FAX,TTY,X COORDINATE,Y COORDINATE,LATITUDE,LONGITUDE,LOCATION
1,Central,1718 S State St,Chicago,IL,60616,http://home.chicagopolice.org/community/districts/1st-district-central/,312-745-4290,312-745-3694,312-745-3693,1176569.052,1891771.704,41.85837259,-87.62735617,"(41.8583725929, -87.627356171)"
2,Wentworth,5101 S Wentworth Ave,Chicago,IL,60609,http://home.chicagopolice.org/community/districts/2nd-district-wentworth/,312-747-8366,312-747-5396,312-747-6656,1175864.837,1871153.753,41.80181109,-87.63056018,"(41.8018110912, -87.6305601801)"
3,Grand Crossing,7040 S Cottage Grove Ave,Chicago,IL,60637,http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/,312-747-8201,312-747-5479,312-747-9168,1182739.183,1858317.732,41.76643089,-87.60574786,"(41.7664308925, -87.6057478606)"
4,South Chicago,2255 E 103rd St,Chicago,IL,60617,http://home.chicagopolice.org/community/districts/4th-district-south-chicago/,312-747-7581,312-747-5276,312-747-9169,1193131.299,1837090.265,41.70793329,-87.56834912,"(41.7079332906, -87.5683491228)"
Create the SQL DDL script that will create the schema of the policestationstsv Hive table as an external table based on the policestations.tsv TSV file located in HDFS.
[oozie@jyoung-hdp234-1 policestationscsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationstsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationstsv(
  DISTRICT INT,
  DISTRICT_NAME STRING,
  ADDRESS STRING,
  CITY STRING,
  STATE STRING,
  ZIP STRING,
  WEBSITE STRING,
  PHONE STRING,
  FAX STRING,
  TTY STRING,
  X_COORDINATE DOUBLE,
  Y_COORDINATE DOUBLE,
  LATITUDE DOUBLE,
  LONGITUDE DOUBLE,
  LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationstsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
Create the SQL DDL script that will create the schema of the policestationscsv Hive table as an external table based on the policestations.csv CSV file located in HDFS.
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationscsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationscsv(
  DISTRICT INT,
  DISTRICT_NAME STRING,
  ADDRESS STRING,
  CITY STRING,
  STATE STRING,
  ZIP STRING,
  WEBSITE STRING,
  PHONE STRING,
  FAX STRING,
  TTY STRING,
  X_COORDINATE DOUBLE,
  Y_COORDINATE DOUBLE,
  LATITUDE DOUBLE,
  LONGITUDE DOUBLE,
  LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationscsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
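Before uploading the files, it can help to confirm that each header row has the 15 columns that both DDL scripts declare. The following is a minimal sketch, not part of the original procedure: count_header_fields and check_header are hypothetical helper names, and the check assumes the header row contains no quoted delimiters, which holds for the headers shown above.

```shell
# Count the fields in a header line for a single-character delimiter.
# Assumption: the header contains no quoted or escaped delimiters.
count_header_fields() {
  local header="$1" delim="$2"
  printf '%s\n' "$header" | awk -F"$delim" '{ print NF }'
}

# Hypothetical helper: compare a file's header row against the
# 15 columns declared in the two Hive DDL scripts.
check_header() {
  local file="$1" delim="$2" expected=15 actual
  actual=$(count_header_fields "$(head -n 1 "$file")" "$delim")
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $file has $actual columns"
  else
    echo "MISMATCH: $file has $actual columns, expected $expected" >&2
    return 1
  fi
}
```

From the decisiondemo directory this could be called as check_header policestationscsv/policestations.csv ',' and, for the TSV file, with a literal tab as the delimiter, for example "$(printf '\t')".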
Create the job.properties file that will contain the configuration properties and variables for the workflow
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > job.properties
# Job.properties file
# Workflow to run
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
wfDir=${nameNode}/user/${user.name}/ooziedemo/decisiondemo
oozie.wf.application.path=${wfDir}/workflow.xml
oozie.use.system.libpath=true
fileType=csv
# Hive2 action
loadTSVHiveScript=${wfDir}/load_policestationstsv.ddl
loadCSVHiveScript=${wfDir}/load_policestationscsv.ddl
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
EOF
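job.properties sets fileType=csv, while the runs later in this article pass -DfileType on the command line. The Oozie client's -D option sets or overrides a property for that submission. The precedence can be sketched in shell as follows. This only illustrates the lookup order and is not how Oozie is implemented; get_property and effective_property are hypothetical names, and the parser handles only simple key=value lines.

```shell
# Read the value of a key from a properties file.
# Assumption: simple key=value lines only; no line continuations or escapes.
get_property() {
  local file="$1" key="$2"
  awk -F'=' -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$file"
}

# Hypothetical helper: a command-line override, when given, wins over the file.
effective_property() {
  local file="$1" key="$2" override="${3:-}"
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  else
    get_property "$file" "$key"
  fi
}
```

With the file above, effective_property job.properties fileType yields csv, while effective_property job.properties fileType tsv yields tsv, mirroring a submission with -DfileType=tsv.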
Create the workflow.xml which will use the decision node to execute a particular Hive DDL script on Hive Server 2 based on whether the fileType variable equals tsv or csv. We're running Hive in a Kerberized environment, so we include a credentials section at the top to ensure Oozie's delegation token is issued and used by Hive.
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > workflow.xml
<workflow-app name="decisionexample" xmlns="uri:oozie:workflow:0.4">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>
  <credentials>
    <credential name="hs2-creds" type="hive2">
      <property>
        <name>hive2.server.principal</name>
        <value>${jdbcPrincipal}</value>
      </property>
      <property>
        <name>hive2.jdbc.url</name>
        <value>${jdbcURL}</value>
      </property>
    </credential>
  </credentials>
  <start to="if-filetype"/>
  <decision name="if-filetype">
    <switch>
      <case to="load-tsv">${fileType eq "tsv"}</case>
      <case to="load-csv">${fileType eq "csv"}</case>
      <default to="load-csv"/>
    </switch>
  </decision>
  <action name="load-tsv" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${loadTSVHiveScript}</script>
    </hive2>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <action name="load-csv" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${loadCSVHiveScript}</script>
    </hive2>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="End"/>
</workflow-app>
EOF
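The decision node evaluates its case predicates in order, takes the first one that is true, and otherwise follows the default transition. A rough shell equivalent of the switch in workflow.xml, using a hypothetical choose_action function, looks like this:

```shell
# Mirror of the <switch> in workflow.xml: cases are tested in order and
# the first match wins; anything else takes the default transition.
choose_action() {
  case "$1" in
    tsv) echo "load-tsv" ;;   # <case to="load-tsv">${fileType eq "tsv"}</case>
    csv) echo "load-csv" ;;   # <case to="load-csv">${fileType eq "csv"}</case>
    *)   echo "load-csv" ;;   # <default to="load-csv"/>
  esac
}

choose_action tsv   # prints load-tsv
choose_action csv   # prints load-csv
```

Because the default transition also points to load-csv, a mistyped value such as tvs would quietly run the CSV load. Pointing the default at the Kill node instead is one option if that behaviour is unwanted.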
Include an empty lib folder to avoid "lib does not exist" errors.
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p lib
Copy the decisiondemo folder to HDFS
[oozie@jyoung-hdp234-1 decisiondemo]$ cd ../
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal decisiondemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/
Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 decisiondemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie
Run the oozie job, passing in -DfileType=tsv to set the value of the fileType property to tsv. Afterwards, run the oozie job again, passing in -DfileType=csv instead, to test the CSV decision path.
[oozie@jyoung-hdp234-1 decisiondemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos -DfileType=tsv
...
job: 0000101-161213015814745-oozie-oozi-W
Watch the job info and progress
[oozie@jyoung-hdp234-1 decisiondemo]$ watch -d "oozie job -info 0000101-161213015814745-oozie-oozi-W"
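Rather than copying the job ID by hand, the ID can be captured from the submit output. The sketch below assumes the ID is printed on a line of the form job: <id>, as in the run above; extract_job_id is a hypothetical helper name.

```shell
# Print the workflow job ID found in `oozie job -run` output read on stdin.
# Assumption: the ID appears on a line that starts with "job: ".
extract_job_id() {
  sed -n 's/^job: *//p' | tail -n 1
}

# Usage sketch, commented out because it needs a live Oozie server:
# job_id=$(oozie job -run -config job.properties -auth kerberos -DfileType=tsv | extract_job_id)
# oozie job -info "$job_id"
```

Taking the last matching line is a defensive choice in case extra output precedes the ID.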
Check the Hive tables before running the workflow with the -DfileType=tsv command line argument
[root@jyoung-hdp234-2 ~]# su - hive
[hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------+--+
| tab_name |
+---------------+--+
| crime |
| crimenumbers |
+---------------+--+
2 rows selected (0.163 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Check the Hive tables after running the workflow with the -DfileType=tsv command line argument
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationstsv limit 5;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------------+--+
| tab_name |
+--------------------+--+
| crime |
| crimenumbers |
| policestationstsv |
+--------------------+--+
3 rows selected (0.138 seconds)
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| policestationstsv.district | policestationstsv.district_name | policestationstsv.address | policestationstsv.city | policestationstsv.state | policestationstsv.zip | policestationstsv.website | policestationstsv.phone | policestationstsv.fax | policestationstsv.tty | policestationstsv.x_coordinate | policestationstsv.y_coordinate | policestationstsv.latitude | policestationstsv.longitude | policestationstsv.location |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| 1 | Central | 1718 S State St | Chicago | IL | 60616 | http://home.chicagopolice.org/community/districts/1st-district-central/ | 312-745-4290 | 312-745-3694 | 312-745-3693 | 1176569.052 | 1891771.704 | 41.85837259 | -87.62735617 | (41.8583725929, -87.627356171) |
| 2 | Wentworth | 5101 S Wentworth Ave | Chicago | IL | 60609 | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ | 312-747-8366 | 312-747-5396 | 312-747-6656 | 1175864.837 | 1871153.753 | 41.80181109 | -87.63056018 | (41.8018110912, -87.6305601801) |
| 3 | Grand Crossing | 7040 S Cottage Grove Ave | Chicago | IL | 60637 | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ | 312-747-8201 | 312-747-5479 | 312-747-9168 | 1182739.183 | 1858317.732 | 41.76643089 | -87.60574786 | (41.7664308925, -87.6057478606) |
| 4 | South Chicago | 2255 E 103rd St | Chicago | IL | 60617 | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ | 312-747-7581 | 312-747-5276 | 312-747-9169 | 1193131.299 | 1837090.265 | 41.70793329 | -87.56834912 | (41.7079332906, -87.5683491228) |
| 5 | Calumet | 727 E 111th St | Chicago | IL | 60628 | http://home.chicagopolice.org/community/districts/5th-district-calumet/ | 312-747-8210 | 312-747-5935 | 312-747-9170 | 1183305.427 | 1831462.313 | 41.69272336 | -87.60450587 | (41.6927233639, -87.6045058667) |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
5 rows selected (0.364 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Check the Hive tables after running the workflow with the -DfileType=csv command line argument
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationscsv limit 5;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------------+--+
| tab_name |
+--------------------+--+
| crime |
| crimenumbers |
| policestationscsv |
+--------------------+--+
3 rows selected (0.131 seconds)
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| policestationscsv.district | policestationscsv.district_name | policestationscsv.address | policestationscsv.city | policestationscsv.state | policestationscsv.zip | policestationscsv.website | policestationscsv.phone | policestationscsv.fax | policestationscsv.tty | policestationscsv.x_coordinate | policestationscsv.y_coordinate | policestationscsv.latitude | policestationscsv.longitude | policestationscsv.location |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| 1 | Central | 1718 S State St | Chicago | IL | 60616 | http://home.chicagopolice.org/community/districts/1st-district-central/ | 312-745-4290 | 312-745-3694 | 312-745-3693 | 1176569.052 | 1891771.704 | 41.85837259 | -87.62735617 | (41.8583725929, -87.627356171) |
| 2 | Wentworth | 5101 S Wentworth Ave | Chicago | IL | 60609 | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ | 312-747-8366 | 312-747-5396 | 312-747-6656 | 1175864.837 | 1871153.753 | 41.80181109 | -87.63056018 | (41.8018110912, -87.6305601801) |
| 3 | Grand Crossing | 7040 S Cottage Grove Ave | Chicago | IL | 60637 | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ | 312-747-8201 | 312-747-5479 | 312-747-9168 | 1182739.183 | 1858317.732 | 41.76643089 | -87.60574786 | (41.7664308925, -87.6057478606) |
| 4 | South Chicago | 2255 E 103rd St | Chicago | IL | 60617 | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ | 312-747-7581 | 312-747-5276 | 312-747-9169 | 1193131.299 | 1837090.265 | 41.70793329 | -87.56834912 | (41.7079332906, -87.5683491228) |
| 5 | Calumet | 727 E 111th St | Chicago | IL | 60628 | http://home.chicagopolice.org/community/districts/5th-district-calumet/ | 312-747-8210 | 312-747-5935 | 312-747-9170 | 1183305.427 | 1831462.313 | 41.69272336 | -87.60450587 | (41.6927233639, -87.6045058667) |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
5 rows selected (0.116 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM