Created on 12-29-2016 01:59 AM
Objective
Use Oozie's Hive 2 action to create a workflow that connects to HiveServer2 in a Kerberized environment. The workflow executes a Hive query script that sums the number of crimes in the crime table for a particular year, passed in as the parameter queryYear in the job.properties file, and writes the result to a new Hive table, crimenumbers.
Procedure
Log into the edge server containing the Oozie client. Switch to the oozie user.
[root@jyoung-hdp234-1 ~]# su - oozie
Authenticate to the KDC using the oozie service account's Kerberos keytab.
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM
Download the City of Chicago crime data in CSV form.
[oozie@jyoung-hdp234-1 ~]$ mkdir -p /tmp/crime
[oozie@jyoung-hdp234-1 ~]$ cd /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ curl -o crime -L "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"
Put the crime CSV into HDFS under the /tmp/crime folder.
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -mkdir -p /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -copyFromLocal crime /tmp/crime/
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -chmod -R 777 /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -ls /tmp/crime
Found 1 items
drwxrwxrwx   - oozie hdfs          0 2016-12-19 08:32 /tmp/crime/crime
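Before building the workflow, it can help to confirm the data actually landed in HDFS; a quick sanity check (assuming the same paths as above) is to print the header row and the first record:

```
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -cat /tmp/crime/crime | head -2
```

The first line should be the CSV column header, which is why the table DDL below sets skip.header.line.count to 1.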
Log into the Hive server. Switch to the hive user.
[root@jyoung-hdp234-2 ~]# su - hive
Authenticate to the KDC using the hive service account's Kerberos keytab.
[hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM
Create the SQL DDL script that defines the crime Hive table as an external table over the crime CSV located in HDFS.
[hive@jyoung-hdp234-2 ~]$ cat << 'EOF' > /tmp/load_crime_table.ddl
CREATE EXTERNAL TABLE IF NOT EXISTS crime(
  ID STRING,
  Case_Number STRING,
  Case_Date STRING,
  Block STRING,
  IUCR INT,
  Primary_Type STRING,
  Description STRING,
  Location_Description STRING,
  Arrest BOOLEAN,
  Domestic BOOLEAN,
  Beat STRING,
  District STRING,
  Ward STRING,
  Community_Area STRING,
  FBI_Code STRING,
  X_Coordinate INT,
  Y_Coordinate INT,
  Case_Year INT,
  Updated_On STRING,
  Latitude DOUBLE,
  Longitude DOUBLE,
  Location STRING)
COMMENT 'This is crime data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/tmp/crime'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
Use beeline to execute the DDL and create the external Hive table.
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -f "/tmp/load_crime_table.ddl"
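Optionally, verify the external table before handing it to Oozie. A quick check (same JDBC URL as above; the count may take a while on a full download of the dataset) is:

```
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
    -e "DESCRIBE crime; SELECT count(*) FROM crime;"
```

If DESCRIBE shows the expected columns and the count is non-zero, the table is ready for the workflow.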
On the Oozie server / edge node, create an Oozie workflow directory for this and future Oozie demo workflows.
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -mkdir -p /user/oozie/ooziedemo
Create a local folder to hold development copies of your Oozie workflow project files
[oozie@jyoung-hdp234-1 crime]$ cd ~/
[oozie@jyoung-hdp234-1 ~]$ mkdir -p ooziedemo/hivedemo/app/lib
[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo/hivedemo
Create the job.properties file that will contain the configuration properties and variables for the workflow.
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > job.properties
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
exampleDir=${nameNode}/user/${user.name}/ooziedemo/hivedemo
oozie.wf.application.path=${exampleDir}/app
oozie.use.system.libpath=true
# Hive2 action
hivescript=${oozie.wf.application.path}/crime_per_year.hql
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
queryYear=2008
EOF
Create the workflow.xml which will execute the HQL script on HiveServer2.
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > app/workflow.xml
<workflow-app name="hivedemo" xmlns="uri:oozie:workflow:0.4">
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
    </global>
    <credentials>
        <credential name="hs2-creds" type="hive2">
            <property>
                <name>hive2.server.principal</name>
                <value>${jdbcPrincipal}</value>
            </property>
            <property>
                <name>hive2.jdbc.url</name>
                <value>${jdbcURL}</value>
            </property>
        </credential>
    </credentials>
    <start to="hive2"/>
    <action name="hive2" cred="hs2-creds">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <jdbc-url>${jdbcURL}</jdbc-url>
            <script>${hivescript}</script>
            <param>queryYear=${queryYear}</param>
        </hive2>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="End"/>
</workflow-app>
EOF
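It can save a failed submission to check the workflow definition before copying it to HDFS. The Oozie client ships a validate subcommand that checks a workflow XML file against the schema; assuming the client is on the path, a local check looks like:

```
[oozie@jyoung-hdp234-1 hivedemo]$ oozie validate app/workflow.xml
```

A schema error here (for example, a typo in an element name or a missing transition) is much easier to fix than the same error surfacing after submission.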
Create the HQL script that will contain the parameterized Hive query to be executed by the workflow
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > app/crime_per_year.hql
CREATE TABLE IF NOT EXISTS crimenumbers(year INT, number_of_crimes INT);
INSERT INTO crimenumbers SELECT ${queryYear} as year, count(*) as number_of_crimes FROM crime WHERE case_date LIKE '%${queryYear}%';
EOF
Copy the hivedemo folder to HDFS.
[oozie@jyoung-hdp234-1 hivedemo]$ cd ~/ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal hivedemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -ls -R /user/oozie/ooziedemo/
drwxr-xr-x   - oozie hdfs          0 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo
drwxr-xr-x   - oozie hdfs          0 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app
-rw-r--r--   3 oozie hdfs        206 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app/crime_per_year.hql
drwxr-xr-x   - oozie hdfs          0 2016-12-19 08:54 /user/oozie/ooziedemo/hivedemo/app/lib
-rw-r--r--   3 oozie hdfs        968 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app/workflow.xml
-rw-r--r--   3 oozie hdfs        452 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/job.properties
Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 hivedemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie
Run the Oozie job.
[oozie@jyoung-hdp234-1 ooziedemo]$ cd hivedemo
[oozie@jyoung-hdp234-1 hivedemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos
...
job: 0000099-161213015814745-oozie-oozi-W
Watch the job info and progress
[oozie@jyoung-hdp234-1 hivedemo]$ watch -d "oozie job -info 0000099-161213015814745-oozie-oozi-W"
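If the action fails or hangs, the Oozie CLI can pull more detail for the workflow ID returned at submission; using the same job ID as above:

```
[oozie@jyoung-hdp234-1 hivedemo]$ oozie job -info 0000099-161213015814745-oozie-oozi-W -verbose
[oozie@jyoung-hdp234-1 hivedemo]$ oozie job -log 0000099-161213015814745-oozie-oozi-W
```

The -info output shows the per-action status and the external (YARN) application ID, and -log streams the Oozie log for the job, which is usually where credential or connection errors against HiveServer2 first show up.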
Verification
Check the results in HiveServer2 via beeline.
[hive@jyoung-hdp234-2 hive]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select * from crimenumbers;"
+--------------------+--------------------------------+--+
| crimenumbers.year  | crimenumbers.number_of_crimes  |
+--------------------+--------------------------------+--+
| 2008               | 426960                         |
+--------------------+--------------------------------+--+
1 row selected (0.286 seconds)
Comment (08-10-2018): I'm pretty sure this is no longer correct. Oozie supports Kerberos delegation for Hive2 actions, though I have yet to get it all working on the latest release of HDP 2.6.