Created on 12-29-2016 01:59 AM
Objective
Use Oozie's Hive 2 action to create a workflow that connects to HiveServer2 in a Kerberized environment. The workflow executes a Hive query script that sums the number of crimes in the crime table for a particular year, passed in as the queryYear parameter in the job.properties file, and writes the results to a new Hive table, crimenumbers.
Procedure
Log into the edge server containing the Oozie client and switch to the oozie user.
[root@jyoung-hdp234-1 ~]# su - oozie
Authenticate to the KDC using the oozie service account Kerberos keytab.
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM
Download the City of Chicago crime data in CSV form.
[oozie@jyoung-hdp234-1 ~]$ mkdir -p /tmp/crime
[oozie@jyoung-hdp234-1 ~]$ cd /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ curl -o crime -L "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"
Put the crime CSV into the /tmp/crime folder in HDFS.
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -mkdir -p /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -copyFromLocal crime /tmp/crime/
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -chmod -R 777 /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -ls /tmp/crime
Found 1 items
drwxrwxrwx   - oozie hdfs          0 2016-12-19 08:32 /tmp/crime/crime
Log into the Hive server and switch to the hive user.
[root@jyoung-hdp234-2 ~]# su - hive
Authenticate to the KDC using the hive service account Kerberos keytab.
[hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM
Create the SQL DDL script that will define the crime Hive table as an external table over the crime CSV in HDFS.
[hive@jyoung-hdp234-2 ~]$ cat << 'EOF' > /tmp/load_crime_table.ddl
CREATE EXTERNAL TABLE IF NOT EXISTS crime(
ID STRING,
Case_Number STRING,
Case_Date STRING,
Block STRING,
IUCR INT,
Primary_Type STRING,
Description STRING,
Location_Description STRING,
Arrest BOOLEAN,
Domestic BOOLEAN,
Beat STRING,
District STRING,
Ward STRING,
Community_Area STRING,
FBI_Code STRING,
X_Coordinate INT,
Y_Coordinate INT,
Case_Year INT,
Updated_On STRING,
Latitude DOUBLE,
Longitude DOUBLE,
Location STRING)
COMMENT 'This is crime data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/tmp/crime'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
Use beeline to execute the DDL and create the external Hive table.
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -f "/tmp/load_crime_table.ddl"
On the Oozie server / edge node, create an HDFS directory for this and future Oozie demo workflows.
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -mkdir -p /user/oozie/ooziedemo
Create a local folder to hold development copies of your Oozie workflow project files
[oozie@jyoung-hdp234-1 crime]$ cd ~/
[oozie@jyoung-hdp234-1 ~]$ mkdir -p ooziedemo/hivedemo/app/lib
[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo/hivedemo
Create the job.properties file that will contain the configuration properties and variables for the workflow
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > job.properties
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
exampleDir=${nameNode}/user/${user.name}/ooziedemo/hivedemo
oozie.wf.application.path=${exampleDir}/app
oozie.use.system.libpath=true
# Hive2 action
hivescript=${oozie.wf.application.path}/crime_per_year.hql
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
queryYear=2008
EOF
Create the workflow.xml which will execute an HQL script on HiveServer2
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > app/workflow.xml
<workflow-app name="hivedemo" xmlns="uri:oozie:workflow:0.4">
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
    </global>
    <credentials>
        <credential name="hs2-creds" type="hive2">
            <property>
                <name>hive2.server.principal</name>
                <value>${jdbcPrincipal}</value>
            </property>
            <property>
                <name>hive2.jdbc.url</name>
                <value>${jdbcURL}</value>
            </property>
        </credential>
    </credentials>
    <start to="hive2"/>
    <action name="hive2" cred="hs2-creds">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <jdbc-url>${jdbcURL}</jdbc-url>
            <script>${hivescript}</script>
            <param>queryYear=${queryYear}</param>
        </hive2>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="End"/>
</workflow-app>
EOF
Create the HQL script that will contain the parameterized Hive query to be executed by the workflow
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > app/crime_per_year.hql
CREATE TABLE IF NOT EXISTS crimenumbers(year INT, number_of_crimes INT);
INSERT INTO crimenumbers SELECT ${queryYear} as year, count(*) as number_of_crimes FROM crime WHERE case_date LIKE '%${queryYear}%';
EOF
Copy the hivedemo folder to HDFS
[oozie@jyoung-hdp234-1 hivedemo]$ cd ~/ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal hivedemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -ls -R /user/oozie/ooziedemo/
drwxr-xr-x   - oozie hdfs          0 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo
drwxr-xr-x   - oozie hdfs          0 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app
-rw-r--r--   3 oozie hdfs        206 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app/crime_per_year.hql
drwxr-xr-x   - oozie hdfs          0 2016-12-19 08:54 /user/oozie/ooziedemo/hivedemo/app/lib
-rw-r--r--   3 oozie hdfs        968 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app/workflow.xml
-rw-r--r--   3 oozie hdfs        452 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/job.properties
Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 hivedemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie
Run the Oozie job
[oozie@jyoung-hdp234-1 ooziedemo]$ cd hivedemo
[oozie@jyoung-hdp234-1 hivedemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos
...
job: 0000099-161213015814745-oozie-oozi-W
Watch the job info and progress
[oozie@jyoung-hdp234-1 hivedemo]$ watch -d "oozie job -info 0000099-161213015814745-oozie-oozi-W"
Verification
Check the results in HiveServer2 via beeline
[hive@jyoung-hdp234-2 hive]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select * from crimenumbers;"
+--------------------+--------------------------------+--+
| crimenumbers.year  | crimenumbers.number_of_crimes  |
+--------------------+--------------------------------+--+
| 2008               | 426960                         |
+--------------------+--------------------------------+--+
1 row selected (0.286 seconds)
Comment (08-10-2018): I'm pretty sure this is no longer correct. Oozie supports Kerberos delegation for Hive2 actions, though I have yet to get it all working on the latest release of HDP 2.6.