Member since 07-12-2016 | 15 Posts | 11 Kudos Received | 2 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5997 | 01-18-2017 08:46 AM |
 | 2495 | 01-09-2017 11:09 PM |
01-18-2017 08:46 AM
1 Kudo
Hello @hardik desai. You can get similar information from the /var/log/hadoop/hdfs/hdfs-audit.log log file.
However, the audit log records NameNode operations rather than shell commands: with trash enabled, rm shows up as a rename into the user's .Trash directory, ls shows up as listStatus, and so on.
Here I create a file /tmp/deleteme.txt, upload it to HDFS, list its parent directory, and delete the file.
[hdfs@jyoung-hdp234-1 ~]$ echo "delete me" >> /tmp/deleteme.txt
[hdfs@jyoung-hdp234-1 ~]$ hdfs dfs -put /tmp/deleteme.txt /tmp/
[hdfs@jyoung-hdp234-1 ~]$ hdfs dfs -ls /tmp
Found 13 items
drwx------ - ambari-qa hdfs 0 2016-12-01 04:19 /tmp/ambari-qa
drwxrwxrwx - oozie hdfs 0 2016-12-19 08:32 /tmp/crime
-rw-r--r-- 3 hdfs hdfs 10 2017-01-18 08:29 /tmp/deleteme.txt
drwxr-xr-x - hdfs hdfs 0 2016-12-01 04:15 /tmp/entity-file-history
drwx-wx-wx - ambari-qa hdfs 0 2016-12-11 18:36 /tmp/hive
-rwxr-xr-x 3 hdfs hdfs 1616 2016-12-01 04:16 /tmp/id1aacdf51_date160116
-rwxr-xr-x 3 hdfs hdfs 1616 2016-12-01 05:19 /tmp/id1aacdf51_date190116
-rwxr-xr-x 3 ambari-qa hdfs 1616 2016-12-01 04:21 /tmp/idtest.ambari-qa.1480566109.56.in
-rwxr-xr-x 3 ambari-qa hdfs 957 2016-12-01 04:21 /tmp/idtest.ambari-qa.1480566109.56.pig
-rwxr-xr-x 3 ambari-qa hdfs 1616 2016-12-01 05:23 /tmp/idtest.ambari-qa.1480569805.86.in
-rwxr-xr-x 3 ambari-qa hdfs 957 2016-12-01 05:23 /tmp/idtest.ambari-qa.1480569805.86.pig
drwxr-xr-x - ambari-qa hdfs 0 2016-12-01 04:19 /tmp/tezsmokeinput
drwxr-xr-x - ambari-qa hdfs 0 2016-12-01 05:21 /tmp/tezsmokeoutput
[hdfs@jyoung-hdp234-1 ~]$ hdfs dfs -rm /tmp/deleteme.txt
17/01/18 08:29:46 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://jyoung-hdp234-1.openstacklocal:8020/tmp/deleteme.txt' to trash at: hdfs://jyoung-hdp234-1.openstacklocal:8020/user/hdfs/.Trash/Current
You can see these operations in my hdfs-audit.log file below:
[root@jyoung-hdp234-1 hdfs]# grep "CLI" /var/log/hadoop/hdfs/hdfs-audit.log
2017-01-18 08:28:29,254 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:13,847 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:13,942 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:13,947 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt._COPYING_ dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:13,990 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=create src=/tmp/deleteme.txt._COPYING_ dst=null perm=hdfs:hdfs:rw-r--r-- proto=rpc callerContext=CLI
2017-01-18 08:29:14,022 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt._COPYING_ dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:14,240 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=rename src=/tmp/deleteme.txt._COPYING_ dst=/tmp/deleteme.txt perm=hdfs:hdfs:rw-r--r-- proto=rpc callerContext=CLI
2017-01-18 08:29:27,129 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:27,247 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=listStatus src=/tmp dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:27,466 INFO FSNamesystem.audit: allowed=true ugi=hbase/jyoung-hdp234-3.openstacklocal@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.225 cmd=listStatus src=/apps/hbase/data/oldWALs dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:27,469 INFO FSNamesystem.audit: allowed=true ugi=hbase/jyoung-hdp234-3.openstacklocal@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.225 cmd=listStatus src=/apps/hbase/data/archive dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,457 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,552 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,570 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,585 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=mkdirs src=/user/hdfs/.Trash/Current/tmp dst=null perm=hdfs:hdfs:rwx------ proto=rpc callerContext=CLI
2017-01-18 08:29:46,590 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/user/hdfs/.Trash/Current/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,595 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=rename src=/tmp/deleteme.txt dst=/user/hdfs/.Trash/Current/tmp/deleteme.txt perm=hdfs:hdfs:rw-r--r-- proto=rpc callerContext=CLI
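If you want to post-process these records rather than read them by eye, here is a minimal Python sketch. It assumes the key=value layout shown in the excerpt above; the helper names and the "rename into .Trash means delete" heuristic are my own assumptions, not part of HDFS.

```python
import re

# One record copied from the audit log excerpt above.
SAMPLE = (
    "2017-01-18 08:29:46,595 INFO FSNamesystem.audit: allowed=true "
    "ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 "
    "cmd=rename src=/tmp/deleteme.txt "
    "dst=/user/hdfs/.Trash/Current/tmp/deleteme.txt "
    "perm=hdfs:hdfs:rw-r--r-- proto=rpc callerContext=CLI"
)

# A value runs up to the next " key=" or the end of the line, so that
# "ugi=... (auth:KERBEROS)" stays in one piece.
FIELD = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_audit_line(line):
    """Split one audit record into a dict of its key=value fields."""
    _, _, body = line.partition("FSNamesystem.audit: ")
    return dict(FIELD.findall(body))


def is_trash_delete(record):
    """Heuristic (an assumption): a rename whose destination is under .Trash."""
    return record.get("cmd") == "rename" and "/.Trash/" in record.get("dst", "")


record = parse_audit_line(SAMPLE)
print(record["cmd"], record["src"], is_trash_delete(record))
```

A delete that bypasses the trash (for example with -skipTrash) would not match this heuristic, so treat it as a starting point only.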
01-10-2017 09:54 PM
I'm happy to hear that worked out for you. Feel free to accept the answer if you're happy with it. Thanks!
01-09-2017 11:09 PM
1 Kudo
Hello. If you're not using off-heap memory (BucketCache), you can try disabling the three configuration properties and one environment variable setting that get added during the Ambari upgrade. Using Ambari, modify your HBase configuration and clear the values of the following properties:
hbase.bucketcache.size
hbase.bucketcache.ioengine
hbase.bucketcache.percentage.in.combinedcache
Then modify the hbase-env template and comment out the line:
export HBASE_REGIONSERVER_OPTS = ... -XX:MaxDirectMemorySize
Finally, restart all affected components.
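If you would rather prepare the edited hbase-env template outside of Ambari and paste it back in, the commenting-out step can be sketched in Python as below. The sample template text and the function name are hypothetical; the real line in your template will carry different options.

```python
def comment_out_direct_memory(template):
    """Prefix '# ' to uncommented HBASE_REGIONSERVER_OPTS lines that set MaxDirectMemorySize."""
    result = []
    for line in template.splitlines():
        is_target = (
            not line.lstrip().startswith("#")
            and "HBASE_REGIONSERVER_OPTS" in line
            and "-XX:MaxDirectMemorySize" in line
        )
        result.append("# " + line if is_target else line)
    return "\n".join(result)


# Hypothetical template excerpt; the real line in your hbase-env will differ.
SAMPLE_TEMPLATE = (
    'export HBASE_MASTER_OPTS="-Xmx1024m"\n'
    'export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=1024m"'
)

edited = comment_out_direct_memory(SAMPLE_TEMPLATE)
print(edited)
```

Running the function a second time leaves the text unchanged, since already-commented lines are skipped.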
12-29-2016 02:10 AM
1 Kudo
Objective
Accept a parameter -DfileType=[csv|tsv] from the Oozie command line. Use Oozie's decision node functionality to simulate an if-then-else conditional operation.
If the value of the fileType variable equals tsv, execute a Hive 2 action that runs load_policestationstsv.ddl, which in turn loads the tab-separated-value file policestations.tsv into a Hive table named policestationstsv.
Else, if the value of the fileType variable equals csv, execute a Hive 2 action that runs load_policestationscsv.ddl, which in turn loads the comma-separated-value file policestations.csv into a Hive table named policestationscsv.
These Hive 2 actions will drop any pre-existing policestationstsv or policestationscsv tables from Hive as a preparatory step each time this workflow is run.
Procedure
On an edge node containing the oozie client, change users to the oozie user
[root@jyoung-hdp234-1 ~]# su - oozie
Authenticate to the KDC using the oozie service account Kerberos keytab
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM
Create a local directory to hold app workflow files, properties files, Hive DDLs and TSV/CSV data files
[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p decisiondemo
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p {policestationstsv,policestationscsv}
Download the City of Chicago Police Stations data in TSV form.
[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationstsv/
[oozie@jyoung-hdp234-1 policestationstsv]$ curl -L -o policestations.tsv https://data.cityofchicago.org/api/views/z8bn-74gv/rows.tsv?accessType=DOWNLOAD
[oozie@jyoung-hdp234-1 policestationstsv]$ head -n 5 policestations.tsv
DISTRICT DISTRICT NAME ADDRESS CITY STATE ZIP WEBSITE PHONE FAX TTY X COORDINATE Y COORDINATE LATITUDE LONGITUDE LOCATION
1 Central 1718 S State St Chicago IL 60616 http://home.chicagopolice.org/community/districts/1st-district-central/ 312-745-4290 312-745-3694 312-745-3693 1176569.052 1891771.704 41.85837259 -87.62735617 (41.8583725929, -87.627356171)
2 Wentworth 5101 S Wentworth Ave Chicago IL 60609 http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ 312-747-8366 312-747-5396 312-747-6656 1175864.837 1871153.753 41.80181109 -87.63056018 (41.8018110912, -87.6305601801)
3 Grand Crossing 7040 S Cottage Grove Ave Chicago IL 60637 http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ 312-747-8201 312-747-5479 312-747-9168 1182739.183 1858317.732 41.76643089 -87.60574786 (41.7664308925, -87.6057478606)
4 South Chicago 2255 E 103rd St Chicago IL 60617 http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ 312-747-7581 312-747-5276 312-747-9169 1193131.299 1837090.265 41.70793329 -87.56834912 (41.7079332906, -87.5683491228)
Download the City of Chicago Police Stations data in CSV form.
[oozie@jyoung-hdp234-1 policestationstsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationscsv/
[oozie@jyoung-hdp234-1 policestationscsv]$ curl -L -o policestations.csv https://data.cityofchicago.org/api/views/z8bn-74gv/rows.csv?accessType=DOWNLOAD
[oozie@jyoung-hdp234-1 policestationscsv]$ head -n 5 policestations.csv
DISTRICT,DISTRICT NAME,ADDRESS,CITY,STATE,ZIP,WEBSITE,PHONE,FAX,TTY,X COORDINATE,Y COORDINATE,LATITUDE,LONGITUDE,LOCATION
1,Central,1718 S State St,Chicago,IL,60616,http://home.chicagopolice.org/community/districts/1st-district-central/,312-745-4290,312-745-3694,312-745-3693,1176569.052,1891771.704,41.85837259,-87.62735617,"(41.8583725929, -87.627356171)"
2,Wentworth,5101 S Wentworth Ave,Chicago,IL,60609,http://home.chicagopolice.org/community/districts/2nd-district-wentworth/,312-747-8366,312-747-5396,312-747-6656,1175864.837,1871153.753,41.80181109,-87.63056018,"(41.8018110912, -87.6305601801)"
3,Grand Crossing,7040 S Cottage Grove Ave,Chicago,IL,60637,http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/,312-747-8201,312-747-5479,312-747-9168,1182739.183,1858317.732,41.76643089,-87.60574786,"(41.7664308925, -87.6057478606)"
4,South Chicago,2255 E 103rd St,Chicago,IL,60617,http://home.chicagopolice.org/community/districts/4th-district-south-chicago/,312-747-7581,312-747-5276,312-747-9169,1193131.299,1837090.265,41.70793329,-87.56834912,"(41.7079332906, -87.5683491228)"
Create the SQL DDL script that will create the schema of the policestationstsv Hive table as an external table based on the policestations.tsv TSV file located in HDFS.
[oozie@jyoung-hdp234-1 policestationscsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationstsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationstsv(
DISTRICT INT,
DISTRICT_NAME STRING,
ADDRESS STRING,
CITY STRING,
STATE STRING,
ZIP STRING,
WEBSITE STRING,
PHONE STRING,
FAX STRING,
TTY STRING,
X_COORDINATE DOUBLE,
Y_COORDINATE DOUBLE,
LATITUDE DOUBLE,
LONGITUDE DOUBLE,
LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationstsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
Create the SQL DDL script that will create the schema of the policestationscsv Hive table as an external table based on the policestations.csv CSV file located in HDFS.
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationscsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationscsv(
DISTRICT INT,
DISTRICT_NAME STRING,
ADDRESS STRING,
CITY STRING,
STATE STRING,
ZIP STRING,
WEBSITE STRING,
PHONE STRING,
FAX STRING,
TTY STRING,
X_COORDINATE DOUBLE,
Y_COORDINATE DOUBLE,
LATITUDE DOUBLE,
LONGITUDE DOUBLE,
LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationscsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
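A quick illustration of why the CSV table uses OpenCSVSerde instead of a plain FIELDS TERMINATED BY ',' clause: the LOCATION column is quoted and contains a comma. The sketch below uses Python's csv module as a stand-in for quote-aware parsing; it is only an analogy and not how Hive implements the SerDe.

```python
import csv

# First data row of policestations.csv, as shown by `head` above.
ROW = (
    "1,Central,1718 S State St,Chicago,IL,60616,"
    "http://home.chicagopolice.org/community/districts/1st-district-central/,"
    "312-745-4290,312-745-3694,312-745-3693,1176569.052,1891771.704,"
    '41.85837259,-87.62735617,"(41.8583725929, -87.627356171)"'
)

# Naive delimiter split: the quoted LOCATION value is cut in two.
naive = ROW.split(",")

# Quote-aware parsing keeps LOCATION as a single field.
parsed = next(csv.reader([ROW]))

print(len(naive), len(parsed))
print(parsed[-1])
```

One thing to keep in mind: as far as I know from the Hive documentation, OpenCSVSerde exposes every column as STRING regardless of the types declared in the DDL, so cast in your queries if you need numeric comparisons.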
Create the job.properties file that will contain the configuration properties and variables for the workflow
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > job.properties
# Job.properties file
# Workflow to run
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
wfDir=${nameNode}/user/${user.name}/ooziedemo/decisiondemo
oozie.wf.application.path=${wfDir}/workflow.xml
oozie.use.system.libpath=true
fileType=csv
# Hive2 action
loadTSVHiveScript=${wfDir}/load_policestationstsv.ddl
loadCSVHiveScript=${wfDir}/load_policestationscsv.ddl
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
EOF
Create the workflow.xml which will use the decision node to execute a particular Hive DDL script on HiveServer2 based on whether the fileType variable equals tsv or csv. We're running Hive in a Kerberized environment, so we include a credentials section at the top to ensure a delegation token is issued to Oozie and used when the Hive 2 actions connect to Hive.
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > workflow.xml
<workflow-app name="decisionexample" xmlns="uri:oozie:workflow:0.4">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>
  <credentials>
    <credential name="hs2-creds" type="hive2">
      <property>
        <name>hive2.server.principal</name>
        <value>${jdbcPrincipal}</value>
      </property>
      <property>
        <name>hive2.jdbc.url</name>
        <value>${jdbcURL}</value>
      </property>
    </credential>
  </credentials>
  <start to="if-filetype"/>
  <decision name="if-filetype">
    <switch>
      <case to="load-tsv">${fileType eq "tsv"}</case>
      <case to="load-csv">${fileType eq "csv"}</case>
      <default to="load-csv"/>
    </switch>
  </decision>
  <action name="load-tsv" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${loadTSVHiveScript}</script>
    </hive2>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <action name="load-csv" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${loadCSVHiveScript}</script>
    </hive2>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="End"/>
</workflow-app>
EOF
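To make the decision node's behaviour concrete, here is a small Python sketch of the switch in workflow.xml: the cases are checked in order, the first one that evaluates to true wins, and the default transition is taken when none match. The function and variable names are mine; this only mimics the semantics and is not Oozie code.

```python
def choose_transition(cases, default, variables):
    """Return the target of the first case whose predicate is true, else the default."""
    for target, predicate in cases:
        if predicate(variables):
            return target
    return default


# Mirrors the two <case> elements and the <default> of the if-filetype node.
CASES = [
    ("load-tsv", lambda v: v.get("fileType") == "tsv"),
    ("load-csv", lambda v: v.get("fileType") == "csv"),
]
DEFAULT = "load-csv"

for file_type in ("tsv", "csv", "json"):
    target = choose_transition(CASES, DEFAULT, {"fileType": file_type})
    print(file_type, "->", target)
```

Note that because the default also points at load-csv, any value other than tsv ends up on the CSV path in this workflow.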
Include an empty lib folder to avoid "lib does not exist" errors.
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p lib
Copy the decisiondemo folder to HDFS
[oozie@jyoung-hdp234-1 decisiondemo]$ cd ../
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal decisiondemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/
Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 decisiondemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie
Run the oozie job passing in -DfileType=tsv to set the value of the fileType property equal to tsv. Afterwards, run the oozie job again passing in -DfileType=csv instead to test out the CSV decision path.
[oozie@jyoung-hdp234-1 decisiondemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos -DfileType=tsv
...
job: 0000101-161213015814745-oozie-oozi-W
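Since job.properties already sets fileType=csv, it may help to see how the -D argument takes precedence over the file. The Python sketch below imitates that merge with two hypothetical helpers; it is a simplified model of the behaviour seen in this post, not the Oozie client's actual code.

```python
def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def apply_overrides(props, cli_args):
    """Return a copy of props with every -Dkey=value argument applied on top."""
    merged = dict(props)
    for arg in cli_args:
        if arg.startswith("-D") and "=" in arg:
            key, _, value = arg[2:].partition("=")
            merged[key] = value
    return merged


# Excerpt of the job.properties file created earlier in this post.
JOB_PROPERTIES = """\
# Job.properties file
oozie.use.system.libpath=true
fileType=csv
"""

effective = apply_overrides(load_properties(JOB_PROPERTIES), ["-run", "-DfileType=tsv"])
print(effective["fileType"])
```

Without the -D argument the value from the file is used, which is why a plain run of this job takes the CSV path.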
Watch the job info and progress
[oozie@jyoung-hdp234-1 decisiondemo]$ watch -d "oozie job -info 0000101-161213015814745-oozie-oozi-W"
Verification
Before running the Oozie job with the -DfileType=tsv command line argument
[root@jyoung-hdp234-2 ~]# su - hive
[hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------+--+
| tab_name |
+---------------+--+
| crime |
| crimenumbers |
+---------------+--+
2 rows selected (0.163 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
After running the Oozie job with the -DfileType=tsv command line argument
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationstsv limit 5;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------------+--+
| tab_name |
+--------------------+--+
| crime |
| crimenumbers |
| policestationstsv |
+--------------------+--+
3 rows selected (0.138 seconds)
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| policestationstsv.district | policestationstsv.district_name | policestationstsv.address | policestationstsv.city | policestationstsv.state | policestationstsv.zip | policestationstsv.website | policestationstsv.phone | policestationstsv.fax | policestationstsv.tty | policestationstsv.x_coordinate | policestationstsv.y_coordinate | policestationstsv.latitude | policestationstsv.longitude | policestationstsv.location |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| 1 | Central | 1718 S State St | Chicago | IL | 60616 | http://home.chicagopolice.org/community/districts/1st-district-central/ | 312-745-4290 | 312-745-3694 | 312-745-3693 | 1176569.052 | 1891771.704 | 41.85837259 | -87.62735617 | (41.8583725929, -87.627356171) |
| 2 | Wentworth | 5101 S Wentworth Ave | Chicago | IL | 60609 | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ | 312-747-8366 | 312-747-5396 | 312-747-6656 | 1175864.837 | 1871153.753 | 41.80181109 | -87.63056018 | (41.8018110912, -87.6305601801) |
| 3 | Grand Crossing | 7040 S Cottage Grove Ave | Chicago | IL | 60637 | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ | 312-747-8201 | 312-747-5479 | 312-747-9168 | 1182739.183 | 1858317.732 | 41.76643089 | -87.60574786 | (41.7664308925, -87.6057478606) |
| 4 | South Chicago | 2255 E 103rd St | Chicago | IL | 60617 | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ | 312-747-7581 | 312-747-5276 | 312-747-9169 | 1193131.299 | 1837090.265 | 41.70793329 | -87.56834912 | (41.7079332906, -87.5683491228) |
| 5 | Calumet | 727 E 111th St | Chicago | IL | 60628 | http://home.chicagopolice.org/community/districts/5th-district-calumet/ | 312-747-8210 | 312-747-5935 | 312-747-9170 | 1183305.427 | 1831462.313 | 41.69272336 | -87.60450587 | (41.6927233639, -87.6045058667) |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
5 rows selected (0.364 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
After running the Oozie job with the -DfileType=csv command line argument
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationscsv limit 5;"
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485)
Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+--------------------+--+
| tab_name |
+--------------------+--+
| crime |
| crimenumbers |
| policestationscsv |
+--------------------+--+
3 rows selected (0.131 seconds)
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| policestationscsv.district | policestationscsv.district_name | policestationscsv.address | policestationscsv.city | policestationscsv.state | policestationscsv.zip | policestationscsv.website | policestationscsv.phone | policestationscsv.fax | policestationscsv.tty | policestationscsv.x_coordinate | policestationscsv.y_coordinate | policestationscsv.latitude | policestationscsv.longitude | policestationscsv.location |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
| 1 | Central | 1718 S State St | Chicago | IL | 60616 | http://home.chicagopolice.org/community/districts/1st-district-central/ | 312-745-4290 | 312-745-3694 | 312-745-3693 | 1176569.052 | 1891771.704 | 41.85837259 | -87.62735617 | (41.8583725929, -87.627356171) |
| 2 | Wentworth | 5101 S Wentworth Ave | Chicago | IL | 60609 | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ | 312-747-8366 | 312-747-5396 | 312-747-6656 | 1175864.837 | 1871153.753 | 41.80181109 | -87.63056018 | (41.8018110912, -87.6305601801) |
| 3 | Grand Crossing | 7040 S Cottage Grove Ave | Chicago | IL | 60637 | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ | 312-747-8201 | 312-747-5479 | 312-747-9168 | 1182739.183 | 1858317.732 | 41.76643089 | -87.60574786 | (41.7664308925, -87.6057478606) |
| 4 | South Chicago | 2255 E 103rd St | Chicago | IL | 60617 | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ | 312-747-7581 | 312-747-5276 | 312-747-9169 | 1193131.299 | 1837090.265 | 41.70793329 | -87.56834912 | (41.7079332906, -87.5683491228) |
| 5 | Calumet | 727 E 111th St | Chicago | IL | 60628 | http://home.chicagopolice.org/community/districts/5th-district-calumet/ | 312-747-8210 | 312-747-5935 | 312-747-9170 | 1183305.427 | 1831462.313 | 41.69272336 | -87.60450587 | (41.6927233639, -87.6045058667) |
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+
5 rows selected (0.116 seconds)
Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive
Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
12-29-2016 02:05 AM
2 Kudos
Objective
Loop over a list of primary crime types ("THEFT", "STALKING", "GAMBLING", "DOMESTIC VIOLENCE"). For each crime type, query Hive, count the records of that primary crime type in the crime table, and insert the count into a new table (crimespertype).
This demo uses the oozieloop project written by Jeremy Beard to emulate looping in Oozie via sub-workflows. Please see the oozieloop project home page for a detailed explanation of how it works: https://github.com/jeremybeard/oozieloop
IMPORTANT: This demo requires the crime table to exist in Hive, as created in the first exercise: Hive-2-Action-in-a-Kerberized-cluster
Procedure
On an edge node containing the oozie client, install git if it isn't already installed
[root@jyoung-hdp234-1 ~]# yum install git-all
Change users to the oozie user
[root@jyoung-hdp234-1 ~]# su - oozie
Authenticate to the KDC using the oozie service account Kerberos keytab
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM
Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 ooziedemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie
Git-clone the oozieloop repository
[oozie@jyoung-hdp234-1 ooziedemo]$ git clone https://github.com/jeremybeard/oozieloop.git
Create a local directory to hold app workflow files
[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p sumcrimetypes
Copy the oozieloop xml files to your workflow directory
[oozie@jyoung-hdp234-1 ooziedemo]$ cp oozieloop/*.xml sumcrimetypes/
Create a job properties file. Include a special key-value pair, "loop_list", which will contain the list of values to loop over.
[oozie@jyoung-hdp234-1 ooziedemo]$ cd sumcrimetypes/
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sumcrimetypes.properties
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
wfDir=${nameNode}/user/${user.name}/ooziedemo/sumcrimetypes
oozie.wf.application.path=${wfDir}/loop_sumcrimetypes.xml
oozie.use.system.libpath=true
loopWorkflowPath=${wfDir}/loop_crime_types.xml
loop_parallel=false
loop_type=list
loop_list=THEFT,STALKING,GAMBLING,DOMESTIC VIOLENCE
# Hive2 action
sumcrimetypesHiveScript=${wfDir}/sum_crime_types.hql
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
EOF
Replace instances of '/your/path/to/' found in the oozieloop xml files with the loop workflow path variable '${wfDir}'
[oozie@jyoung-hdp234-1 sumcrimetypes]$ find ./ -name "*.xml" -type f -exec sed -i -e 's|/your/path/to/|\${wfDir}/|g' {} \;
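If sed is not available, or you want to see which files were touched, the same placeholder replacement can be sketched in Python. The function name and the demo file content below are hypothetical (the demo text is not copied from the oozieloop repository), and the sketch runs against a temporary directory so it is safe to try.

```python
import tempfile
from pathlib import Path

PLACEHOLDER = "/your/path/to/"
REPLACEMENT = "${wfDir}/"


def rewrite_xml_paths(directory):
    """Replace the placeholder in every *.xml file under directory; return changed file names."""
    changed = []
    for path in sorted(Path(directory).rglob("*.xml")):
        if not path.is_file():
            continue
        text = path.read_text()
        if PLACEHOLDER in text:
            path.write_text(text.replace(PLACEHOLDER, REPLACEMENT))
            changed.append(path.name)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file content, used only to demonstrate the replacement.
    demo = Path(tmp) / "demo.xml"
    demo.write_text("<app-path>/your/path/to/step.xml</app-path>\n")
    changed = rewrite_xml_paths(tmp)
    result = demo.read_text()

print(changed, result.strip())
```

To use it for real, call rewrite_xml_paths on the sumcrimetypes directory instead of the temporary one.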
Create the workflow that will use oozieloop's loop.xml as a sub-workflow to loop over the loop_list property and execute the sumcrimetypes.xml workflow for each crime type in the loop_list
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > loop_sumcrimetypes.xml
<workflow-app name="loop_sumcrimetypes" xmlns="uri:oozie:workflow:0.4">
  <start to="loop"/>
  <action name="loop">
    <sub-workflow>
      <app-path>${wfDir}/loop.xml</app-path>
      <propagate-configuration/>
      <configuration>
        <property>
          <name>loop_action</name>
          <value>${wfDir}/sumcrimetypes.xml</value>
        </property>
        <property>
          <name>loop_name</name>
          <value>sum_crime_types</value>
        </property>
      </configuration>
    </sub-workflow>
    <ok to="end"/>
    <error to="error"/>
  </action>
  <kill name="error">
    <message>An error occurred while executing the loop / sum_crime_types sub-workflow! Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF
Create the workflow sumcrimetypes.xml that will be called in the loop. This workflow will execute a Hive HQL script on HiveServer2, passing in the primaryCrimeType as a parameter to the Hive query
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sumcrimetypes.xml
<workflow-app name="sumcrimetypes_${loop_value}" xmlns="uri:oozie:workflow:0.4">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>
  <credentials>
    <credential name="hs2-creds" type="hive2">
      <property>
        <name>hive2.server.principal</name>
        <value>${jdbcPrincipal}</value>
      </property>
      <property>
        <name>hive2.jdbc.url</name>
        <value>${jdbcURL}</value>
      </property>
    </credential>
  </credentials>
  <start to="sumcrimetypes"/>
  <action name="sumcrimetypes" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${sumcrimetypesHiveScript}</script>
      <param>primaryCrimeType=${loop_value}</param>
    </hive2>
    <ok to="end"/>
    <error to="error"/>
  </action>
  <kill name="error">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF
Create the HQL file sum_crime_types.hql, which will contain our parameterized Hive queries for counting the number of records per specified crime type and inserting the result into a new table, crimespertype
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sum_crime_types.hql
CREATE TABLE IF NOT EXISTS crimespertype(primary_type STRING, number_of_crimes INT);
INSERT INTO crimespertype SELECT '${primaryCrimeType}' AS primary_type, count(*) AS number_of_crimes FROM crime WHERE Primary_Type='${primaryCrimeType}';
EOF
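To picture what the loop ends up running, the Python sketch below expands the loop_list value from the properties file into one INSERT statement per crime type, in order, the way a sequential (loop_parallel=false) list loop would feed primaryCrimeType to the script. It only imitates the parameter substitution; the real iteration is done by the oozieloop sub-workflows, and splitting on commas is my assumption based on the property format.

```python
from string import Template

# The INSERT statement from sum_crime_types.hql above.
HQL = (
    "INSERT INTO crimespertype SELECT '${primaryCrimeType}' AS primary_type, "
    "count(*) AS number_of_crimes FROM crime "
    "WHERE Primary_Type='${primaryCrimeType}';"
)

# The loop_list value from sumcrimetypes.properties.
LOOP_LIST = "THEFT,STALKING,GAMBLING,DOMESTIC VIOLENCE"


def render_queries(loop_list, hql):
    """Return one rendered statement per comma-separated loop value, in list order."""
    template = Template(hql)
    return [template.substitute(primaryCrimeType=value) for value in loop_list.split(",")]


queries = render_queries(LOOP_LIST, HQL)
for query in queries:
    print(query)
```

Values containing a space, such as DOMESTIC VIOLENCE, pass through unchanged because the list is split on commas only.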
Create an empty lib dir
[oozie@jyoung-hdp234-1 sumcrimetypes]$ mkdir -p lib
Copy the oozie workflow directory to HDFS
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cd ../
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal sumcrimetypes /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -ls -R /user/oozie/ooziedemo/sumcrimetypes
drwxr-xr-x - oozie hdfs 0 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/lib
-rw-r--r-- 3 oozie hdfs 1861 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop.xml
-rw-r--r-- 3 oozie hdfs 4853 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_list_step.xml
-rw-r--r-- 3 oozie hdfs 3912 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_range_step.xml
-rw-r--r-- 3 oozie hdfs 952 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_sumcrimetypes.xml
-rw-r--r-- 3 oozie hdfs 240 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sum_crime_types.hql
-rw-r--r-- 3 oozie hdfs 580 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sumcrimetypes.properties
-rw-r--r-- 3 oozie hdfs 1026 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sumcrimetypes.xml
Run the Oozie job.
[oozie@jyoung-hdp234-1 ooziedemo]$ oozie job -run -config sumcrimetypes/sumcrimetypes.properties -verbose -debug -auth kerberos
job: 0000059-161213015814745-oozie-oozi-W
Watch the job info and progress.
[oozie@jyoung-hdp234-1 ooziedemo]$ watch -d "oozie job -info 0000059-161213015814745-oozie-oozi-W"
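If you would rather script the wait than sit in front of watch, you can poll the job status instead. This is a rough sketch that assumes the oozie job -info output contains a line of the form "Status : RUNNING", which is what I see on my cluster; job_status and is_terminal are hypothetical helper names, and the job_id variable in the usage comment is a placeholder.

```shell
# Hypothetical helper: print the value of the first "Status : ..." line on stdin.
job_status() {
  awk -F' *: *' '$1 == "Status" { sub(/ +$/, "", $2); print $2; exit }'
}

# Hypothetical helper: succeed when the workflow state is an end state.
is_terminal() {
  case "$1" in
    SUCCEEDED|KILLED|FAILED) return 0 ;;
    *) return 1 ;;
  esac
}

# Example polling loop (job_id is a placeholder for your own job ID):
#   while :; do
#     status=$(oozie job -info "$job_id" | job_status)
#     is_terminal "$status" && break
#     sleep 10
#   done
#   echo "Job finished with status: $status"
```

The loop itself is left commented out so the helpers can be sourced without contacting the Oozie server.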
Verification
Check the results in HiveServer2 via beeline.
[hive@jyoung-hdp234-2 hive]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select * from crimespertype;"
+-----------------------------+---------------------------------+--+
| crimespertype.primary_type | crimespertype.number_of_crimes |
+-----------------------------+---------------------------------+--+
| THEFT | 1292228 |
| STALKING | 2983 |
| GAMBLING | 14035 |
| DOMESTIC VIOLENCE | 1 |
+-----------------------------+---------------------------------+--+
4 rows selected (0.258 seconds)
12-29-2016
01:59 AM
3 Kudos
Objective
Use Oozie's Hive 2 Action to create a workflow that connects to HiveServer2 in a Kerberized environment. The workflow executes a Hive query script that counts the crimes in the crime table for a particular year, passed in as the parameter queryYear in the job.properties file, and writes the result to a new Hive table, crimenumbers.
Procedure
Log into the edge server containing the Oozie client. Change users to the oozie user.
[root@jyoung-hdp234-1 ~]# su - oozie
Authenticate to the KDC using the oozie service account Kerberos keytab.
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM
Download the City of Chicago crime data in CSV form.
[oozie@jyoung-hdp234-1 ~]$ mkdir -p /tmp/crime
[oozie@jyoung-hdp234-1 ~]$ cd /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ curl -o crime -L "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"
Put the crime CSV into the /tmp/crime folder in HDFS.
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -mkdir -p /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -copyFromLocal crime /tmp/crime/
[hdfs@jyoung-hdp234-1 tmp]$ hdfs dfs -chmod -R 777 /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -chmod -R 777 /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -ls /tmp/crime
Found 1 items
drwxrwxrwx - oozie hdfs 0 2016-12-19 08:32 /tmp/crime/crime
Log into the Hive server. Change users to the hive user.
[root@jyoung-hdp234-2 ~]# su - hive
Authenticate to the KDC using the hive service account Kerberos keytab.
[hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM
Create the SQL DDL script that defines the crime Hive table as an external table over the crime CSV located in HDFS.
[hive@jyoung-hdp234-2 ~]$ cat << 'EOF' > /tmp/load_crime_table.ddl
CREATE EXTERNAL TABLE IF NOT EXISTS crime(
ID STRING,
Case_Number STRING,
Case_Date STRING,
Block STRING,
IUCR INT,
Primary_Type STRING,
Description STRING,
Location_Description STRING,
Arrest BOOLEAN,
Domestic BOOLEAN,
Beat STRING,
District STRING,
Ward STRING,
Community_Area STRING,
FBI_Code STRING,
X_Coordinate INT,
Y_Coordinate INT,
Case_Year INT,
Updated_On STRING,
Latitude DOUBLE,
Longitude DOUBLE,
Location STRING)
COMMENT 'This is crime data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/tmp/crime'
TBLPROPERTIES("skip.header.line.count"="1");
EOF
Use beeline to execute the DDL and create the external Hive table.
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -f "/tmp/load_crime_table.ddl"
On the Oozie server / edge node, create an Oozie workflow directory in HDFS for this and future Oozie demo workflows.
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -mkdir -p /user/oozie/ooziedemo
Create a local folder to hold development copies of your Oozie workflow project files.
[oozie@jyoung-hdp234-1 crime]$ cd ~/
[oozie@jyoung-hdp234-1 ~]$ mkdir -p ooziedemo/hivedemo/app/lib
[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo/hivedemo
Create the job.properties file that contains the configuration properties and variables for the workflow.
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > job.properties
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
exampleDir=${nameNode}/user/${user.name}/ooziedemo/hivedemo
oozie.wf.application.path=${exampleDir}/app
oozie.use.system.libpath=true
# Hive2 action
hivescript=${oozie.wf.application.path}/crime_per_year.hql
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
queryYear=2008
EOF
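The workflow.xml for this demo references ${jobTracker}, ${nameNode}, ${jdbcPrincipal}, ${jdbcURL}, ${hivescript} and ${queryYear}, so a missing key in job.properties only surfaces once the job is submitted. A quick local check can catch that earlier. This is a minimal sketch; missing_keys is a hypothetical helper name of my own, not an Oozie command.

```shell
# Hypothetical helper: print every required key that has no "key=value"
# line in the given properties file. The body runs in a subshell so the
# working variables stay local.
missing_keys() (
  file=$1
  shift
  for key in "$@"; do
    # Exact match on the text before the first "=", so dots in key names
    # are not treated as regex wildcards.
    awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$file" \
      || printf '%s\n' "$key"
  done
)

# Example:
#   missing_keys job.properties nameNode jobTracker jdbcURL jdbcPrincipal hivescript queryYear
```

No output means every listed key is present. The check does not verify that the values themselves are correct.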
Create the workflow.xml, which executes an HQL script on HiveServer2.
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > app/workflow.xml
<workflow-app name="hivedemo" xmlns="uri:oozie:workflow:0.4">
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
</global>
<credentials>
<credential name="hs2-creds" type="hive2">
<property>
<name>hive2.server.principal</name>
<value>${jdbcPrincipal}</value>
</property>
<property>
<name>hive2.jdbc.url</name>
<value>${jdbcURL}</value>
</property>
</credential>
</credentials>
<start to="hive2"/>
<action name="hive2" cred="hs2-creds">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<jdbc-url>${jdbcURL}</jdbc-url>
<script>${hivescript}</script>
<param>queryYear=${queryYear}</param>
</hive2>
<ok to="End"/>
<error to="Kill"/>
</action>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="End"/>
</workflow-app>
EOF
Create the HQL script that contains the parameterized Hive query to be executed by the workflow.
[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > app/crime_per_year.hql
CREATE TABLE IF NOT EXISTS crimenumbers(year INT, number_of_crimes INT);
INSERT INTO crimenumbers SELECT ${queryYear} as year, count(*) as number_of_crimes FROM crime WHERE case_date LIKE '%${queryYear}%';
EOF
Copy the hivedemo folder to HDFS.
[oozie@jyoung-hdp234-1 hivedemo]$ cd ~/ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal hivedemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -ls -R /user/oozie/ooziedemo/
drwxr-xr-x - oozie hdfs 0 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo
drwxr-xr-x - oozie hdfs 0 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app
-rw-r--r-- 3 oozie hdfs 206 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app/crime_per_year.hql
drwxr-xr-x - oozie hdfs 0 2016-12-19 08:54 /user/oozie/ooziedemo/hivedemo/app/lib
-rw-r--r-- 3 oozie hdfs 968 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app/workflow.xml
-rw-r--r-- 3 oozie hdfs 452 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/job.properties
Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 hivedemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie
Run the Oozie job.
[oozie@jyoung-hdp234-1 ooziedemo]$ cd hivedemo
[oozie@jyoung-hdp234-1 hivedemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos
...
job: 0000099-161213015814745-oozie-oozi-W
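Since the submit command prints the job ID on a line starting with "job:", you can capture it into a variable instead of copying it by hand. A small sketch follows; extract_job_id is a hypothetical helper name, and it assumes the "job: <id>" line format shown in the output above.

```shell
# Hypothetical helper: print the ID from the first "job: <id>" line on stdin,
# ignoring any other output such as -verbose or -debug messages.
extract_job_id() {
  sed -n 's/^job: *//p' | head -n 1
}

# Example:
#   job_id=$(oozie job -run -config job.properties -auth kerberos | extract_job_id)
#   oozie job -info "$job_id"
```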
Watch the job info and progress.
[oozie@jyoung-hdp234-1 hivedemo]$ watch -d "oozie job -info 0000099-161213015814745-oozie-oozi-W"
Verification
Check the results in HiveServer2 via beeline.
[hive@jyoung-hdp234-2 hive]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select * from crimenumbers;"
+--------------------+--------------------------------+--+
| crimenumbers.year | crimenumbers.number_of_crimes |
+--------------------+--------------------------------+--+
| 2008 | 426960 |
+--------------------+--------------------------------+--+
1 row selected (0.286 seconds)
12-29-2016
01:49 AM
1 Kudo
Here I have created a series of demos highlighting workflows that cover examples of using:
- Hive 2 Action in a Kerberized cluster
- Jeremy Beard's oozieloop project to simulate looping with sub-workflows
- Decision nodes to simulate conditional operators (if-then-else)
The hands-on wikis and source code for these workflow demos are hosted in my GitHub repository at: https://github.com/jlyoung/advancedoozieworkflows