About jyoung

jyoung · 01-18-2017

Hello @hardik desai,

You can get similar information from the /var/log/hadoop/hdfs/hdfs-audit.log log file. However, the audit log records NameNode operations rather than the shell commands themselves: with HDFS trash enabled, rm shows up as a rename (the file is moved into the user's .Trash directory), ls shows up as listStatus, and so on.

Here I create a file /tmp/deleteme.txt, upload it to HDFS, list its parent directory, and delete the file.

[hdfs@jyoung-hdp234-1 ~]$ echo "delete me" >> /tmp/deleteme.txt
[hdfs@jyoung-hdp234-1 ~]$ hdfs dfs -put /tmp/deleteme.txt /tmp/
[hdfs@jyoung-hdp234-1 ~]$ hdfs dfs -ls /tmp
Found 13 items
drwx------ - ambari-qa hdfs 0 2016-12-01 04:19 /tmp/ambari-qa
drwxrwxrwx - oozie hdfs 0 2016-12-19 08:32 /tmp/crime
-rw-r--r-- 3 hdfs hdfs 10 2017-01-18 08:29 /tmp/deleteme.txt
drwxr-xr-x - hdfs hdfs 0 2016-12-01 04:15 /tmp/entity-file-history
drwx-wx-wx - ambari-qa hdfs 0 2016-12-11 18:36 /tmp/hive
-rwxr-xr-x 3 hdfs hdfs 1616 2016-12-01 04:16 /tmp/id1aacdf51_date160116
-rwxr-xr-x 3 hdfs hdfs 1616 2016-12-01 05:19 /tmp/id1aacdf51_date190116
-rwxr-xr-x 3 ambari-qa hdfs 1616 2016-12-01 04:21 /tmp/idtest.ambari-qa.1480566109.56.in
-rwxr-xr-x 3 ambari-qa hdfs 957 2016-12-01 04:21 /tmp/idtest.ambari-qa.1480566109.56.pig
-rwxr-xr-x 3 ambari-qa hdfs 1616 2016-12-01 05:23 /tmp/idtest.ambari-qa.1480569805.86.in
-rwxr-xr-x 3 ambari-qa hdfs 957 2016-12-01 05:23 /tmp/idtest.ambari-qa.1480569805.86.pig
drwxr-xr-x - ambari-qa hdfs 0 2016-12-01 04:19 /tmp/tezsmokeinput
drwxr-xr-x - ambari-qa hdfs 0 2016-12-01 05:21 /tmp/tezsmokeoutput
[hdfs@jyoung-hdp234-1 ~]$ hdfs dfs -rm /tmp/deleteme.txt
17/01/18 08:29:46 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://jyoung-hdp234-1.openstacklocal:8020/tmp/deleteme.txt' to trash at: hdfs://jyoung-hdp234-1.openstacklocal:8020/user/hdfs/.Trash/Current

You can see these operations in my hdfs-audit.log file below:

[root@jyoung-hdp234-1 hdfs]# grep "CLI" /var/log/hadoop/hdfs/hdfs-audit.log
2017-01-18 08:28:29,254 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:13,847 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:13,942 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:13,947 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt._COPYING_ dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:13,990 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=create src=/tmp/deleteme.txt._COPYING_ dst=null perm=hdfs:hdfs:rw-r--r-- proto=rpc callerContext=CLI
2017-01-18 08:29:14,022 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt._COPYING_ dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:14,240 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=rename src=/tmp/deleteme.txt._COPYING_ dst=/tmp/deleteme.txt perm=hdfs:hdfs:rw-r--r-- proto=rpc callerContext=CLI
2017-01-18 08:29:27,129 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:27,247 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=listStatus src=/tmp dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:27,466 INFO FSNamesystem.audit: allowed=true ugi=hbase/jyoung-hdp234-3.openstacklocal@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.225 cmd=listStatus src=/apps/hbase/data/oldWALs dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:27,469 INFO FSNamesystem.audit: allowed=true ugi=hbase/jyoung-hdp234-3.openstacklocal@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.225 cmd=listStatus src=/apps/hbase/data/archive dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,457 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,552 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,570 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,585 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=mkdirs src=/user/hdfs/.Trash/Current/tmp dst=null perm=hdfs:hdfs:rwx------ proto=rpc callerContext=CLI
2017-01-18 08:29:46,590 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=getfileinfo src=/user/hdfs/.Trash/Current/tmp/deleteme.txt dst=null perm=null proto=rpc callerContext=CLI
2017-01-18 08:29:46,595 INFO FSNamesystem.audit: allowed=true ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 cmd=rename src=/tmp/deleteme.txt dst=/user/hdfs/.Trash/Current/tmp/deleteme.txt perm=hdfs:hdfs:rw-r--r-- proto=rpc callerContext=CLI
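To work with these records programmatically, something like the following Python sketch may help. It splits one audit line into its key=value fields and guesses the shell command behind it. The function names parse_audit_line and describe are hypothetical, and the mapping covers only the cases seen in this post (a rename into a .Trash path read as rm, listStatus read as ls). It also assumes paths do not contain text that looks like " key=".

```python
import re

# key=value pairs; a value runs until the next " key=" or the end of the line,
# so the ugi value "user@REALM (auth:KERBEROS)" is kept in one piece.
FIELD_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_audit_line(line):
    """Return the key=value fields of one audit record as a dict (hypothetical helper)."""
    return {key: value for key, value in FIELD_RE.findall(line.strip())}


def describe(entry):
    """Guess the shell command behind a record; only covers cases from this post."""
    cmd = entry.get("cmd")
    dst = entry.get("dst", "null")
    if cmd == "rename" and "/.Trash/" in dst:
        return "rm (moved to trash)"
    if cmd == "listStatus":
        return "ls"
    return cmd


sample = (
    "2017-01-18 08:29:46,595 INFO FSNamesystem.audit: allowed=true "
    "ugi=hdfs-cluster1@EXAMPLE.COM (auth:KERBEROS) ip=/172.26.81.223 "
    "cmd=rename src=/tmp/deleteme.txt "
    "dst=/user/hdfs/.Trash/Current/tmp/deleteme.txt "
    "perm=hdfs:hdfs:rw-r--r-- proto=rpc callerContext=CLI"
)

entry = parse_audit_line(sample)
print(entry["ugi"], entry["src"], describe(entry))
```

The same two functions could be applied to every line of the grep output above to summarize who ran what against which path.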

jyoung · 01-10-2017

I'm happy to hear that worked out for you. Feel free to accept the answer if you're happy with it. Thanks!

jyoung · 01-09-2017

Hello. If you're not using off-heap memory (BucketCache), you can try disabling the three configuration properties and one environment variable setting that get added during the Ambari upgrade.

Using Ambari, modify your HBase configuration and blank the following:
hbase.bucketcache.size
hbase.bucketcache.ioengine
hbase.bucketcache.percentage.in.combinedcache

Modify the hbase-env template and comment out the line:
export HBASE_REGIONSERVER_OPTS = ... -XX:MaxDirectMemorySize

Restart all affected

jyoung · 12-29-2016

Objective

Accept a parameter -DfileType=[csv|tsv] from the Oozie command line. Use Oozie's decision node functionality to simulate an if-then-else conditional operation.

If the value of the fileType variable equals tsv, execute a Hive 2 action that runs load_policestationstsv.ddl, which in turn loads the tab-separated-value file policestations.tsv into a Hive table named policestationstsv.

Else, if the value of the fileType variable equals csv, execute a Hive 2 action that runs load_policestationscsv.ddl, which in turn loads the comma-separated-value file policestations.csv into a Hive table named policestationscsv.

These Hive 2 actions will drop any pre-existing policestationstsv or policestationscsv tables from Hive as a preparatory step each time this workflow is run.

Procedure

On an edge node containing the oozie client, change users to the oozie user
[root@jyoung-hdp234-1 ~]# su - oozie

Authenticate to the KDC using the oozie service account kerberos keytab
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM

Create a local directory to hold app workflow files, properties files, Hive DDLs and TSV/CSV data files
[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p decisiondemo
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p {policestationstsv,policestationscsv}

Download the City of Chicago Police Stations data in TSV form.
[oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationstsv/ [oozie@jyoung-hdp234-1 policestationstsv]$ curl -L -o policestations.tsv https://data.cityofchicago.org/api/views/z8bn-74gv/rows.tsv?accessType=DOWNLOAD [oozie@jyoung-hdp234-1 policestationstsv]$ head -n 5 policestations.tsv DISTRICT DISTRICT NAME ADDRESS CITY STATE ZIP WEBSITE PHONE FAX TTY X COORDINATE Y COORDINATE LATITUDE LONGITUDE LOCATION 1 Central 1718 S State St Chicago IL 60616 http://home.chicagopolice.org/community/districts/1st-district-central/ 312-745-4290 312-745-3694 312-745-3693 1176569.052 1891771.704 41.85837259 -87.62735617 (41.8583725929, -87.627356171) 2 Wentworth 5101 S Wentworth Ave Chicago IL 60609 http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ 312-747-8366 312-747-5396 312-747-6656 1175864.837 1871153.753 41.80181109 -87.63056018 (41.8018110912, -87.6305601801) 3 Grand Crossing 7040 S Cottage Grove Ave Chicago IL 60637 http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ 312-747-8201 312-747-5479 312-747-9168 1182739.183 1858317.732 41.76643089 -87.60574786 (41.7664308925, -87.6057478606) 4 South Chicago 2255 E 103rd St Chicago IL 60617 http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ 312-747-7581 312-747-5276 312-747-9169 1193131.299 1837090.265 41.70793329 -87.56834912 (41.7079332906, -87.5683491228) Download the City of Chicago Police Stations data in CSV form. 
[oozie@jyoung-hdp234-1 policestationstsv]$ cd ../ [oozie@jyoung-hdp234-1 decisiondemo]$ cd policestationscsv/ [oozie@jyoung-hdp234-1 policestationscsv]$ curl -L -o policestations.csv https://data.cityofchicago.org/api/views/z8bn-74gv/rows.csv?accessType=DOWNLOAD [oozie@jyoung-hdp234-1 policestationscsv]$ head -n 5 policestations.csv DISTRICT,DISTRICT NAME,ADDRESS,CITY,STATE,ZIP,WEBSITE,PHONE,FAX,TTY,X COORDINATE,Y COORDINATE,LATITUDE,LONGITUDE,LOCATION 1,Central,1718 S State St,Chicago,IL,60616,http://home.chicagopolice.org/community/districts/1st-district-central/,312-745-4290,312-745-3694,312-745-3693,1176569.052,1891771.704,41.85837259,-87.62735617,"(41.8583725929, -87.627356171)" 2,Wentworth,5101 S Wentworth Ave,Chicago,IL,60609,http://home.chicagopolice.org/community/districts/2nd-district-wentworth/,312-747-8366,312-747-5396,312-747-6656,1175864.837,1871153.753,41.80181109,-87.63056018,"(41.8018110912, -87.6305601801)" 3,Grand Crossing,7040 S Cottage Grove Ave,Chicago,IL,60637,http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/,312-747-8201,312-747-5479,312-747-9168,1182739.183,1858317.732,41.76643089,-87.60574786,"(41.7664308925, -87.6057478606)" 4,South Chicago,2255 E 103rd St,Chicago,IL,60617,http://home.chicagopolice.org/community/districts/4th-district-south-chicago/,312-747-7581,312-747-5276,312-747-9169,1193131.299,1837090.265,41.70793329,-87.56834912,"(41.7079332906, -87.5683491228)" Create the SQL DDL script that will create the schema of the policestationstsv Hive table as an external table based on the policestations.tsv TSV file located in HDFS. 
[oozie@jyoung-hdp234-1 policestationscsv]$ cd ../
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationstsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationstsv(
  DISTRICT INT,
  DISTRICT_NAME STRING,
  ADDRESS STRING,
  CITY STRING,
  STATE STRING,
  ZIP STRING,
  WEBSITE STRING,
  PHONE STRING,
  FAX STRING,
  TTY STRING,
  X_COORDINATE DOUBLE,
  Y_COORDINATE DOUBLE,
  LATITUDE DOUBLE,
  LONGITUDE DOUBLE,
  LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationstsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF

Create the SQL DDL script that will create the schema of the policestationscsv Hive table as an external table based on the policestations.csv CSV file located in HDFS.
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > load_policestationscsv.ddl
DROP TABLE policestationstsv;
DROP TABLE policestationscsv;
CREATE EXTERNAL TABLE IF NOT EXISTS policestationscsv(
  DISTRICT INT,
  DISTRICT_NAME STRING,
  ADDRESS STRING,
  CITY STRING,
  STATE STRING,
  ZIP STRING,
  WEBSITE STRING,
  PHONE STRING,
  FAX STRING,
  TTY STRING,
  X_COORDINATE DOUBLE,
  Y_COORDINATE DOUBLE,
  LATITUDE DOUBLE,
  LONGITUDE DOUBLE,
  LOCATION STRING)
COMMENT 'This is police station data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/oozie/ooziedemo/decisiondemo/policestationscsv'
TBLPROPERTIES("skip.header.line.count"="1");
EOF

Create the job.properties file that will contain the configuration properties and variables for the workflow
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > job.properties
# Job.properties file
# Workflow to run
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
wfDir=${nameNode}/user/${user.name}/ooziedemo/decisiondemo
oozie.wf.application.path=${wfDir}/workflow.xml
oozie.use.system.libpath=true
fileType=csv
# Hive2 action
loadTSVHiveScript=${wfDir}/load_policestationstsv.ddl
loadCSVHiveScript=${wfDir}/load_policestationscsv.ddl
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
EOF

Create the workflow.xml which will use the decision node to execute a particular Hive DDL script on Hive Server 2 based on whether the fileType variable equals tsv or csv. We're running Hive in a Kerberized environment so we include a credentials section at the top to ensure Oozie's delegation token is issued and used by Hive.
[oozie@jyoung-hdp234-1 decisiondemo]$ cat << 'EOF' > workflow.xml
<workflow-app name="decisionexample" xmlns="uri:oozie:workflow:0.4">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>
  <credentials>
    <credential name="hs2-creds" type="hive2">
      <property>
        <name>hive2.server.principal</name>
        <value>${jdbcPrincipal}</value>
      </property>
      <property>
        <name>hive2.jdbc.url</name>
        <value>${jdbcURL}</value>
      </property>
    </credential>
  </credentials>
  <start to="if-filetype"/>
  <decision name="if-filetype">
    <switch>
      <case to="load-tsv">${fileType eq "tsv"}</case>
      <case to="load-csv">${fileType eq "csv"}</case>
      <default to="load-csv"/>
    </switch>
  </decision>
  <action name="load-tsv" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${loadTSVHiveScript}</script>
    </hive2>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <action name="load-csv" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${loadCSVHiveScript}</script>
    </hive2>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="End"/>
</workflow-app>
EOF

Include an empty lib folder to avoid lib does not exist errors.
[oozie@jyoung-hdp234-1 decisiondemo]$ mkdir -p lib

Copy the decisiondemo folder to HDFS
[oozie@jyoung-hdp234-1 decisiondemo]$ cd ../
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal decisiondemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ cd decisiondemo/

Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 decisiondemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie

Run the oozie job passing in -DfileType=tsv to set the value of the fileType property equal to tsv.
Afterwards, run the oozie job again passing in -DfileType=csv instead to test out the CSV decision path. [oozie@jyoung-hdp234-1 decisiondemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos -DfileType=tsv ... job: 0000101-161213015814745-oozie-oozi-W Watch the job info and progress [oozie@jyoung-hdp234-1 decisiondemo]$ watch -d "oozie job -info 0000101-161213015814745-oozie-oozi-W" Verification Before running Oozie job with -DfileType=tsv command line argument [root@jyoung-hdp234-2 ~]# su - hive [hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM [hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables;" WARNING: Use "yarn jar" to launch YARN applications. Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485) Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485) Transaction isolation: TRANSACTION_REPEATABLE_READ +---------------+--+ | tab_name | +---------------+--+ | crime | | crimenumbers | +---------------+--+ 2 rows selected (0.163 seconds) Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM After running Oozie job with -DfileType=tsv command line argument [hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationstsv limit 5;" WARNING: Use "yarn jar" to launch YARN applications. 
Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485) Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485) Transaction isolation: TRANSACTION_REPEATABLE_READ +--------------------+--+ | tab_name | +--------------------+--+ | crime | | crimenumbers | | policestationstsv | +--------------------+--+ 3 rows selected (0.138 seconds) +-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+ | policestationstsv.district | policestationstsv.district_name | policestationstsv.address | policestationstsv.city | policestationstsv.state | policestationstsv.zip | policestationstsv.website | policestationstsv.phone | policestationstsv.fax | policestationstsv.tty | policestationstsv.x_coordinate | policestationstsv.y_coordinate | policestationstsv.latitude | policestationstsv.longitude | policestationstsv.location | +-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+ | 1 | Central | 1718 S State St | Chicago | IL | 60616 | http://home.chicagopolice.org/community/districts/1st-district-central/ | 312-745-4290 | 
312-745-3694 | 312-745-3693 | 1176569.052 | 1891771.704 | 41.85837259 | -87.62735617 | (41.8583725929, -87.627356171) | | 2 | Wentworth | 5101 S Wentworth Ave | Chicago | IL | 60609 | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ | 312-747-8366 | 312-747-5396 | 312-747-6656 | 1175864.837 | 1871153.753 | 41.80181109 | -87.63056018 | (41.8018110912, -87.6305601801) | | 3 | Grand Crossing | 7040 S Cottage Grove Ave | Chicago | IL | 60637 | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ | 312-747-8201 | 312-747-5479 | 312-747-9168 | 1182739.183 | 1858317.732 | 41.76643089 | -87.60574786 | (41.7664308925, -87.6057478606) | | 4 | South Chicago | 2255 E 103rd St | Chicago | IL | 60617 | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ | 312-747-7581 | 312-747-5276 | 312-747-9169 | 1193131.299 | 1837090.265 | 41.70793329 | -87.56834912 | (41.7079332906, -87.5683491228) | | 5 | Calumet | 727 E 111th St | Chicago | IL | 60628 | http://home.chicagopolice.org/community/districts/5th-district-calumet/ | 312-747-8210 | 312-747-5935 | 312-747-9170 | 1183305.427 | 1831462.313 | 41.69272336 | -87.60450587 | (41.6927233639, -87.6045058667) | +-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+ 5 rows selected (0.364 seconds) Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM After running Oozie job with -DfileType=csv command line argument 
[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show tables; select * from policestationscsv limit 5;" WARNING: Use "yarn jar" to launch YARN applications. Connecting to jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM Connected to: Apache Hive (version 1.2.1.2.3.4.0-3485) Driver: Hive JDBC (version 1.2.1.2.3.4.0-3485) Transaction isolation: TRANSACTION_REPEATABLE_READ +--------------------+--+ | tab_name | +--------------------+--+ | crime | | crimenumbers | | policestationscsv | +--------------------+--+ 3 rows selected (0.131 seconds) +-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+ | policestationscsv.district | policestationscsv.district_name | policestationscsv.address | policestationscsv.city | policestationscsv.state | policestationscsv.zip | policestationscsv.website | policestationscsv.phone | policestationscsv.fax | policestationscsv.tty | policestationscsv.x_coordinate | policestationscsv.y_coordinate | policestationscsv.latitude | policestationscsv.longitude | policestationscsv.location | 
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+ | 1 | Central | 1718 S State St | Chicago | IL | 60616 | http://home.chicagopolice.org/community/districts/1st-district-central/ | 312-745-4290 | 312-745-3694 | 312-745-3693 | 1176569.052 | 1891771.704 | 41.85837259 | -87.62735617 | (41.8583725929, -87.627356171) | | 2 | Wentworth | 5101 S Wentworth Ave | Chicago | IL | 60609 | http://home.chicagopolice.org/community/districts/2nd-district-wentworth/ | 312-747-8366 | 312-747-5396 | 312-747-6656 | 1175864.837 | 1871153.753 | 41.80181109 | -87.63056018 | (41.8018110912, -87.6305601801) | | 3 | Grand Crossing | 7040 S Cottage Grove Ave | Chicago | IL | 60637 | http://home.chicagopolice.org/community/districts/3rd-district-grand-crossing/ | 312-747-8201 | 312-747-5479 | 312-747-9168 | 1182739.183 | 1858317.732 | 41.76643089 | -87.60574786 | (41.7664308925, -87.6057478606) | | 4 | South Chicago | 2255 E 103rd St | Chicago | IL | 60617 | http://home.chicagopolice.org/community/districts/4th-district-south-chicago/ | 312-747-7581 | 312-747-5276 | 312-747-9169 | 1193131.299 | 1837090.265 | 41.70793329 | -87.56834912 | (41.7079332906, -87.5683491228) | | 5 | Calumet | 727 E 111th St | Chicago | IL | 60628 | http://home.chicagopolice.org/community/districts/5th-district-calumet/ | 312-747-8210 | 312-747-5935 | 312-747-9170 | 1183305.427 | 1831462.313 | 41.69272336 | -87.60450587 | (41.6927233639, -87.6045058667) | 
+-----------------------------+----------------------------------+----------------------------+-------------------------+--------------------------+------------------------+---------------------------------------------------------------------------------+--------------------------+------------------------+------------------------+---------------------------------+---------------------------------+-----------------------------+------------------------------+----------------------------------+--+ 5 rows selected (0.116 seconds) Beeline version 1.2.1.2.3.4.0-3485 by Apache Hive Closing: 0: jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM
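The routing performed by the if-filetype decision node can be pictured with a small Python sketch. This is only an illustration of the switch semantics (cases are tested in order, the first true predicate wins, otherwise the default transition is taken), not Oozie code; the function name choose_transition is hypothetical.

```python
def choose_transition(file_type):
    """Mimic the <switch> in workflow.xml: first true case wins, else the default."""
    cases = [
        ("load-tsv", lambda ft: ft == "tsv"),  # ${fileType eq "tsv"}
        ("load-csv", lambda ft: ft == "csv"),  # ${fileType eq "csv"}
    ]
    default = "load-csv"  # <default to="load-csv"/>
    for target, predicate in cases:
        if predicate(file_type):
            return target
    return default


for value in ("tsv", "csv", "json"):
    print(value, "->", choose_transition(value))
```

One consequence worth noting: because the default transition points at load-csv, any unrecognized fileType value also runs the CSV load rather than failing.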

jyoung · 12-29-2016

Objective

Loop over a list of primary crime types ("THEFT", "STALKING", "GAMBLING", "DOMESTIC VIOLENCE"). For each crime type, query Hive, count the records of that primary crime type in the crime table, and insert the count into a new table (crimespertype).

This demo uses the oozieloop project written by Jeremy Beard in order to emulate looping in Oozie via sub-workflows. Please see the oozieloop project home page for a detailed explanation of how it works: https://github.com/jeremybeard/oozieloop

IMPORTANT: This demo requires that the crime table already exist in Hive, as created in the first exercise: Hive-2-Action-in-a-Kerberized-cluster

Procedure

On an edge node containing the oozie client, install git if it doesn't already exist
[root@jyoung-hdp234-1 ~]# yum install git-all

Change users to the oozie user
[root@jyoung-hdp234-1 ~]# su - oozie

Authenticate to the KDC using the oozie service account kerberos keytab
[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM

Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.
[oozie@jyoung-hdp234-1 ooziedemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie

Git-clone the oozieloop repository
[oozie@jyoung-hdp234-1 ooziedemo]$ git clone https://github.com/jeremybeard/oozieloop.git

Create a local directory to hold app workflow files
[oozie@jyoung-hdp234-1 ooziedemo]$ mkdir -p sumcrimetypes

Copy the oozieloop xml files to your workflow directory
[oozie@jyoung-hdp234-1 ooziedemo]$ cp oozieloop/*.xml sumcrimetypes/

Create a job properties file. Include a special key-value "loop_list" which will contain the list of values to loop over.
[oozie@jyoung-hdp234-1 ooziedemo]$ cd sumcrimetypes/
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sumcrimetypes.properties
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
wfDir=${nameNode}/user/${user.name}/ooziedemo/sumcrimetypes
oozie.wf.application.path=${wfDir}/loop_sumcrimetypes.xml
oozie.use.system.libpath=true
loopWorkflowPath=${wfDir}/loop_crime_types.xml
loop_parallel=false
loop_type=list
loop_list=THEFT,STALKING,GAMBLING,DOMESTIC VIOLENCE
# Hive2 action
sumcrimetypesHiveScript=${wfDir}/sum_crime_types.hql
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
EOF

Replace instances of '/your/path/to/' found in the oozieloop xml files with the workflow directory variable '${wfDir}'
[oozie@jyoung-hdp234-1 sumcrimetypes]$ find ./ -name "*.xml" -type f -exec sed -i -e 's|/your/path/to/|\${wfDir}/|g' {} \;

Create the workflow that will use oozieloop's loop.xml as a sub-workflow to loop over the loop_list property and execute the sumcrimetypes.xml workflow for each crime type in the loop_list
[oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > loop_sumcrimetypes.xml
<workflow-app name="loop_sumcrimetypes" xmlns="uri:oozie:workflow:0.4">
  <start to="loop"/>
  <action name="loop">
    <sub-workflow>
      <app-path>${wfDir}/loop.xml</app-path>
      <propagate-configuration/>
      <configuration>
        <property>
          <name>loop_action</name>
          <value>${wfDir}/sumcrimetypes.xml</value>
        </property>
        <property>
          <name>loop_name</name>
          <value>sum_crime_types</value>
        </property>
      </configuration>
    </sub-workflow>
    <ok to="end"/>
    <error to="error"/>
  </action>
  <kill name="error">
    <message>An error occurred while executing the loop / sum_crime_types sub-workflow! Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF

Create the workflow sumcrimetypes.xml that will be called in the loop.
This workflow will execute a hive HQL script on HiveServer2 passing in the primaryCrimeType as a parameter to the hive query [oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sumcrimetypes.xml <workflow-app name="sumcrimetypes_${loop_value}" xmlns="uri:oozie:workflow:0.4"> <global> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> </global> <credentials> <credential name="hs2-creds" type="hive2"> <property> <name>hive2.server.principal</name> <value>${jdbcPrincipal}</value> </property> <property> <name>hive2.jdbc.url</name> <value>${jdbcURL}</value> </property> </credential> </credentials> <start to="sumcrimetypes"/> <action name="sumcrimetypes" cred="hs2-creds"> <hive2 xmlns="uri:oozie:hive2-action:0.1"> <jdbc-url>${jdbcURL}</jdbc-url> <script>${sumcrimetypesHiveScript}</script> <param>primaryCrimeType=${loop_value}</param> </hive2> <ok to="end"/> <error to="error"/> </action> <kill name="error"> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end"/> </workflow-app> EOF Create the HQL file sum_crime_types.hql which will contain our parameterized Hive queries for summing the number of records per specified crime type and inserting the result into a new table - crimespertype [oozie@jyoung-hdp234-1 sumcrimetypes]$ cat << 'EOF' > sum_crime_types.hql CREATE TABLE IF NOT EXISTS crimespertype(primary_type STRING, number_of_crimes INT); INSERT INTO crimespertype SELECT '${primaryCrimeType}' AS primary_type, count(*) AS number_of_crimes FROM crime WHERE Primary_Type='${primaryCrimeType}'; EOF Create an empty lib dir [oozie@jyoung-hdp234-1 sumcrimetypes]$ mkdir -p lib Copy the oozie workflow directory to HDFS [oozie@jyoung-hdp234-1 sumcrimetypes]$ cd ../ [oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal sumcrimetypes /user/oozie/ooziedemo/ [oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -ls -R /user/oozie/ooziedemo/sumcrimetypes drwxr-xr-x - oozie hdfs 0 2016-12-18 04:48 
/user/oozie/ooziedemo/sumcrimetypes/lib -rw-r--r-- 3 oozie hdfs 1861 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop.xml -rw-r--r-- 3 oozie hdfs 4853 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_list_step.xml -rw-r--r-- 3 oozie hdfs 3912 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_range_step.xml -rw-r--r-- 3 oozie hdfs 952 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/loop_sumcrimetypes.xml -rw-r--r-- 3 oozie hdfs 240 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sum_crime_types.hql -rw-r--r-- 3 oozie hdfs 580 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sumcrimetypes.properties -rw-r--r-- 3 oozie hdfs 1026 2016-12-18 04:48 /user/oozie/ooziedemo/sumcrimetypes/sumcrimetypes.xml Run the oozie job [oozie@jyoung-hdp234-1 ooziedemo]$ oozie job -run -config sumcrimetypes/sumcrimetypes.properties -verbose -debug -auth kerberos job: 0000059-161213015814745-oozie-oozi-W Watch the job info and progress [oozie@jyoung-hdp234-1 ooziedemo]$ watch -d "oozie job -info 0000059-161213015814745-oozie-oozi-W" Verification Check the results in HiveServer 2 via beeline [hive@jyoung-hdp234-2 hive]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select * from crimespertype;" +-----------------------------+---------------------------------+--+ | crimespertype.primary_type | crimespertype.number_of_crimes | +-----------------------------+---------------------------------+--+ | THEFT | 1292228 | | STALKING | 2983 | | GAMBLING | 14035 | | DOMESTIC VIOLENCE | 1 | +-----------------------------+---------------------------------+--+ 4 rows selected (0.258 seconds)
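To make the net effect of the loop concrete, here is a rough Python sketch that splits the loop_list value on commas and fills the primaryCrimeType parameter into the INSERT statement from sum_crime_types.hql, once per crime type. It only mimics the outcome; it is not how oozieloop works internally (that project chains Oozie sub-workflows), and the helper name render_queries is hypothetical.

```python
from string import Template

# Value of loop_list from sumcrimetypes.properties.
LOOP_LIST = "THEFT,STALKING,GAMBLING,DOMESTIC VIOLENCE"

# The INSERT statement from sum_crime_types.hql, with the same ${primaryCrimeType} placeholder.
HQL_TEMPLATE = Template(
    "INSERT INTO crimespertype "
    "SELECT '${primaryCrimeType}' AS primary_type, count(*) AS number_of_crimes "
    "FROM crime WHERE Primary_Type='${primaryCrimeType}';"
)


def render_queries(loop_list):
    """Return one rendered query per comma-separated loop value (hypothetical helper)."""
    return [
        HQL_TEMPLATE.substitute(primaryCrimeType=value)
        for value in loop_list.split(",")
    ]


queries = render_queries(LOOP_LIST)
for query in queries:
    print(query)
```

Each rendered statement corresponds to one row of the crimespertype result shown in the verification output above.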

jyoung · 12-29-2016

Objective

Use Oozie's Hive 2 Action to create a workflow which will connect to HiveServer2 in a Kerberized environment.
Execute a Hive query script which will sum the number of crimes in the crime table for a particular year, passed in as the parameter queryYear in the job.properties file.
Write the results to a new Hive table - crimenumbers

Procedure

Log into the edge server containing the Oozie client. Change users to the oozie user.

[root@jyoung-hdp234-1 ~]# su - oozie

Authenticate to the KDC using the oozie service account kerberos keytab

[oozie@jyoung-hdp234-1 ~]$ kinit -kt /etc/security/keytabs/oozie.service.keytab oozie/jyoung-hdp234-1.openstacklocal@EXAMPLE.COM

Download the City of Chicago crime data in CSV form.

[oozie@jyoung-hdp234-1 ~]$ mkdir -p /tmp/crime
[oozie@jyoung-hdp234-1 ~]$ cd /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ curl -o crime -L "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

Put the crime csv into HDFS in the /tmp/crime folder.

[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -mkdir -p /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -copyFromLocal crime /tmp/crime/
[hdfs@jyoung-hdp234-1 tmp]$ hdfs dfs -chmod -R 777 /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -chmod -R 777 /tmp/crime
[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -ls /tmp/crime
Found 1 items
drwxrwxrwx - oozie hdfs 0 2016-12-19 08:32 /tmp/crime/crime

Log into the Hive server. Change users to the hive user.

[root@jyoung-hdp234-2 ~]# su - hive

Authenticate to the KDC using the hive service account kerberos keytab

[hive@jyoung-hdp234-2 ~]$ kinit -kt /etc/security/keytabs/hive.service.keytab hive/jyoung-hdp234-2.openstacklocal@EXAMPLE.COM

Create the SQL DDL script that will create the schema of the crime Hive table as an external table based on the crime csv located in HDFS.

[hive@jyoung-hdp234-2 ~]$ cat << 'EOF' > /tmp/load_crime_table.ddl
CREATE EXTERNAL TABLE IF NOT EXISTS crime(
  ID STRING,
  Case_Number STRING,
  Case_Date STRING,
  Block STRING,
  IUCR INT,
  Primary_Type STRING,
  Description STRING,
  Location_Description STRING,
  Arrest BOOLEAN,
  Domestic BOOLEAN,
  Beat STRING,
  District STRING,
  Ward STRING,
  Community_Area STRING,
  FBI_Code STRING,
  X_Coordinate INT,
  Y_Coordinate INT,
  Case_Year INT,
  Updated_On STRING,
  Latitude DOUBLE,
  Longitude DOUBLE,
  Location STRING)
COMMENT 'This is crime data for the city of Chicago.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/tmp/crime'
TBLPROPERTIES("skip.header.line.count"="1");
EOF

Use beeline to execute the DDL and create the external Hive table.

[hive@jyoung-hdp234-2 ~]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -f "/tmp/load_crime_table.ddl"

On the Oozie server / edge node, create an Oozie workflow directory for this and future oozie demo workflows

[oozie@jyoung-hdp234-1 crime]$ hdfs dfs -mkdir -p /user/oozie/ooziedemo

Create a local folder to hold development copies of your Oozie workflow project files

[oozie@jyoung-hdp234-1 crime]$ cd ~/
[oozie@jyoung-hdp234-1 ~]$ mkdir -p ooziedemo/hivedemo/app/lib
[oozie@jyoung-hdp234-1 ~]$ cd ooziedemo/hivedemo

Create the job.properties file that will contain the configuration properties and variables for the workflow

[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > job.properties
nameNode=hdfs://jyoung-hdp234-1.openstacklocal:8020
jobTracker=jyoung-hdp234-2.openstacklocal:8050
exampleDir=${nameNode}/user/${user.name}/ooziedemo/hivedemo
oozie.wf.application.path=${exampleDir}/app
oozie.use.system.libpath=true
# Hive2 action
hivescript=${oozie.wf.application.path}/crime_per_year.hql
outputHiveDatabase=default
jdbcURL=jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default
jdbcPrincipal=hive/_HOST@EXAMPLE.COM
queryYear=2008
EOF

Create the workflow.xml which will execute an HQL script on HiveServer2

[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > app/workflow.xml
<workflow-app name="hivedemo" xmlns="uri:oozie:workflow:0.4">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>
  <credentials>
    <credential name="hs2-creds" type="hive2">
      <property>
        <name>hive2.server.principal</name>
        <value>${jdbcPrincipal}</value>
      </property>
      <property>
        <name>hive2.jdbc.url</name>
        <value>${jdbcURL}</value>
      </property>
    </credential>
  </credentials>
  <start to="hive2"/>
  <action name="hive2" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${hivescript}</script>
      <param>queryYear=${queryYear}</param>
    </hive2>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="End"/>
</workflow-app>
EOF

Create the HQL script that will contain the parameterized Hive query to be executed by the workflow

[oozie@jyoung-hdp234-1 hivedemo]$ cat << 'EOF' > app/crime_per_year.hql
CREATE TABLE IF NOT EXISTS crimenumbers(year INT, number_of_crimes INT);
INSERT INTO crimenumbers
SELECT ${queryYear} as year, count(*) as number_of_crimes
FROM crime
WHERE case_date LIKE '%${queryYear}%';
EOF

Copy the hivedemo folder to HDFS

[oozie@jyoung-hdp234-1 hivedemo]$ cd ~/ooziedemo
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -copyFromLocal hivedemo /user/oozie/ooziedemo/
[oozie@jyoung-hdp234-1 ooziedemo]$ hdfs dfs -ls -R /user/oozie/ooziedemo/
drwxr-xr-x - oozie hdfs 0 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo
drwxr-xr-x - oozie hdfs 0 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app
-rw-r--r-- 3 oozie hdfs 206 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app/crime_per_year.hql
drwxr-xr-x - oozie hdfs 0 2016-12-19 08:54 /user/oozie/ooziedemo/hivedemo/app/lib
-rw-r--r-- 3 oozie hdfs 968 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/app/workflow.xml
-rw-r--r-- 3 oozie hdfs 452 2016-12-19 09:09 /user/oozie/ooziedemo/hivedemo/job.properties

Set and export the OOZIE_URL environment variable so that we don't have to specify -oozie http://jyoung-hdp234-1.openstacklocal:11000/oozie every time we run the oozie command.

[oozie@jyoung-hdp234-1 hivedemo]$ export OOZIE_URL=http://jyoung-hdp234-1.openstacklocal:11000/oozie

Run the oozie job

[oozie@jyoung-hdp234-1 ooziedemo]$ cd hivedemo
[oozie@jyoung-hdp234-1 hivedemo]$ oozie job -run -config job.properties -verbose -debug -auth kerberos
...
job: 0000099-161213015814745-oozie-oozi-W

Watch the job info and progress

[oozie@jyoung-hdp234-1 hivedemo]$ watch -d "oozie job -info 0000099-161213015814745-oozie-oozi-W"

Verification

Check the results in HiveServer2 via beeline

[hive@jyoung-hdp234-2 hive]$ beeline -u "jdbc:hive2://jyoung-hdp234-2.openstacklocal:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select * from crimenumbers;"
+--------------------+--------------------------------+--+
| crimenumbers.year  | crimenumbers.number_of_crimes  |
+--------------------+--------------------------------+--+
| 2008               | 426960                         |
+--------------------+--------------------------------+--+
1 row selected (0.286 seconds)
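Before copying a workflow to HDFS, it can help to check locally that every transition points at a node that actually exists. In workflow.xml above, for example, the targets are spelled "End" and "Kill". The following is a minimal Python sketch of such a check, not part of Oozie itself. The function name `dangling_transitions` is hypothetical, the embedded XML is an abbreviated copy of workflow.xml (global and credentials sections omitted, kill message shortened), and the check assumes node names are matched exactly, including case.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of app/workflow.xml from the post: only the nodes and
# transitions are kept. The ${...} placeholders are plain text to the parser.
WORKFLOW_XML = """\
<workflow-app name="hivedemo" xmlns="uri:oozie:workflow:0.4">
  <start to="hive2"/>
  <action name="hive2" cred="hs2-creds">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
      <jdbc-url>${jdbcURL}</jdbc-url>
      <script>${hivescript}</script>
      <param>queryYear=${queryYear}</param>
    </hive2>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <kill name="Kill">
    <message>Action failed</message>
  </kill>
  <end name="End"/>
</workflow-app>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def dangling_transitions(xml_text: str) -> list:
    """Return the sorted 'to' targets that do not match any named node."""
    root = ET.fromstring(xml_text)
    node_names = set()
    targets = []
    for node in root:
        # Collect every top-level node that carries a name (action, kill, end).
        if node.get("name") is not None:
            node_names.add(node.get("name"))
        # <start to="..."/> is a transition on the node itself.
        if local_name(node.tag) == "start":
            targets.append(node.get("to"))
        # <ok to="..."/> and <error to="..."/> are children of an action.
        for child in node:
            if local_name(child.tag) in ("ok", "error"):
                targets.append(child.get("to"))
    return sorted(t for t in targets if t not in node_names)


if __name__ == "__main__":
    print(dangling_transitions(WORKFLOW_XML))
```

The sketch only covers the start, ok and error transitions used in this workflow. Decision, fork and join nodes would need additional handling.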

jyoung · 12-29-2016

Here I have created a series of demos to highlight workflows that cover examples of using:

- Hive 2 Action in a Kerberized cluster
- Jeremy Beard's oozieloop project to simulate looping with sub-workflows
- Decision nodes to simulate conditional operators (if-then-else)

The hands-on wikis and source code for these workflow demos are hosted in my GitHub repository at: https://github.com/jlyoung/advancedoozieworkflows

Last Visited	‎06-21-2019 04:30 PM

Member Since	‎07-12-2016 05:18 AM
Posts	15
Kudos received	11

Cloudera Community

Re: how to track the delete operations in HDFS?

Re: Hbase master crashing after startup

Oozie - Simulating conditional operators (if then ...

Oozie - Simulating looping with sub workflows

Oozie Hive 2 Action in a Kerberized cluster

Advanced Oozie Workflows