The first thing you need to do in WFM is create a new Pig action.
Now you can start editing the properties of the action. Since we're going to run a Pig script, let's add the script property to the wf.
This just tells Oozie that we're going to execute a script; you still need to add the <file> element to the wf as well.
The <file> element expects a file in the workflow directory; we will upload the Pig script later.
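Together, the two elements end up in the Pig action looking roughly like this (assuming we name the script sample.pig, as we do below):

<script>sample.pig</script>
<file>sample.pig</file>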
Next, since we're going to run Pig against Hive, we need to provide the Thrift metastore information, either as a property or by including a hive-site.xml file in the wf directory. Since hive-site.xml changes fairly often, it's best to add the property as part of the wf. You can find the value in Ambari > Hive > Configs by searching for hive.metastore.uris.
Now, in WFM, add that property to the configuration section of the Pig action.
I also want to compress the output coming from the mappers to improve the performance of intermediate I/O. For that, I'm going to use the MapReduce property mapreduce.map.output.compress and set it to true.
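The configuration section of the action should then contain something like the following sketch (the metastore host is a placeholder; 9083 is the usual default Thrift port, but use the exact value you copied from Ambari):

<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://your-metastore-host:9083</value>
    </property>
    <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
    </property>
</configuration>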
At this point, I'd like to see how I'm doing, so I'll preview the workflow in XML form. You can find the preview under the workflow actions.
This is also a good time to confirm your Thrift URI and the commonly forgotten <script> and <file> elements.
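The preview should look roughly like the sketch below (the workflow and action names are whatever you chose in WFM, and ${jobTracker}/${nameNode} are the usual Oozie parameters; the XML WFM generates may differ slightly):

<workflow-app xmlns="uri:oozie:workflow:0.5" name="pig-hcatalog-wf">
    <start to="pig_1"/>
    <action name="pig_1">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>hive.metastore.uris</name>
                    <value>thrift://your-metastore-host:9083</value>
                </property>
                <property>
                    <name>mapreduce.map.output.compress</name>
                    <value>true</value>
                </property>
            </configuration>
            <script>sample.pig</script>
            <file>sample.pig</file>
        </pig>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>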
Now, finally, let's add the script to the wf directory. Use your favorite editor, paste in the Pig code below, and save the file as sample.pig:
-- point the sql command at the hcat binary
set hcat.bin /usr/bin/hcat;
-- run a Hive "show tables;" through HCatalog
sql show tables;
-- load the Hive table 'data' via HCatLoader
A = LOAD 'data' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- keep a single record and print it to the console
B = LIMIT A 1;
DUMP B;
I have a Hive table called 'data', and that's what I'm going to load in Pig; I'm going to peek into the table and dump one record to the console. In the second line of the script, I'm also executing a Hive "show tables;" command.
I also recommend executing this script manually to make sure it works. The command for it is:
pig -x tez -f sample.pig -useHCatalog
Once it executes, you can see the output on the console. For brevity, I will only show the output we're looking for:
2017-02-12 14:29:52,403 [main] INFO org.apache.pig.tools.grunt.GruntParser - Going to run hcat command: show tables;
2017-02-12 14:30:09,205 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
Notice the output of show tables, and then (abc,xyz); that's the data I have in my 'data' table.
Finally, upload the file to the wf directory. Save the wf in WFM first to create the directory, or point the wf to an existing directory that already contains the script.
hdfs dfs -put sample.pig oozie/pig-hcatalog/
We are finally ready to execute the wf. As the last step, we need to tell the wf that we're going to use Pig with Hive and HCatalog, so we add the property oozie.action.sharelib.for.pig=hive,pig,hcatalog. This property tells Oozie that we need more than just the Pig libraries to execute the action.
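The property can go in the job configuration at submission time, or directly in the action's configuration section; a sketch of the latter, added next to the properties we set earlier:

<property>
    <name>oozie.action.sharelib.for.pig</name>
    <value>hive,pig,hcatalog</value>
</property>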
Let's check the status of the wf by clicking the Dashboard button. Luckily, the wf succeeded.
Let's click on the job and go to the Flow Graph tab. All nodes appear in green, which means the wf succeeded, but we already knew that.
Navigate to the Action tab; from there we'll be able to drill down to the Resource Manager job history.
Let's click the arrow facing up to continue to the RM. Go through the logs in the job history; in the stdout log you can find the output we're looking for: the output of show tables and the output of the dump command.