Created 07-07-2017 06:28 AM
I am trying to follow tutorial-100 for Apache Pig. When I run the script, the Results tab does not show the script's output, so it is very hard to understand what the script is doing.
In the results I get the following:
Apache Pig version 0.16.0.2.6.0.3-8 (rexported)
compiled Apr 01 2017, 21:50:35

USAGE: Pig [options] [-] : Run interactively in grunt shell.
       Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
       Pig [options] [-f[ile]] file : Run cmds found in file.
  options include:
    -4, -log4jconf - Log4j configuration file, overrides log conf
    -b, -brief - Brief logging (no timestamps)
    -c, -check - Syntax check
    -d, -debug - Debug level, INFO is default
    -e, -execute - Commands to execute (within quotes)
    -f, -file - Path to the script to execute
    -g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
    -h, -help - Display this message. You can specify topic to get help for that topic.
        properties is the only topic currently supported: -h properties.
    -i, -version - Display version information
    -l, -logfile - Path to client side log file; default is current working directory.
    -m, -param_file - Path to the parameter file
    -p, -param - Key value pair of the form param=val
    -r, -dryrun - Produces script with substituted parameters. Script is not executed.
    -t, -optimizer_off - Turn optimizations off. The following values are supported:
            ConstantCalculator - Calculate constants at compile time
            SplitFilter - Split filter conditions
            PushUpFilter - Filter as early as possible
            MergeFilter - Merge filter conditions
            PushDownForeachFlatten - Join or explode as late as possible
            LimitOptimizer - Limit as early as possible
            ColumnMapKeyPrune - Remove unused data
            AddForEach - Add ForEach to remove unneeded columns
            MergeForEach - Merge adjacent ForEach
            GroupByConstParallelSetter - Force parallel 1 for "group all" statement
            PartitionFilterOptimizer - Pushdown partition filter conditions to loader implementing LoadMetaData
            PredicatePushdownOptimizer - Pushdown filter predicates to loader implementing LoadPredicatePushDown
            All - Disable all optimizations
        All optimizations listed here are enabled by default. Optimization values are case insensitive.
    -v, -verbose - Print all error messages to screen
    -w, -warning - Turn warning logging on; also turns warning aggregation off
    -x, -exectype - Set execution mode: local|mapreduce|tez, default is mapreduce.
    -F, -stop_on_failure - Aborts execution on the first failed job; default is off
    -M, -no_multiquery - Turn multiquery optimization off; default is on
    -N, -no_fetch - Turn fetch optimization off; default is on
    -P, -propertyFile - Path to property file
    -printCmdDebug - Overrides anything else and prints the actual command used to run Pig, including
                     any environment variables that are set by the pig command.
And in the log, I see this:
WARNING: Use "yarn jar" to launch YARN applications.
17/07/07 06:16:36 INFO pig.Main: Pig script completed in 196 milliseconds (196 ms)
The script I am running is below. Please advise whether the output in the results is normal. If it is normal, how can I see the output of the script at each step? Thanks
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = FILTER a BY event != 'normal';
c = FOREACH b GENERATE driverid, event, (int)1 as occurance;
d = GROUP c BY driverid;
e = FOREACH d GENERATE group as driverid, sum(c.occurance) as t_occ;
g = LOAD 'driver_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
h = join e by driverid, g by driverid;
dump h;
Created 07-07-2017 10:27 PM
Created 07-10-2017 01:26 AM
The "yarn jar" warning is nothing to worry about. The output you received is Pig's usage message, which suggests that the script was never launched. My guess is that your command-line invocation was incorrect. It should have been something like the following:
pig -useHCatalog yourscript.pig
You can see some examples of this at https://martin.atlassian.net/wiki/x/AgCfB (including running via Tez). If you are running from the command line, please show the exact command you ran. If you are running from the Ambari View, be sure to add the -useHCatalog argument as shown in Step 5.4 of https://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/.
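As a rough sketch of a command-line run, the snippet below builds the command and executes it only where the pig client is installed. The script name is a placeholder, and the -x tez option is taken from the usage text quoted in the question; drop it to use the default mapreduce mode.

```shell
#!/bin/sh
# Placeholder script name (an assumption); replace with the path to your own Pig script.
SCRIPT="yourscript.pig"

# -useHCatalog makes the HCatalog classes available so HCatLoader can be resolved.
# -x tez selects the Tez execution mode (per the usage text, the default is mapreduce).
PIG_CMD="pig -useHCatalog -x tez $SCRIPT"

# Print the exact command so it can be pasted into a reply if something goes wrong.
echo "$PIG_CMD"

# Run only where the pig client is on the PATH and the script file exists.
if command -v pig >/dev/null 2>&1 && [ -f "$SCRIPT" ]; then
    $PIG_CMD
fi
```

If Pig prints its usage message again instead of running the script, the arguments were probably not passed as intended, which is the same symptom as in the question.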
Created 07-10-2017 07:16 AM
@Lester Martin I am running the script from Ambari and I have the -useHCatalog argument added. But when I run the script below:
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = FILTER a BY event != 'normal';
dump b;
Instead of getting the output of the script, I get what you can see in the attached txt file. I want to know whether this is normal and how I can see what the script is doing. Thanks
Created 07-10-2017 07:05 PM
Gotcha; you are using the Ambari View. It still seems that the script is not being invoked properly. Can you provide a screenshot of the Ambari View, especially the section with the -useHCatalog argument? Did you try it both with and without the "use Tez" checkbox selected? While this code looks good, it is often a good idea to try the code out from the CLI just to eliminate one more variable (again, the code looks simple and direct enough that I don't think this would provide much value other than showing you it can run).
Created 07-11-2017 04:22 AM
Thanks @Lester Martin. I removed -useHCatalog and re-added it, and now it seems to display the results of the script. It was very hard to learn without seeing the output of the script at each step.