PigStorage in mapreduce mode

lenovomi — Fri, 05 Feb 2016 08:13:18 GMT

Hi,

I am trying to execute a Pig script in mapreduce mode. The script is simple:

grunt> sourceData = load 'hdfs://sandbox.hortonworks.com:8020/src/CustomerData.csv' using PigStorage(';') as (nullname: chararray,customerId: chararray,VIN: chararray,Birthdate: chararray,Mileage: chararray,Fuel_Consumption: chararray);

File is stored in HDFS:

hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/src/CustomerData.csv

-rw-r--r--  3 hdfs hdfs  6828 2016-02-04 23:55 hdfs://sandbox.hortonworks.com:8020/src/CustomerData.csv

The error I got:

Failed Jobs:
JobId                   Alias       Feature   Message               Outputs
job_1454609613558_0003  sourceData  MAP_ONLY  Message: Job failed!  hdfs://sandbox.hortonworks.com:8020/tmp/temp-710368608/tmp-1611282262,

Input(s): Failed to read data from "hdfs://sandbox.hortonworks.com:8020/src/CustomerData.csv"

Output(s): Failed to produce result in "hdfs://sandbox.hortonworks.com:8020/tmp/temp-710368608/tmp-1611282262"

Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias sourceData

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sourceData
    at org.apache.pig.PigServer.openIterator(PigServer.java:935)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
    at org.apache.pig.Main.run(Main.java:565)
    at org.apache.pig.Main.main(Main.java:177)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
    at org.apache.pig.PigServer.openIterator(PigServer.java:927)
    ... 13 more

Re: PigStorage in mapreduce mode

aervits — Fri, 05 Feb 2016 08:23:06 GMT

@John Smith

How did you launch Pig Grunt?

On the sandbox you can use Tez:

pig -x tez

You can refer to your dataset like so:

'/src/filename.csv'; you don't need to set the hdfs scheme explicitly. Also, make sure the src directory has permissions for the user you are executing as. Finally, the last time I looked at your dataset, I thought the delimiter was a comma and not a semicolon.
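
A quick way to confirm which delimiter the file really uses is to count the fields per line for a candidate delimiter. The sketch below is not from the original thread: `count_fields` is a hypothetical helper name, and it splits naively, so a delimiter inside quotes would be counted as well.

```shell
# count_fields DELIM: read lines on stdin and print, for each distinct
# field count, how many lines have it (format: "<fields> <lines>").
# Hypothetical helper for illustration; splits on every DELIM, quotes ignored.
count_fields() {
  awk -F"$1" '{ n[NF]++ } END { for (k in n) print k, n[k] }' | sort -n
}

# Assumed usage against the file from the question (needs a running cluster):
#   hadoop fs -ls -d /src                      # check directory permissions
#   hadoop fs -cat /src/CustomerData.csv | head -n 20 | count_fields ';'
# A single output line such as "6 20" would mean all 20 lines have 6 fields.
```

If the count for ';' comes back as 1 on every line, the file is most likely not semicolon-delimited and ',' is worth trying instead.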

Re: PigStorage in mapreduce mode

lenovomi — Fri, 05 Feb 2016 08:39:20 GMT

Needless to say, this is insane.

Yes, I launched grunt with -x mapreduce. I also tried -x tez, but:

2016-02-05 00:37:42,172 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias sourceData
Details at logfile: /home/hdfs/pig_1454632554431.log

The privileges are correct:

drwxr-xr-x   - hdfs   hdfs            0 2016-02-04 23:55 /src

The delimiter is ;

Any idea?

Re: PigStorage in mapreduce mode

aervits — Fri, 05 Feb 2016 08:48:24 GMT

@John Smith upload your dataset and I will try it out in a bit.

Re: PigStorage in mapreduce mode

lenovomi — Fri, 05 Feb 2016 08:59:15 GMT

You can find the dataset here:

https://drive.google.com/file/d/0B6RZ_9vVuTEcTHllU1dIR2VBY1E/view?usp=sharing

Thank you

Re: PigStorage in mapreduce mode

daijy — Fri, 05 Feb 2016 09:46:01 GMT

I ran your load statement followed by a dump successfully on the 2.3 sandbox. What's your complete script?

Re: PigStorage in mapreduce mode

aervits — Fri, 05 Feb 2016 10:49:43 GMT

@John Smith

Something is wrong with your environment. I was able to execute your statements in both mapred and tez modes:

grunt> sourceData = load 'CustomerData.csv' using PigStorage(';') as (nullname: chararray,customerId: chararray,VIN: chararray,Birthdate: chararray,Mileage: chararray,Fuel_Consumption: chararray);
grunt> describe sourceData;
sourceData: {nullname: chararray,customerId: chararray,VIN: chararray,Birthdate: chararray,Mileage: chararray,Fuel_Consumption: chararray}
grunt> b = limit sourceData 5;
grunt> dump b;
2016-02-05 02:43:32,930 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2016-02-05 02:43:33,105 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-02-05 02:43:33,106 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2016-02-05 02:43:33,179 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2016-02-05 02:43:33,179 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2016-02-05 02:43:33,209 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2016-02-05 02:43:33,256 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-02-05 02:43:33,257 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2016-02-05 02:43:33,305 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt__0001_m_000001_1' to hdfs://c6401.ambari.apache.org:8020/tmp/temp2063345867/tmp1865027526/_temporary/0/task__0001_m_000001
2016-02-05 02:43:33,333 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-02-05 02:43:33,336 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-02-05 02:43:33,336 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
("Ronni Engelmann",93117643,"WBA68251082969954","1971-11-15",41,10.26)
("Kina Buttars",12452346,"WBA32649710927373","1968-08-14",68,10.551)
("Caren Rodman",18853438,"WBA56064572124841","1987-01-24",96,6.779)
("Tierra Bork",89673290,"WBA69315467645466","1958-11-22",52,10.109)
("Thelma Steve",97170856,"WBA73739033913927","1985-12-03",98,5.081)

And the output of illustrate in mapred mode:

("Leonel Bullen",50258523,"WBA23530058599244","1986-08-26",27,8.673)
2016-02-05 02:43:47,393 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-02-05 02:43:47,393 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-02-05 02:43:47,393 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2016-02-05 02:43:47,394 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2016-02-05 02:43:47,401 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-02-05 02:43:47,452 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-02-05 02:43:47,459 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: sourceData[4,13] C:  R:
2016-02-05 02:43:47,459 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-02-05 02:43:47,470 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-02-05 02:43:47,470 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2016-02-05 02:43:47,473 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2016-02-05 02:43:47,473 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-02-05 02:43:47,520 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-02-05 02:43:47,536 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: sourceData[4,13] C:  R:
2016-02-05 02:43:47,538 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-02-05 02:43:47,542 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-02-05 02:43:47,542 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2016-02-05 02:43:47,545 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2016-02-05 02:43:47,545 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-02-05 02:43:47,589 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-02-05 02:43:47,606 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: sourceData[4,13] C:  R:
2016-02-05 02:43:47,606 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-02-05 02:43:47,612 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-02-05 02:43:47,612 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2016-02-05 02:43:47,613 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2016-02-05 02:43:47,613 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-02-05 02:43:47,665 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-02-05 02:43:47,668 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: sourceData[4,13] C:  R:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| sourceData     | nullname:chararray     | customerId:chararray     | VIN:chararray       | Birthdate:chararray     | Mileage:chararray     | Fuel_Consumption:chararray     |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                | "Leonel Bullen"        | 50258523                 | "WBA23530058599244" | "1986-08-26"            | 27                    | 8.673                          |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

And in tez mode:

grunt> sourceData = load 'CustomerData.csv' using PigStorage(';') as (nullname: chararray,customerId: chararray,VIN: chararray,Birthdate: chararray,Mileage: chararray,Fuel_Consumption: chararray);
grunt> describe sourceData;
sourceData: {nullname: chararray,customerId: chararray,VIN: chararray,Birthdate: chararray,Mileage: chararray,Fuel_Consumption: chararray}
grunt> b = limit sourceData 5;
grunt> dump b;
2016-02-05 02:46:17,619 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2016-02-05 02:46:17,698 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2016-02-05 02:46:17,749 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2016-02-05 02:46:18,039 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2016-02-05 02:46:18,039 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2016-02-05 02:46:18,143 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2016-02-05 02:46:18,274 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-02-05 02:46:18,288 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2016-02-05 02:46:18,711 [main] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt__0001_m_000001_1' to hdfs://c6401.ambari.apache.org:8020/tmp/temp-785652652/tmp136925164/_temporary/0/task__0001_m_000001
2016-02-05 02:46:18,782 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-02-05 02:46:18,811 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-02-05 02:46:18,811 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
("Ronni Engelmann",93117643,"WBA68251082969954","1971-11-15",41,10.26)
("Kina Buttars",12452346,"WBA32649710927373","1968-08-14",68,10.551)
("Caren Rodman",18853438,"WBA56064572124841","1987-01-24",96,6.779)
("Tierra Bork",89673290,"WBA69315467645466","1958-11-22",52,10.109)
("Thelma Steve",97170856,"WBA73739033913927","1985-12-03",98,5.081)

There is a bug where the illustrate command doesn't work in tez mode yet.

Bottom line: I tested it with 'CustomerData.csv', with '/user/root/CustomerData.csv', and with 'hdfs://fqdn:8020/user/root/CustomerData.csv'.

Re: PigStorage in mapreduce mode

daijy — Fri, 05 Feb 2016 13:15:20 GMT

First, the error stack does not tell us much. You will need to go to the MapReduce web UI, click the job, and find the real error message. Second, your input is a CSV file and you use ; as the delimiter for PigStorage. That sounds wrong unless you are sure that's the case.
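
The same container logs can usually be pulled from the command line instead of the web UI. This is a minimal sketch, assuming YARN log aggregation is enabled on the sandbox. `job_to_app_id` is a hypothetical helper that relies on the convention that a MapReduce job ID and its YARN application ID share the same numeric suffix.

```shell
# job_to_app_id JOB_ID: turn a MapReduce job ID into the matching YARN
# application ID by swapping the "job_" prefix for "application_".
# Hypothetical helper name, for illustration only.
job_to_app_id() {
  printf 'application_%s\n' "${1#job_}"
}

# Assumed usage with the failed job from the question (needs a running
# cluster with log aggregation turned on):
#   yarn logs -applicationId "$(job_to_app_id job_1454609613558_0003)" | less
```

The first exception in the ApplicationMaster or task logs is usually more specific than ERROR 1066, which only says that the job behind the dump failed.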

Re: PigStorage in mapreduce mode

lenovomi — Fri, 05 Feb 2016 16:43:47 GMT

Then what kind of issue with the environment could it be?

I only executed the mentioned command, nothing else.

Re: PigStorage in mapreduce mode

lenovomi — Fri, 05 Feb 2016 17:03:59 GMT

This is odd. When I do

grunt> b = limit sourceData 5;

grunt> dump b;

it works for me also. When I don't limit the result set and just execute dump sourceData; I get the same error.

Re: PigStorage in mapreduce mode

aervits — Fri, 05 Feb 2016 19:38:57 GMT

@John Smith I think it crashed on me when I dumped the whole dataset. There might be a problem with your dataset further down.

Re: PigStorage in mapreduce mode

lenovomi — Fri, 05 Feb 2016 22:45:46 GMT

I am 100% sure there is no problem with the input dataset. I kept only the first 5 records in the file and it's the same issue.

Re: PigStorage in mapreduce mode

aervits — Fri, 05 Feb 2016 23:04:20 GMT

@John Smith you got me there; as you can see, my attempt with your file worked. Alternatively, take a look at CSVExcelStorage, which is more capable than PigStorage.

I am not saying this is the cause, and I don't know what's wrong, but here's a note. I'm not sure how valid it still is, as it has been around for a while and doesn't mention which version of Pig was used:

Limitations

PigStorage is an extremely simple loader that does not handle special cases such as embedded delimiters or escaped control characters; it will split on every instance of the delimiter regardless of context. For this reason, when loading a CSV file it is recommended to use CSVExcelStorage rather than PigStorage with a comma delimiter.
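
The effect described in that note can be imitated outside Pig. The sketch below uses awk as a stand-in for a naive delimiter split (it is not PigStorage itself), and the sample record is made up for illustration: a quoted name that contains the delimiter is cut into two fields.

```shell
# naive_field_count DELIM: print the number of fields in each stdin line when
# splitting on every DELIM, ignoring quotes (the behaviour the note describes).
# Hypothetical helper; awk only imitates the split, it does not run Pig.
naive_field_count() {
  awk -F"$1" '{ print NF }'
}

# Made-up record: two logical fields, but the quoted name contains a ';'.
printf '"Engelmann; Ronni";93117643\n' | naive_field_count ';'   # prints 3
```

A quote-aware loader would report two fields for that record, which is the kind of case the note suggests CSVExcelStorage for.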

Re: PigStorage in mapreduce mode

lenovomi — Sat, 06 Feb 2016 00:01:44 GMT

Well, CSVExcelStorage doesn't work either.

2016-02-05 16:01:28,917 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-02-05 16:01:29,745 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias sourceData
Details at logfile: /home/hdfs/pig_1454687855333.log
grunt>

I'm confused... what is it?

Re: PigStorage in mapreduce mode

aervits — Sat, 06 Feb 2016 00:17:12 GMT

@John Smith if you identified another bug, I'm going to buy a lottery ticket.

Re: PigStorage in mapreduce mode

daijy — Sat, 06 Feb 2016 02:53:54 GMT

As I commented above, I cannot reproduce the error. The error you posted is too general. Can you go to the Hadoop web UI and get the detailed message?

Re: PigStorage in mapreduce mode

lenovomi — Tue, 09 Feb 2016 21:37:50 GMT

It's strange you can't reproduce the error. Does it work for you?

Application application_1454923438220_0007 failed 2 times due to AM Container for appattempt_1454923438220_0007_000002 exited with exitCode: 1
For more detailed output, check application tracking page: http://sandbox.hortonworks.com:8088/cluster/app/application_1454923438220_0007
Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e10_1454923438220_0007_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.

Re: PigStorage in mapreduce mode

shalini_goel — Sat, 01 Apr 2017 08:40:08 GMT

I got the same issue in the Hortonworks sandbox environment. The script was correct but was throwing this error:

Unable to open iterator for alias

I found that the JobHistory server was not running by default. I could not work out the connection between the two, but after starting the history server, my Pig script worked in both tez and mapreduce mode. Try it and see if it works for you as well.

[mapred@sandbox ~]$ cd /usr/hdp/current/hadoop-mapreduce-historyserver/sbin
[mapred@sandbox sbin]$ ls 
mr-jobhistory-daemon.sh
[mapred@sandbox sbin]$ ./mr-jobhistory-daemon.sh start historyserver
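
Before starting the daemon it can help to check whether a JobHistory server process is already up. This is a minimal sketch, assuming the daemon's JVM command line contains the class name JobHistoryServer (its main class is org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer). `has_historyserver` is a hypothetical helper that inspects a process listing passed on stdin.

```shell
# has_historyserver: read a process listing on stdin and succeed (exit 0)
# if any line mentions the JobHistoryServer main class.
# The [J] bracket keeps this grep from matching its own entry in `ps` output.
# Hypothetical helper name, for illustration only.
has_historyserver() {
  grep -q '[J]obHistoryServer'
}

# Assumed usage on the sandbox host:
#   if ps -ef | has_historyserver; then
#     echo "history server running"
#   else
#     echo "history server not running"
#   fi
```

If the check reports that nothing is running, the start command above is the next step.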