Created 01-29-2016 03:32 AM
Hi,
I have a simple Pig script. I'm trying to load an Avro file (or a directory containing Avro files) using AvroStorage in MapReduce mode. I tried almost all the path combinations (hdfs://, / , hdfs://ip:port/file ...) but nothing works.
Using command below
set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggybank.storage.avro.AvroStorage ();
I got error:
2016-01-29 00:10:08,439 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
2016-01-29 00:10:08,439 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2016-01-29 00:10:08,439 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
    at org.apache.pig.PigServer.openIterator(PigServer.java:925)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
or, using the command with an argument:
set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');
2016-01-29 00:25:02,767 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments '[no_schema_check]'
    at org.apache.pig.PigServer.openIterator(PigServer.java:925)
My samples are almost identical to the ones in the AvroStorage documentation, but I really can't see where the problem is.
The problem is also partially described on Stack Exchange.
Thank you
Created 02-05-2016 12:56 AM
fyi https://issues.apache.org/jira/browse/PIG-4793
org.apache.pig.piggybank.storage.avro.AvroStorage is deprecated; use AvroStorage('schema', '-d').
This works.
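For anyone landing here later, a minimal sketch of what that fix looks like in a script. This assumes Pig 0.14+, where the builtin AvroStorage (org.apache.pig.builtin.AvroStorage) is available without registering piggybank; the path and alias are illustrative:

```pig
-- builtin AvroStorage, no piggybank REGISTER needed (assumes Pig 0.14+)
inSet = LOAD '/spool-dir/CustomerData-20160128-1501807/' USING AvroStorage();
DUMP inSet;
```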
Created 01-29-2016 04:01 AM
Firstly, set is a reserved word, so change set to another alias. You can also refer to AvroStorage simply as AvroStorage; no need to write out the full package name. If all else fails, add a register piggybank.jar command.
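A sketch of that suggestion put together; the piggybank.jar path is an assumption and varies by installation:

```pig
-- register piggybank explicitly (jar location is illustrative)
REGISTER /usr/hdp/current/pig-client/piggybank.jar;

-- 'set' is a reserved word in Pig, so use a different alias
inSet = LOAD '/spool-dir/CustomerData-20160128-1501807/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```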
Created 01-29-2016 09:50 AM
Hi, sorry, the set was my typo. Here:
outSet = load 'hdfs:///CustomerData-20160128-1501807.avro' USING AvroStorage();
This command works, which is odd, because what's the difference between calling it as AvroStorage() and using the full package path
org.apache.pig.piggybank.storage.avro.AvroStorage()
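One way to take the ambiguity out of name resolution is a DEFINE, which pins exactly which class the short name refers to. A sketch, not from the original thread:

```pig
-- pin the short name to the piggybank implementation explicitly
DEFINE PiggybankAvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
outSet = LOAD 'hdfs:///CustomerData-20160128-1501807.avro' USING PiggybankAvroStorage;
```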
Created 01-29-2016 12:26 PM
@John Smith AvroStorage may be in a different package now. I confirmed with the javadoc and mine was the same as yours, but it may be packaged differently in HDP, or the classpath may differ; I don't know for sure. Please accept this answer.
Created 01-29-2016 09:52 AM
I have another issue with STORE now ....
STORE outputSet INTO 'hdfs:///avro-dest/CustomerData-20160128-1501807' USING AvroStorage('no_schema_check', 'schema', '{"type":"record","name":"xxx","fields":[{"name":"name","type":"string","title":"Customer name","description":"non Surrogate Key for joining files on the BDP"}, ....]}');
error below:
2016-01-29 09:48:42,211 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 20, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'AvroStorage' with arguments '[no_schema_check, schema, {"type":"record",
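For comparison, a STORE sketch that uses the full piggybank class name (which the next post reports is what actually parses). Only the one field from the original schema is kept here; the real record has more fields:

```pig
-- illustrative STORE with the piggybank class and a minimal schema
STORE outputSet INTO 'hdfs:///avro-dest/CustomerData-20160128-1501807'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        'no_schema_check',
        'schema', '{"type":"record","name":"xxx","fields":[{"name":"name","type":"string"}]}');
```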
Created 01-29-2016 10:16 AM
OK, so STORE works only with
org.apache.pig.piggybank.storage.avro.AvroStorage(.... )
But there are still issues while trying to write the output file:
2016-01-29 10:09:28,406 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1454023575813_0018
2016-01-29 10:09:28,406 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases outputSet
2016-01-29 10:09:28,406 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: C: R: outputSet[19,12]
2016-01-29 10:10:03,931 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016-01-29 10:10:03,931 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1454023575813_0018 has failed! Stop running all dependent jobs
2016-01-29 10:10:03,931 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-01-29 10:10:06,256 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2016-01-29 10:10:06,257 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/10.0.1.47:8050
2016-01-29 10:10:07,417 [main] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2016-01-29 10:10:07,417 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/10.0.1.47:8050
2016-01-29 10:10:07,577 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2016-01-29 10:10:07,585 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
Failed Jobs:
JobId Alias Feature Message Outputs
job_1454023575813_0018 outputSet DISTINCT Message: Job failed! hdfs:///avro-dest/CustomerData-20160128-1501807,
Output(s): Failed to produce result in "hdfs:///avro-dest/CustomerData-20160128-1501807"
Well, I really don't understand what's going on here ... no proper documentation, and with behavior this random, it's really hard to use a tool like this.
Created 01-29-2016 10:30 AM
And now it says that I can't read data ... both files are there ... and even the previous run read the source data successfully. Well, I'm so desperate; this is like working with a random Turing machine. ;-(
How can it fail to read data? I can easily DUMP both relations that read data from those input files.
Input(s):
Failed to read data from "hdfs:///CustomerData-20160128-1501807.avro"
Failed to read data from "hdfs:///CustomerData-20160128-1501807.avro"
Output(s): Failed to produce result in "hdfs:///CustomerData-20160128-1501807"
Created 01-29-2016 10:54 AM
Still failing ;-(
Failed Jobs:
JobId Alias Feature Message Outputs
job_1454023575813_0027 outputSet DISTINCT Message: Job failed! /CustomerData-20160128-1501807,
Input(s):
Successfully read 100 records from: "/CustomerData-20160128-1501807-l.avro"
Successfully read 100 records from: "/CustomerData-20160128-1501807-t.avro"
Output(s): Failed to produce result in "/avro-dest/CustomerData-20160128-1501807"
Created 01-29-2016 11:04 AM
here is the full log: log
Created 01-29-2016 12:28 PM
@John Smith I'll review and let you know.