Created 01-29-2016 03:32 AM
Hi,
I have a simple Pig script. I'm trying to load an Avro file, or a directory that contains an Avro file, using AvroStorage in MapReduce mode. I have tried almost every path form (hdfs://, /, hdfs://ip:port/file ...), but nothing works.
Using the command below:
set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggybank.storage.avro.AvroStorage ();
I get this error:
2016-01-29 00:10:08,439 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
2016-01-29 00:10:08,439 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2016-01-29 00:10:08,439 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
    at org.apache.pig.PigServer.openIterator(PigServer.java:925)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
Or, using the command with an argument:
set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');
2016-01-29 00:25:02,767 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments '[no_schema_check]'
    at org.apache.pig.PigServer.openIterator(PigServer.java:925
My samples are almost identical to the ones in the AvroStorage documentation, but I really can't see where the problem is.
The problem is also partially described on Stack Exchange.
Thank you
Created 02-05-2016 12:56 AM
FYI: https://issues.apache.org/jira/browse/PIG-4793
org.apache.pig.piggybank.storage.avro.AvroStorage is deprecated; use AvroStorage('schema', '-d') instead.
This works.
Created 01-29-2016 06:12 PM
@John Smith yes, I followed your efforts. Can you live with the workaround? I honestly don't have the cycles to investigate further for you. I think I learned something myself thanks to you :). I'd suggest opening an article on HCC with the proposed workaround and your desired goal; maybe someone else can weigh in. Great job, John!
Created 01-29-2016 06:54 PM
Well, I can't live with that workaround; that's the problem. What is HCC?
Created 01-29-2016 06:56 PM
This website is called Hortonworks Community Connection, HCC for short. Again, post this as a separate issue with tags for Avro and Pig. @John Smith
Created 01-31-2016 02:26 PM
Is there any update on this?
Created 02-01-2016 10:34 AM
One more important observation: when I dump data into Avro using
store outputSet into 'avrostorage' using AvroStorage();
the schema inside the Avro file looks like:
{"type":"record","name":"pig_output","fields":[{"name":"name","type":["null","string"]},{"name":"customerId","type":["null","string"]},{"name":"VIN","type":["null","string"]},{"name":"Birthdate","type":["null","string"]},{"name":"Mileage","type":["null","string"]},{"name":"Fuel_Consumption","type":["null","string"]}]}
Why does each field's type contain null?
Created 02-01-2016 11:26 AM
@John Smith it means the field can be null if it is missing; that is, it's an optional field. That way, if you don't pass a field, it won't complain.
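To illustrate: in an Avro schema, a type written as a JSON array such as ["null","string"] is a union, so the field may hold either a null or a string. Below is a minimal Python sketch (standard library only; the helper name nullable_fields is made up for illustration and is not part of Avro or Pig) that lists which fields of the schema posted above are nullable.

```python
import json

# Schema copied from the post above (the one AvroStorage wrote on store).
SCHEMA_JSON = """
{"type":"record","name":"pig_output","fields":[
  {"name":"name","type":["null","string"]},
  {"name":"customerId","type":["null","string"]},
  {"name":"VIN","type":["null","string"]},
  {"name":"Birthdate","type":["null","string"]},
  {"name":"Mileage","type":["null","string"]},
  {"name":"Fuel_Consumption","type":["null","string"]}
]}
"""


def nullable_fields(schema):
    """Return the names of record fields whose type is a union that includes "null".

    Hypothetical helper for illustration only.
    """
    names = []
    for field in schema["fields"]:
        field_type = field["type"]
        # In Avro's JSON encoding a union is written as a JSON array of types.
        if isinstance(field_type, list) and "null" in field_type:
            names.append(field["name"])
    return names


schema = json.loads(SCHEMA_JSON)
print(nullable_fields(schema))
```

For this schema every field is reported, which matches what the question describes: each column is declared as "null or string" regardless of whether the actual data contains nulls.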
Created 02-01-2016 12:20 PM
Sure, but the input data contains all the fields, so my question is why it generates [null] as part of the data type.
Also still no luck with
Created 02-01-2016 12:22 PM
@John Smith read the Avro docs for an explanation of optional vs. default fields.
Created 02-01-2016 12:27 PM
Ah, I already did ... my question was why it's there ... when I use local mode it's not there ... Anyway, there is no reply from anyone behind AvroStorage ... that's pretty odd.
Created 02-01-2016 12:38 PM
@John Smith it's a better practice, so that if you do happen to get a null, at least it won't bomb. As for the JIRA, that's open source; individual contributors also need to earn a living, and if they have higher-priority responsibilities they'll get to it when their queue is clear. I wouldn't get your hopes up, and I'd identify alternative ways. Shoot an email to the Avro mailing list; they may help faster.