Created 01-29-2016 03:32 AM
Hi,
I have a simple Pig script. I'm trying to load an Avro file, or a directory that contains Avro files, using AvroStorage in MapReduce mode. I tried almost every path form (hdfs://, /, hdfs://ip:port/file ...), but nothing works.
Using the command below:
set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggybank.storage.avro.AvroStorage ();
I got this error:
2016-01-29 00:10:08,439 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
2016-01-29 00:10:08,439 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2016-01-29 00:10:08,439 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
    at org.apache.pig.PigServer.openIterator(PigServer.java:925)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
Or, using the command with an argument:
set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');
2016-01-29 00:25:02,767 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments '[no_schema_check]'
    at org.apache.pig.PigServer.openIterator(PigServer.java:925
My samples are almost identical to the ones in the AvroStorage documentation, but I really can't see where the problem is.
The problem is also partially described on Stack Exchange.
Thank you
Created 02-05-2016 12:56 AM
fyi https://issues.apache.org/jira/browse/PIG-4793
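org.apache.pig.piggybank.storage.avro.AvroStorage is deprecated; use the built-in AvroStorage('schema', '-d') instead. The -d option translates double colons in field names into double underscores.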
This works.
Created 01-29-2016 03:11 PM
That's ok 😉
Created 01-29-2016 03:09 PM
That's strange... it works for me.
grunt> sensitiveSet = load '/t-spool-dir/Test-20160129-1401822-ttp.avro' USING AvroStorage();
grunt> nonSensSet = load '/d-spool-dir/Test-20160129-1401822-lake.avro' USING AvroStorage();
grunt> outputSet = join sensitiveSet by Row_ID, nonSensSet by Row_ID;
grunt> outputSet = distinct outputSet;
grunt> outputSet = foreach outputSet generate nonSensSet::name,nonSensSet::customerId,sensitiveSet::VIN,sensitiveSet::Birthdate,nonSensSet::Mileage,nonSensSet::Fuel_Consumption;
grunt> dump outputSet;
("Kina Buttars",12452346,"WBA32649710927373","1968-08-14",68,10.551)
("Caren Rodman",18853438,"WBA56064572124841","1987-01-24",96,6.779)
("Tierra Bork",89673290,"WBA69315467645466","1958-11-22",52,10.109)
("Thelma Steve",97170856,"WBA73739033913927","1985-12-03",98,5.081)
.....
Created 01-29-2016 03:31 PM
Your issue is with some reserved word in the Avro schema. Here's what I'm getting:
grunt> nonSensSet = load '/user/root/Test-20160129-1401822-lake.avro' USING AvroStorage();
grunt> sensitiveSet = load '/user/root/Test-20160129-1401822-ttp.avro' using AvroStorage();
grunt> outputSet = join sensitiveSet by Row_ID, nonSensSet by Row_ID;
grunt> outputSet = distinct outputSet;
grunt> outputSet = foreach outputSet generate nonSensSet::name,nonSensSet::customerId,nonSensSet::Mileage,nonSensSet::Fuel_Consumption,sensitiveSet::VIN,sensitiveSet::Birthdate;
grunt> store outputSet into 'avrostorage' using AvroStorage();
2016-01-29 15:27:00,682 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2116: <line 6, column 0> Output Location Validation Failed for: 'hdfs://sandbox.hortonworks.com:8020/user/root/avrostorage More info to follow:
Pig Schema contains a name that is not allowed in Avro
Details at logfile: /root/pig-upload/pig_1454081182813.log
I saved outputSet successfully using PigStorage(','), so I can't say what the issue is. Something intricate about Avro.
Created 01-29-2016 03:38 PM
@John Smith I just read the AvroStorage wiki. They do say they have limited support for union schemas and record types, so I guess the only thing I can say is that AvroStorage is limited in its functionality. Perhaps you'd want to look at other storage formats.
Created 01-29-2016 03:49 PM
Created 01-29-2016 03:56 PM
@John Smith like I said, I tried with PigStorage and it worked fine. Take a look at OrcStorage; ORC is a pretty good columnar format for Pig, Hive and Spark (meaning you can query the same table from any of these tools natively). There are many formats, and I can't recommend anything unless we know your use case. I do like Avro, but sometimes it drives me insane :). Try looking at the schemas; you can probably still get it working, I just don't have time to look at it. If you do find a solution, post it here so we can all learn!
Created 01-29-2016 03:49 PM
Is there anything important in
Details at logfile: /root/pig-upload/pig_1454081182813.log
Created 01-29-2016 03:52 PM
Same error as I pasted. @John Smith
Created 01-29-2016 04:10 PM
/**
 * Translates a name in a pig schema to an acceptable Avro name, or
 * throws an error if the name can't be translated.
 * @param name The variable name to translate.
 * @param doubleColonsToDoubleUnderscores Indicates whether to translate
 * double colons to underscores or throw an error if they are encountered.
 * @return A name usable by Avro.
 * @throws IOException If the name is not compatible with Avro.
 */
private static String toAvroName(String name,
    final Boolean doubleColonsToDoubleUnderscores) throws IOException {
  if (name == null) {
    return null;
  }
  if (doubleColonsToDoubleUnderscores) {
    name = name.replace("::", "__");
  }
  if (name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
    return name;
  } else {
    throw new IOException(
        "Pig Schema contains a name that is not allowed in Avro");
  }
}
This is the check, and I don't have any characters other than [A-Za-z_][A-Za-z0-9_] defined as part of the schema in Pig.
Btw, I don't know why, but whenever I paste some code here and click to format it as code, it gets completely messed up: all newlines are removed.
Created 01-29-2016 04:17 PM
@John Smith excellent, you went to the source code. The pattern is actually [A-Za-z_][A-Za-z0-9_]*, so with a trailing asterisk. If you checked and found no offending names, perhaps you discovered a bug? Once you're 100% sure, I suggest you file a Jira with the Pig project.
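For anyone landing here later: the check quoted above can be tried in isolation. The sketch below is a standalone copy of that logic, not the Pig class itself; the class name AvroNameCheck and the main method are made up for illustration. It feeds the check field names of the kind produced by the join in this thread (Pig prefixes joined fields with alias::), which suggests why the store may fail even when the original schema has only plain names. That the '-d' option corresponds to the doubleColonsToDoubleUnderscores flag is an assumption based on the parameter name.

```java
import java.io.IOException;

/** Standalone re-creation of the quoted name check, for illustration only. */
final class AvroNameCheck {

    // Same pattern as in the quoted toAvroName method.
    private static final String AVRO_NAME_PATTERN = "[A-Za-z_][A-Za-z0-9_]*";

    static String toAvroName(String name, boolean doubleColonsToDoubleUnderscores)
            throws IOException {
        if (name == null) {
            return null;
        }
        if (doubleColonsToDoubleUnderscores) {
            // "sensitiveSet::VIN" becomes "sensitiveSet__VIN"
            name = name.replace("::", "__");
        }
        if (name.matches(AVRO_NAME_PATTERN)) {
            return name;
        }
        throw new IOException("Pig Schema contains a name that is not allowed in Avro");
    }

    public static void main(String[] args) {
        // Field names taken from the thread: one plain, two with the join prefix.
        String[] names = {"Row_ID", "nonSensSet::name", "sensitiveSet::VIN"};
        for (String name : names) {
            for (boolean translate : new boolean[] {false, true}) {
                try {
                    System.out.println(name + " (translate=" + translate + ") -> "
                            + toAvroName(name, translate));
                } catch (IOException e) {
                    System.out.println(name + " (translate=" + translate + ") -> rejected: "
                            + e.getMessage());
                }
            }
        }
    }
}
```

With translation off, the names containing :: are rejected because a colon is not in the allowed character set; with translation on, they pass as nonSensSet__name and sensitiveSet__VIN. Renaming the fields in the foreach (for example with AS) before the store would presumably avoid the problem as well.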