Created 01-29-2016 03:32 AM
Hi,
I have a simple Pig script. I'm trying to load an Avro file, or a directory that contains Avro files, using AvroStorage in MapReduce mode. I tried almost all the path combinations (hdfs://, /, hdfs://ip:port/file ...) but nothing works.
Using the command below
set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggybank.storage.avro.AvroStorage ();
I got this error:
2016-01-29 00:10:08,439 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
2016-01-29 00:10:08,439 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2016-01-29 00:10:08,439 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
at org.apache.pig.PigServer.openIterator(PigServer.java:925)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
Or, using the command with an argument:
set = load '/spool-dir/CustomerData-20160128-1501807/' USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');
2016-01-29 00:25:02,767 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sensitiveSet. Backend error : java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments '[no_schema_check]'
at org.apache.pig.PigServer.openIterator(PigServer.java:925
My samples are almost identical to the ones in the AvroStorage documentation, but I really can't see where the problem is.
The problem is also partially described on Stack Exchange.
Thank you
Created 02-05-2016 12:56 AM
FYI: https://issues.apache.org/jira/browse/PIG-4793
org.apache.pig.piggybank.storage.avro.AvroStorage is deprecated; use AvroStorage('schema', '-d') instead.
This works.
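For reference, a minimal sketch of that call against the joined relation discussed below, assuming the built-in org.apache.pig.builtin.AvroStorage that ships with Pig 0.14+; I read the first argument as the output record name, and '-d' (--doublecolons) as the option that rewrites the '::' prefixes the join puts into the field names:
-- sketch only: store the joined relation with the built-in AvroStorage;
-- '-d' converts nonSensSet::name etc. into nonSensSet__name so the
-- generated Avro field names stay legal
store outputSet into '/avro-dest/Test-20160129-1401822' using AvroStorage('schema', '-d');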
Created 01-29-2016 04:13 PM
This is the schema:
outputSet: {nonSensSet::name: chararray,nonSensSet::customerId: chararray,sensitiveSet::VIN: chararray,sensitiveSet::Birthdate: chararray,nonSensSet::Mileage: chararray,nonSensSet::Fuel_Consumption: chararray}
Created 01-29-2016 04:17 PM
Yes, I got the same. One more thing to try is to store with PigStorage, then in another script load that dataset using PigStorage and store it as Avro. I'm wondering if that will work. @John Smith
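Something like this, as a rough and untested sketch (the /tmp paths and the tab delimiter are only placeholders):
-- script 1: dump the joined relation to plain delimited text
store outputSet into '/tmp/outputSet-text' using PigStorage('\t');

-- script 2: reload it with clean field names (no :: prefixes), then write Avro
reloaded = load '/tmp/outputSet-text' using PigStorage('\t')
    as (name:chararray, customerId:chararray, VIN:chararray,
        Birthdate:chararray, Mileage:chararray, Fuel_Consumption:chararray);
store reloaded into '/tmp/outputSet-avro' using AvroStorage();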
Created 01-29-2016 04:49 PM
I did this:
outputSet = foreach outputSet generate $0 as (name:chararray) , $1 as (customerId:chararray), $2 as (VIN:chararray) , $3 as (Birthdate:chararray), $4 as (Mileage:chararray) ,$5 as (Fuel_Consumption:chararray);
and the command below worked:
store outputSet into 'avrostorage' using AvroStorage();
Output(s):
Successfully stored 100 records in: "file:///root/deploy-3/avrostorage"
That's strange. Apparently there is an issue when the relation is described as:
grunt> describe outputSet;
outputSet: {nonSensSet::name: chararray,nonSensSet::customerId: chararray,sensitiveSet::VIN: chararray,sensitiveSet::Birthdate: chararray,nonSensSet::Mileage: chararray,nonSensSet::Fuel_Consumption: chararray}
but
/AvroStorageSchemaConversionUtilities.java contains this code:
if (doubleColonsToDoubleUnderscores) {
name = name.replace("::", "__");
}
There is still the same problem when I try to store using AvroStorage from the script provided:
Output(s): Failed to produce result in "/avro-dest/Test-20160129-1401822"
Created 01-29-2016 04:54 PM
@John Smith, time to file a JIRA; great job investigating this.
Created 01-29-2016 05:05 PM
Could you please add this line before the STORE
outputSet = foreach outputSet generate $0 as (name:chararray) , $1 as (customerId:chararray), $2 as (VIN:chararray) , $3 as (Birthdate:chararray), $4 as (Mileage:chararray) ,$5 as (Fuel_Consumption:chararray);
and execute my Pig script in your environment?
Created 01-29-2016 05:06 PM
I don't know what happened, but I can't load any Avro file in MapReduce mode ...
grunt> sensitiveSet = load '/t-spool-dir/Test-20160129-1401822-ttp.avro' USING AvroStorage();
2016-01-29 17:06:00,668 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: null
Details at logfile: /tmp/hsperfdata_hdfs/pig_1454087102249.log
Pig Stack Trace
---------------
ERROR 1200: null
Failed to parse: null
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:201)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1707)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1680)
at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:565)
at org.apache.pig.Main.main(Main.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.NullPointerException
at org.apache.pig.builtin.AvroStorage.getAvroSchema(AvroStorage.java:298)
at org.apache.pig.builtin.AvroStorage.getAvroSchema(AvroStorage.java:282)
at org.apache.pig.builtin.AvroStorage.getSchema(AvroStorage.java:256)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:901)
at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
... 16 more
================================================================================
/tmp/hsperfdata_hdfs/pig_1454087102249.log (END)
Created 01-29-2016 05:10 PM
Oops, sorry, my fault ... I don't have that source stored in HDFS ... time to stop debugging for today :-)
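For the record, the missing input can be checked and pushed to HDFS straight from the grunt shell before retrying the load; a quick sketch, where the local source path is only a guess based on the earlier local-mode output:
grunt> fs -ls /t-spool-dir
grunt> copyFromLocal /root/deploy-3/Test-20160129-1401822-ttp.avro /t-spool-dir/Test-20160129-1401822-ttp.avro
grunt> sensitiveSet = load '/t-spool-dir/Test-20160129-1401822-ttp.avro' USING AvroStorage();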
Created 01-29-2016 05:38 PM
OK, I added the line
outputSet = foreach outputSet generate $0 as (name:chararray) , $1 as (customerId:chararray), $2 as (VIN:chararray) , $3 as (Birthdate:chararray), $4 as (Mileage:chararray) ,$5 as (Fuel_Consumption:chararray);
and successfully created the output Avro file using:
store outputSet into 'avrostorage' using AvroStorage();
When I try to store the output file using the code below, it fails:
/10.0.1.47:8050
2016-01-29 17:24:39,600 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
At this point I clearly have no idea what else I can do.
STORE outputSet INTO '/avro-dest/Test-20160129-1401822' USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check', 'schema', '{"type":"record","name":"test","fields":[{"name":"name","type":"string","title":"Customer name","description":"non Surrogate Key for joining files on the BDP","DataOwner":"Bank","ValidityDate":"2015.12.22","ValidityOption":"Delete","DataSensitivityLevel":"0","FieldPosition":"1"},{"name":"customerId","type":"string","title":"customer Id","description":"non sensitive field of customer Id","DataOwner":"Bank","ValidityDate":"2015.12.22","ValidityOption":"Retain","DataSensitivityLevel":"0","FieldPosition":"2"},{"name":"VIN","type":"string","title":"Customer VIN","description":"Customer VIN","DataOwner":"Bank","ValidityDate":"2015.12.22","ValidityOption":"Delete","DataSensitivityLevel":"1","FieldPosition":"3"},{"name":"Birthdate","type":"string","title":"Customer birthdate","description":"Customer birthdate","DataOwner":"Bank","ValidityDate":"2015.12.22","ValidityOption":"Delete","DataSensitivityLevel":"1","FieldPosition":"4"},{"name":"Mileage","type":"string","title":"Customer mileage","description":"Customer mileage","DataOwner":"Bank","ValidityDate":"2015.12.22","ValidityOption":"Delete","DataSensitivityLevel":"0","FieldPosition":"5"},{"name":"Fuel_Consumption","type":"string","title":"Customer fule consumption","description":"Customer fuel consumption","DataOwner":"Bank","ValidityDate":"2015.12.22","ValidityOption":"Delete","DataSensitivityLevel":"0","FieldPosition":"6"}]}');
Created 01-29-2016 06:06 PM
@John Smith the code work-around works; I was running in Tez mode, by the way.
outputSet = foreach outputSet generate $0 as (name:chararray), $1 as (customerId:chararray), $2 as (VIN:chararray), $3 as (Birthdate:chararray), $4 as (Mileage:chararray), $5 as (Fuel_Consumption:chararray);
store outputSet into 'avroout2' using AvroStorage();
Input(s):
Successfully read 100 records (15099 bytes) from: "/user/root/Test-20160129-1401822-lake.avro"
Successfully read 100 records (12703 bytes) from: "/user/root/Test-20160129-1401822-ttp.avro"
Output(s):
Successfully stored 100 records (7703 bytes) in: "hdfs://sandbox.hortonworks.com:8020/user/root/avroout2"
grunt> 2016-01-29 18:04:19,978 [main] INFO org.apache.pig.Main - Pig script completed in 1 minute, 52 seconds and 249 milliseconds (112249 ms)
2016-01-29 18:04:19,978 [main] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - Shutting down thread pool
2016-01-29 18:04:20,008 [Thread-1] ERROR org.apache.pig.impl.io.FileLocalizer - java.io.IOException: Filesystem closed
2016-01-29 18:04:20,025 [Thread-23] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager - Shutting down Tez session org.apache.tez.client.TezClient@2c8b16b6
2016-01-29 18:04:20,025 [Thread-23] INFO org.apache.tez.client.TezClient - Shutting down Tez Session, sessionName=PigLatin:DefaultJobName, applicationId=application_1454090472993_0001
[root@sandbox pig-upload]# hdfs dfs -ls avroout2
Found 2 items
-rw-r--r--   3 root hdfs       0 2016-01-29 18:03 avroout2/_SUCCESS
-rw-r--r--   3 root hdfs    7703 2016-01-29 18:03 avroout2/part-v003-o000-r-00000.avro
[root@sandbox pig-upload]# hdfs dfs -cat avroout2/part-v003-o000-r-00000.avro | less
Created 01-29-2016 06:10 PM
Yes, it works for me also, but when I use
STORE outputSet INTO '/avro-dest/Test-20160129-1401822' USING org.apache.pig.piggybank.storage.avro.AvroStorage
and I define the schema as part of AvroStorage( schema ) ... it doesn't work ;-(((
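Untested thought: since the FOREACH rename above already strips the :: prefixes, it might be worth pairing it with the built-in AvroStorage and the schema inlined as the first argument, instead of the deprecated piggybank class; the schema below keeps only name/type and drops the custom attributes (title, DataOwner, ...) purely to keep the sketch short:
-- hypothetical combination of the rename work-around and an inline output schema
outputSet = foreach outputSet generate $0 as (name:chararray), $1 as (customerId:chararray), $2 as (VIN:chararray), $3 as (Birthdate:chararray), $4 as (Mileage:chararray), $5 as (Fuel_Consumption:chararray);
store outputSet into '/avro-dest/Test-20160129-1401822' using AvroStorage('{"type":"record","name":"test","fields":[{"name":"name","type":"string"},{"name":"customerId","type":"string"},{"name":"VIN","type":"string"},{"name":"Birthdate","type":"string"},{"name":"Mileage","type":"string"},{"name":"Fuel_Consumption","type":"string"}]}');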