Created 02-26-2016 02:12 AM
It may be my data that causes the problem, as I had to build my own table from the downloaded dataset and used my own discretion on the column types. When I run the following code:
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'medicare_part_b.medicare_part_b_2013_raw' using HCatLoader();
SAMPLE_BY_PROVIDERS = filter ROWS by SampleByKey(npi);
rmf medicare_part_b/ex2_by_npi_sample;
STORE SAMPLE_BY_PROVIDERS into 'medicare_part_b/ex2_by_npi_sample' using PigStorage(',');
I get the following error:

2016-02-26 02:07:58,053 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
Details at logfile: /root/pig_1456451995224.log

The log file shows this:
Pig Stack Trace
---------------
ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1694)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
    at org.apache.pig.Main.run(Main.java:565)
    at org.apache.pig.Main.main(Main.java:177)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.NullPointerException
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:310)
    at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
    at org.apache.pig.PigServer.execute(PigServer.java:1364)
    at org.apache.pig.PigServer.access$500(PigServer.java:113)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1689)
    ... 14 more
Caused by: java.lang.NullPointerException
    at datafu.pig.sampling.SampleByKey.setUDFContextSignature(SampleByKey.java:86)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.setSignature(POUserFunc.java:611)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.instantiateFunc(POUserFunc.java:125)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.<init>(POUserFunc.java:120)
    at org.apache.pig.newplan.logical.expression.ExpToPhyTranslationVisitor.visit(ExpToPhyTranslationVisitor.java:505)
    at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:113)
    at org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
    at org.apache.pig.newplan.logical.relational.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:223)
    at org.apache.pig.newplan.logical.relational.LOFilter.accept(LOFilter.java:79)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:260)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:295)
    ... 19 more
The script fails in both Tez and MapReduce execution modes.
Schema for the relation SAMPLE_BY_PROVIDERS:

grunt> DESCRIBE SAMPLE_BY_PROVIDERS;
SAMPLE_BY_PROVIDERS: {npi: chararray,nppes_provider_last_org_name: chararray,nppes_provider_first_name: chararray,nppes_provider_mi: chararray,nppes_credentials: chararray,nppes_provider_gender: chararray,nppes_entity_code: chararray,nppes_provider_street1: chararray,nppes_provider_street2: chararray,nppes_provider_city: chararray,nppes_provider_zip: chararray,nppes_provider_state: chararray,nppes_provider_country: chararray,provider_type: chararray,medicare_participation_indicator: chararray,places_of_service: chararray,hcpcs_code: chararray,hcpcs_desc: chararray,hcpcs_drug_indicator: chararray,line_srvc_cnt: int,bene_unique_cnt: int,bene_day_srvc_cnt: int,average_medicare_all_owed_amt: chararray,average_submitted_chrg_amt: chararray,stdev_submitted_chrg_amt: chararray,average_medicare_payment_amt: chararray,stdev_medicare_payment_amt: chararray}
Created 02-26-2016 02:13 AM
@Ofer Mendelevith @scasey any insight?
Created 04-07-2016 05:53 PM
Hi there, you've run into https://issues.apache.org/jira/browse/DATAFU-68. To fix this, you'll need to use DataFu v1.3. Which version of DataFu are you running?
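Once the 1.3 jar is available, it can help to register it explicitly before the DEFINE so Pig picks up the fixed UDF rather than an older copy on the classpath. A minimal sketch; the jar path below is an assumption, so point it at wherever your datafu-pig-1.3.0.jar actually lives:

```pig
-- Assumed jar location; substitute your actual DataFu 1.3 jar path
REGISTER /tmp/datafu-pig-1.3.0.jar;
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
```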
Created 04-07-2016 06:01 PM
Thanks for looking into this, Casey; I was just talking to Ofer about this yesterday. I ran this on Sandbox 2.3.2. My steps are documented here: https://github.com/dbist/datamunging
Would really appreciate some feedback. Also, a TABLESAMPLE query with a percentage does not work on small datasets; I documented a workaround in my README.
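While the DataFu issue is unresolved, one stopgap is Pig's built-in SAMPLE operator. Note this is only a sketch of an alternative, not equivalent behavior: SAMPLE picks individual rows at random, so unlike SampleByKey it will not keep all rows for a sampled npi together.

```pig
-- Row-level sampling (not key-consistent) as a stopgap for SampleByKey
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
ROWS = load 'medicare_part_b.medicare_part_b_2013_raw' using HCatLoader();
SAMPLE_ROWS = SAMPLE ROWS 0.2;
STORE SAMPLE_ROWS into 'medicare_part_b/ex2_row_sample' using PigStorage(',');
```

The output path ex2_row_sample is a hypothetical name chosen to avoid clobbering the original ex2_by_npi_sample directory.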
Created 04-21-2016 01:26 AM
@cstella I just confirmed that it still fails with DataFu 1.3 on the latest Sandbox 2.4 v3.
Created 07-26-2016 04:19 PM
On Sandbox 2.5, DataFu is indeed 1.3; I validated the function, albeit with a different dataset:
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'sample_08' using HCatLoader();
SAMPLE_BY_total_emp = filter ROWS by SampleByKey(total_emp);
STORE SAMPLE_BY_total_emp into 'sample_total_emp';
[guest@sandbox ~]$ hdfs dfs -cat sample_total_emp/part-v000-o000-r-00000 | head -n 5
11-3011	Administrative services managers	246930	79500
11-9121	Natural sciences managers	43060	123140
13-1032	Insurance appraisers, auto damage	11280	53980
13-1051	Cost estimators	218400	60320
13-1072	Compensation, benefits, and job analysis specialists	116250	57060