Data Munging with Hadoop: DataFu sampling example fails with an error

Master Mentor

It may be my data that causes the problem: I had to create my own table from the downloaded dataset, so I used my own discretion. When I run the following code:

DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'medicare_part_b.medicare_part_b_2013_raw' using HCatLoader();
SAMPLE_BY_PROVIDERS = filter ROWS by SampleByKey(npi);
rmf medicare_part_b/ex2_by_npi_sample;
STORE SAMPLE_BY_PROVIDERS into 'medicare_part_b/ex2_by_npi_sample' using PigStorage(',');

I get the following error:

2016-02-26 02:07:58,053 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
Details at logfile: /root/pig_1456451995224.log

The log file shows this:

Pig Stack Trace
---------------
ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS


org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1694)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
        at org.apache.pig.Main.run(Main.java:565)
        at org.apache.pig.Main.main(Main.java:177)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.NullPointerException
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:310)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
        at org.apache.pig.PigServer.execute(PigServer.java:1364)
        at org.apache.pig.PigServer.access$500(PigServer.java:113)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1689)
        ... 14 more
Caused by: java.lang.NullPointerException
Caused by: java.lang.NullPointerException
        at datafu.pig.sampling.SampleByKey.setUDFContextSignature(SampleByKey.java:86)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.setSignature(POUserFunc.java:611)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.instantiateFunc(POUserFunc.java:125)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.<init>(POUserFunc.java:120)
        at org.apache.pig.newplan.logical.expression.ExpToPhyTranslationVisitor.visit(ExpToPhyTranslationVisitor.java:505)
        at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:113)
        at org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
        at org.apache.pig.newplan.logical.relational.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:223)
        at org.apache.pig.newplan.logical.relational.LOFilter.accept(LOFilter.java:79)
        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:260)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:295)
        ... 19 more

The script fails in both Tez and MapReduce mode.

Schema for the relation SAMPLE_BY_PROVIDERS:

grunt> DESCRIBE SAMPLE_BY_PROVIDERS;
SAMPLE_BY_PROVIDERS: {npi: chararray,nppes_provider_last_org_name: chararray,nppes_provider_first_name: chararray,nppes_provider_mi: chararray,nppes_credentials: chararray,nppes_provider_gender: chararray,nppes_entity_code: chararray,nppes_provider_street1: chararray,nppes_provider_street2: chararray,nppes_provider_city: chararray,nppes_provider_zip: chararray,nppes_provider_state: chararray,nppes_provider_country: chararray,provider_type: chararray,medicare_participation_indicator: chararray,places_of_service: chararray,hcpcs_code: chararray,hcpcs_desc: chararray,hcpcs_drug_indicator: chararray,line_srvc_cnt: int,bene_unique_cnt: int,bene_day_srvc_cnt: int,average_medicare_all_owed_amt: chararray,average_submitted_chrg_amt: chararray,stdev_submitted_chrg_amt: chararray,average_medicare_payment_amt: chararray,stdev_medicare_payment_amt: chararray}

ACCEPTED SOLUTION

Contributor

Hi there, you've run into DATAFU-68 (https://issues.apache.org/jira/browse/DATAFU-68). To fix this, you'll need to use DataFu 1.3. Which version of DataFu are you running?

REPLIES

Master Mentor

@Ofer Mendelevith @scasey any insight?

Master Mentor

Thanks for looking into this, Casey. I was just talking to Ofer about it yesterday. I ran this on Sandbox 2.3.2. My steps are documented here: https://github.com/dbist/datamunging

I would really appreciate some feedback. Also, a TABLESAMPLE query with a percentage does not work on small datasets. I documented a workaround in my README.

Master Mentor

@cstella I just confirmed that it still fails with DataFu 1.3 on the latest Sandbox 2.4 v3.

Master Mentor

On Sandbox 2.5, DataFu is indeed 1.3. I validated the function, albeit with a different dataset:

DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'sample_08' using HCatLoader();
SAMPLE_BY_total_emp = filter ROWS by SampleByKey(total_emp);
STORE SAMPLE_BY_total_emp into 'sample_total_emp';

[guest@sandbox ~]$ hdfs dfs -cat sample_total_emp/part-v000-o000-r-00000 | head -n 5
11-3011	Administrative services managers	246930	79500
11-9121	Natural sciences managers	43060	123140
13-1032	Insurance appraisers, auto damage	11280	53980
13-1051	Cost estimators	218400	60320
13-1072	Compensation, benefits, and job analysis specialists	116250	57060
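
For anyone reading this thread who wonders what `filter ROWS by SampleByKey(npi)` is supposed to produce once it runs: the idea of sampling by key is that the keep-or-drop decision is made per key rather than per row, so every row for a given provider is either in the sample or out of it. The Python sketch below only illustrates that idea. It is not DataFu's implementation; the function names, the MD5 hash and the `demo-salt` value are all assumptions made for the example.

```python
import hashlib


def keep_key(key, fraction, salt="demo-salt"):
    """Decide from the key alone whether its rows belong to the sample.

    Hypothetical helper for illustration; DataFu's own hashing may differ.
    """
    digest = hashlib.md5((salt + str(key)).encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a roughly uniform number
    # between 0 and 1, then compare it with the requested fraction.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def sample_by_key(rows, key_index, fraction, salt="demo-salt"):
    """Keep every row whose key hashes below the requested fraction."""
    return [row for row in rows if keep_key(row[key_index], fraction, salt)]


# Made-up data: 1000 provider ids with three rows each.
rows = [(npi, code) for npi in range(1000) for code in ("A", "B", "C")]
sampled = sample_by_key(rows, 0, 0.2)
kept_keys = {row[0] for row in sampled}
```

Because the decision depends only on the key and the salt, running the sketch again with the same salt selects the same providers, and about 20% of the distinct keys (not 20% of the rows) end up in the sample.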