Data Munging with Hadoop DataFu Sampling example getting error

Mentor

It may be my data that causes the problem: I had to create my own table from the downloaded dataset, so I used my own discretion when defining it. When I run the following code:

DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'medicare_part_b.medicare_part_b_2013_raw' using HCatLoader();
SAMPLE_BY_PROVIDERS = filter ROWS by SampleByKey(npi);
rmf medicare_part_b/ex2_by_npi_sample;
STORE SAMPLE_BY_PROVIDERS into 'medicare_part_b/ex2_by_npi_sample' using PigStorage(',');

I get the following error:

2016-02-26 02:07:58,053 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
Details at logfile: /root/pig_1456451995224.log

The log file shows this:

Pig Stack Trace
---------------
ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS


org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1694)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
        at org.apache.pig.Main.run(Main.java:565)
        at org.apache.pig.Main.main(Main.java:177)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.NullPointerException
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:310)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
        at org.apache.pig.PigServer.execute(PigServer.java:1364)
        at org.apache.pig.PigServer.access$500(PigServer.java:113)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1689)
        ... 14 more
Caused by: java.lang.NullPointerException
        at datafu.pig.sampling.SampleByKey.setUDFContextSignature(SampleByKey.java:86)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.setSignature(POUserFunc.java:611)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.instantiateFunc(POUserFunc.java:125)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.<init>(POUserFunc.java:120)
        at org.apache.pig.newplan.logical.expression.ExpToPhyTranslationVisitor.visit(ExpToPhyTranslationVisitor.java:505)
        at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:113)
        at org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
        at org.apache.pig.newplan.logical.relational.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:223)
        at org.apache.pig.newplan.logical.relational.LOFilter.accept(LOFilter.java:79)
        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:260)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:295)
        ... 19 more

It fails in both Tez and MapReduce mode.

Schema for the relation SAMPLE_BY_PROVIDERS:

grunt> DESCRIBE SAMPLE_BY_PROVIDERS;
SAMPLE_BY_PROVIDERS: {npi: chararray,nppes_provider_last_org_name: chararray,nppes_provider_first_name: chararray,nppes_provider_mi: chararray,nppes_credentials: chararray,nppes_provider_gender: chararray,nppes_entity_code: chararray,nppes_provider_street1: chararray,nppes_provider_street2: chararray,nppes_provider_city: chararray,nppes_provider_zip: chararray,nppes_provider_state: chararray,nppes_provider_country: chararray,provider_type: chararray,medicare_participation_indicator: chararray,places_of_service: chararray,hcpcs_code: chararray,hcpcs_desc: chararray,hcpcs_drug_indicator: chararray,line_srvc_cnt: int,bene_unique_cnt: int,bene_day_srvc_cnt: int,average_medicare_all_owed_amt: chararray,average_submitted_chrg_amt: chararray,stdev_submitted_chrg_amt: chararray,average_medicare_payment_amt: chararray,stdev_medicare_payment_amt: chararray}

Re: Data Munging with Hadoop DataFu Sampling example getting error

Explorer

Hi there, you've run into https://issues.apache.org/jira/browse/DATAFU-68. To fix this, you'll need to use DataFu 1.3. Which version of DataFu are you running?
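
If it is unclear which DataFu build Pig is picking up, one option is to register a specific jar at the top of the script. This is only a sketch: the jar path below is a hypothetical placeholder, not a location confirmed anywhere in this thread, and it assumes that no older DataFu jar already on Pig's classpath takes precedence.

```pig
-- Hypothetical path: point this at wherever the DataFu 1.3 jar lives on your system.
REGISTER /path/to/datafu-1.3.jar;

-- Same UDF definition as in the question; '0.2' is the sampling probability.
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
```

The rest of the script (the load, filter, and store statements) would stay as posted in the question.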

Re: Data Munging with Hadoop DataFu Sampling example getting error

Mentor

@Ofer Mendelevith @scasey any insight?

Re: Data Munging with Hadoop DataFu Sampling example getting error

Mentor

Thanks for looking into this, Casey; I was just talking to Ofer about this yesterday. I ran this on Sandbox 2.3.2. My steps are documented here: https://github.com/dbist/datamunging

I would really appreciate some feedback. Also, a TABLESAMPLE query with a percentage does not work on small datasets. I documented a workaround in my README.

Re: Data Munging with Hadoop DataFu Sampling example getting error

Mentor

@cstella I just confirmed that it still fails with DataFu 1.3 on latest Sandbox 2.4 v3.
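
While the UDF still fails, a similar per-key sample can be approximated with built-in Pig operators: group by the key, sample the groups, then flatten. The sketch below is not DataFu's implementation. It needs a reduce step, so it will likely be slower, and the built-in SAMPLE operator is random, so the set of providers kept can differ between runs. The relation names BY_NPI and SAMPLED_GROUPS are made up for illustration; the table, key, and output path are taken from the question.

```pig
ROWS = LOAD 'medicare_part_b.medicare_part_b_2013_raw' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- One group per provider; each group carries a bag named ROWS with that provider's records.
BY_NPI = GROUP ROWS BY npi;

-- SAMPLE keeps each group with probability 0.2, so a provider is kept or dropped as a whole.
SAMPLED_GROUPS = SAMPLE BY_NPI 0.2;

-- Unnest the bags to get back to one record per original row.
SAMPLE_BY_PROVIDERS = FOREACH SAMPLED_GROUPS GENERATE FLATTEN(ROWS);

STORE SAMPLE_BY_PROVIDERS INTO 'medicare_part_b/ex2_by_npi_sample' USING PigStorage(',');
```

After the FLATTEN, field names are prefixed with the source relation (for example ROWS::npi), which only matters if later statements refer to them by name.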

Re: Data Munging with Hadoop DataFu Sampling example getting error

Mentor

On Sandbox 2.5, DataFu is indeed 1.3, and I validated the function, albeit with a different dataset:

DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'sample_08' using HCatLoader();
SAMPLE_BY_total_emp = filter ROWS by SampleByKey(total_emp);
STORE SAMPLE_BY_total_emp into 'sample_total_emp';

[guest@sandbox ~]$ hdfs dfs -cat sample_total_emp/part-v000-o000-r-00000 | head -n 5
11-3011	Administrative services managers	246930	79500
11-9121	Natural sciences managers	43060	123140
13-1032	Insurance appraisers, auto damage	11280	53980
13-1051	Cost estimators	218400	60320
13-1072	Compensation, benefits, and job analysis specialists	116250	57060