Data Munging with Hadoop DataFu Sampling example getting error

Mentor

It may be my data that causes the problem: I had to create my own table from the downloaded dataset, so I used my own discretion when defining it. When I run the following code:

DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'medicare_part_b.medicare_part_b_2013_raw' using HCatLoader();
SAMPLE_BY_PROVIDERS = filter ROWS by SampleByKey(npi);
rmf medicare_part_b/ex2_by_npi_sample;
STORE SAMPLE_BY_PROVIDERS into 'medicare_part_b/ex2_by_npi_sample' using PigStorage(',');

I get the following error:

2016-02-26 02:07:58,053 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
Details at logfile: /root/pig_1456451995224.log

The log file shows this:

Pig Stack Trace
---------------
ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS


org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1694)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
        at org.apache.pig.Main.run(Main.java:565)
        at org.apache.pig.Main.main(Main.java:177)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.NullPointerException
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:310)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
        at org.apache.pig.PigServer.execute(PigServer.java:1364)
        at org.apache.pig.PigServer.access$500(PigServer.java:113)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1689)
        ... 14 more
Caused by: java.lang.NullPointerException
        at datafu.pig.sampling.SampleByKey.setUDFContextSignature(SampleByKey.java:86)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.setSignature(POUserFunc.java:611)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.instantiateFunc(POUserFunc.java:125)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.<init>(POUserFunc.java:120)
        at org.apache.pig.newplan.logical.expression.ExpToPhyTranslationVisitor.visit(ExpToPhyTranslationVisitor.java:505)
        at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:113)
        at org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
        at org.apache.pig.newplan.logical.relational.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:223)
        at org.apache.pig.newplan.logical.relational.LOFilter.accept(LOFilter.java:79)
        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:260)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:295)
        ... 19 more

It fails in both Tez and MapReduce mode.

Schema for the relation SAMPLE_BY_PROVIDERS:

grunt> DESCRIBE SAMPLE_BY_PROVIDERS;
SAMPLE_BY_PROVIDERS: {npi: chararray,nppes_provider_last_org_name: chararray,nppes_provider_first_name: chararray,nppes_provider_mi: chararray,nppes_credentials: chararray,nppes_provider_gender: chararray,nppes_entity_code: chararray,nppes_provider_street1: chararray,nppes_provider_street2: chararray,nppes_provider_city: chararray,nppes_provider_zip: chararray,nppes_provider_state: chararray,nppes_provider_country: chararray,provider_type: chararray,medicare_participation_indicator: chararray,places_of_service: chararray,hcpcs_code: chararray,hcpcs_desc: chararray,hcpcs_drug_indicator: chararray,line_srvc_cnt: int,bene_unique_cnt: int,bene_day_srvc_cnt: int,average_medicare_all_owed_amt: chararray,average_submitted_chrg_amt: chararray,stdev_submitted_chrg_amt: chararray,average_medicare_payment_amt: chararray,stdev_medicare_payment_amt: chararray}

Accepted Solution

Re: Data Munging with Hadoop DataFu Sampling example getting error

Explorer

Hi there, you've run into https://issues.apache.org/jira/browse/DATAFU-68. To fix this, you'll need to use DataFu 1.3. Which version of DataFu are you running?

Re: Data Munging with Hadoop DataFu Sampling example getting error

Mentor

@Ofer Mendelevith @scasey any insight?

Re: Data Munging with Hadoop DataFu Sampling example getting error

Mentor

Thanks for looking into this, Casey. I was just talking to Ofer about this yesterday. I ran this on Sandbox 2.3.2. My steps are documented here: https://github.com/dbist/datamunging

I would really appreciate some feedback. Also, a TABLESAMPLE query with a percentage does not work on small datasets. I documented a workaround in my README.

Re: Data Munging with Hadoop DataFu Sampling example getting error

Mentor

@cstella I just confirmed that it still fails with DataFu 1.3 on the latest Sandbox 2.4 v3.

Re: Data Munging with Hadoop DataFu Sampling example getting error

Mentor

On Sandbox 2.5, DataFu is indeed 1.3. I validated the function, albeit with a different dataset:

DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'sample_08' using HCatLoader();
SAMPLE_BY_total_emp = filter ROWS by SampleByKey(total_emp);
STORE SAMPLE_BY_total_emp into 'sample_total_emp';

[guest@sandbox ~]$ hdfs dfs -cat sample_total_emp/part-v000-o000-r-00000 | head -n 5
11-3011	Administrative services managers	246930	79500
11-9121	Natural sciences managers	43060	123140
13-1032	Insurance appraisers, auto damage	11280	53980
13-1051	Cost estimators	218400	60320
13-1072	Compensation, benefits, and job analysis specialists	116250	57060
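
For anyone reading this thread later and wondering what sampling "by key" buys you over a plain row sample: as far as I understand the DataFu documentation, SampleByKey decides per key rather than per row, so all rows sharing a key (an npi in the original script, total_emp here) are kept or dropped together. The Python sketch below only illustrates that idea. It is not DataFu's implementation, and the MD5 hash, the `salt` parameter, the function names, and the made-up rows are all assumptions for illustration.

```python
import hashlib


def keep_key(key, fraction, salt=""):
    """Decide deterministically whether rows with this key are sampled."""
    digest = hashlib.md5((salt + str(key)).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and keep the
    # key if it falls in the lowest `fraction` of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < fraction * 2**64


def sample_by_key(rows, key_of, fraction, salt=""):
    """Keep every row whose key is selected; drop every row whose key is not."""
    return [row for row in rows if keep_key(key_of(row), fraction, salt)]


# Hypothetical data: 50 provider ids with 10 rows each.
rows = [(f"npi{i % 50}", f"code{i}") for i in range(500)]
sample = sample_by_key(rows, key_of=lambda r: r[0], fraction=0.2)
sampled_keys = {r[0] for r in sample}
```

Because the decision depends only on the key, rerunning the sketch on the same input picks the same keys, and every selected key keeps all of its rows. The share of keys kept is only approximately `fraction`, since it depends on how the hashes happen to fall.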