<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Data Munging with Hadoop DataFu Sampling example getting error in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159576#M21013</link>
    <description>&lt;P&gt;Hi there, you've run into &lt;A href="https://issues.apache.org/jira/browse/DATAFU-68" target="_blank"&gt;https://issues.apache.org/jira/browse/DATAFU-68&lt;/A&gt;. To fix this, you'll need to use DataFu 1.3. Which version of DataFu are you running?&lt;/P&gt;</description>
    <pubDate>Fri, 08 Apr 2016 00:53:04 GMT</pubDate>
    <dc:creator>cstella</dc:creator>
    <dc:date>2016-04-08T00:53:04Z</dc:date>
    <item>
      <title>Data Munging with Hadoop DataFu Sampling example getting error</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159574#M21011</link>
      <description>&lt;P&gt;It may be my data that causes the problem: I had to create my own table from the downloaded dataset, so I used my own discretion. When I run the following code:&lt;/P&gt;&lt;PRE&gt;DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'medicare_part_b.medicare_part_b_2013_raw' using HCatLoader();
SAMPLE_BY_PROVIDERS = filter ROWS by SampleByKey(npi);
rmf medicare_part_b/ex2_by_npi_sample;
STORE SAMPLE_BY_PROVIDERS into 'medicare_part_b/ex2_by_npi_sample' using PigStorage(',');&lt;/PRE&gt;&lt;P&gt;I get the following error:&lt;/P&gt;&lt;PRE&gt;2016-02-26 02:07:58,053 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
Details at logfile: /root/pig_1456451995224.log
&lt;/PRE&gt;&lt;P&gt;The log file shows this:&lt;/P&gt;&lt;PRE&gt;Pig Stack Trace
---------------
ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS


org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias SAMPLE_BY_PROVIDERS
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1694)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
        at org.apache.pig.Main.run(Main.java:565)
        at org.apache.pig.Main.main(Main.java:177)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.NullPointerException
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:310)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
        at org.apache.pig.PigServer.execute(PigServer.java:1364)
        at org.apache.pig.PigServer.access$500(PigServer.java:113)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1689)
        ... 14 more
Caused by: java.lang.NullPointerException
        at datafu.pig.sampling.SampleByKey.setUDFContextSignature(SampleByKey.java:86)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.setSignature(POUserFunc.java:611)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.instantiateFunc(POUserFunc.java:125)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.&amp;lt;init&amp;gt;(POUserFunc.java:120)
        at org.apache.pig.newplan.logical.expression.ExpToPhyTranslationVisitor.visit(ExpToPhyTranslationVisitor.java:505)
        at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:113)
        at org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
        at org.apache.pig.newplan.logical.relational.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:223)
        at org.apache.pig.newplan.logical.relational.LOFilter.accept(LOFilter.java:79)
        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:260)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:295)
        ... 19 more
&lt;/PRE&gt;&lt;P&gt;This fails in both Tez and MapReduce mode.&lt;/P&gt;&lt;P&gt;Schema for the relation SAMPLE_BY_PROVIDERS:&lt;/P&gt;&lt;PRE&gt;grunt&amp;gt; DESCRIBE SAMPLE_BY_PROVIDERS;
SAMPLE_BY_PROVIDERS: {npi: chararray,nppes_provider_last_org_name: chararray,nppes_provider_first_name: chararray,nppes_provider_mi: chararray,nppes_credentials: chararray,nppes_provider_gender: chararray,nppes_entity_code: chararray,nppes_provider_street1: chararray,nppes_provider_street2: chararray,nppes_provider_city: chararray,nppes_provider_zip: chararray,nppes_provider_state: chararray,nppes_provider_country: chararray,provider_type: chararray,medicare_participation_indicator: chararray,places_of_service: chararray,hcpcs_code: chararray,hcpcs_desc: chararray,hcpcs_drug_indicator: chararray,line_srvc_cnt: int,bene_unique_cnt: int,bene_day_srvc_cnt: int,average_medicare_all_owed_amt: chararray,average_submitted_chrg_amt: chararray,stdev_submitted_chrg_amt: chararray,average_medicare_payment_amt: chararray,stdev_medicare_payment_amt: chararray}
&lt;/PRE&gt;</description>
      <pubDate>Fri, 26 Feb 2016 10:12:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159574#M21011</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-26T10:12:45Z</dc:date>
    </item>
    <item>
      <title>Re: Data Munging with Hadoop DataFu Sampling example getting error</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159575#M21012</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/233/omendelevitch.html" nodeid="233"&gt;@Ofer Mendelevith&lt;/A&gt; &lt;A rel="user" href="https://community.cloudera.com/users/232/scasey.html" nodeid="232"&gt;@scasey&lt;/A&gt; any insight?&lt;/P&gt;</description>
      <pubDate>Fri, 26 Feb 2016 10:13:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159575#M21012</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-26T10:13:26Z</dc:date>
    </item>
    <item>
      <title>Re: Data Munging with Hadoop DataFu Sampling example getting error</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159576#M21013</link>
      <description>&lt;P&gt;Hi there, you've run into &lt;A href="https://issues.apache.org/jira/browse/DATAFU-68" target="_blank"&gt;https://issues.apache.org/jira/browse/DATAFU-68&lt;/A&gt;. To fix this, you'll need to use DataFu 1.3. Which version of DataFu are you running?&lt;/P&gt;</description>
      <pubDate>Fri, 08 Apr 2016 00:53:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159576#M21013</guid>
      <dc:creator>cstella</dc:creator>
      <dc:date>2016-04-08T00:53:04Z</dc:date>
    </item>
    <item>
      <title>Re: Data Munging with Hadoop DataFu Sampling example getting error</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159577#M21014</link>
      <description>&lt;P&gt;Thanks for looking into this, Casey. I was just talking to Ofer about this yesterday. I ran this on Sandbox 2.3.2. My steps are documented here: &lt;A href="https://github.com/dbist/datamunging" target="_blank"&gt;https://github.com/dbist/datamunging&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I would really appreciate some feedback. Also, a TABLESAMPLE query with a percentage does not work on small datasets. I documented a workaround in my README.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Apr 2016 01:01:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159577#M21014</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-04-08T01:01:04Z</dc:date>
    </item>
    <item>
      <title>Re: Data Munging with Hadoop DataFu Sampling example getting error</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159578#M21015</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/597/cstella.html" nodeid="597"&gt;@cstella&lt;/A&gt; I just confirmed that it still fails with DataFu 1.3 on latest Sandbox 2.4 v3.&lt;/P&gt;</description>
      <pubDate>Thu, 21 Apr 2016 08:26:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159578#M21015</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-04-21T08:26:53Z</dc:date>
    </item>
    <item>
      <title>Re: Data Munging with Hadoop DataFu Sampling example getting error</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159579#M21016</link>
      <description>&lt;P&gt;On Sandbox 2.5, DataFu is indeed 1.3. I validated the function, albeit with a different dataset:&lt;/P&gt;&lt;PRE&gt;DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.2');
ROWS = load 'sample_08' using HCatLoader();
SAMPLE_BY_total_emp = filter ROWS by SampleByKey(total_emp);
STORE SAMPLE_BY_total_emp into 'sample_total_emp';
&lt;/PRE&gt;&lt;PRE&gt;[guest@sandbox ~]$ hdfs dfs -cat sample_total_emp/part-v000-o000-r-00000 | head -n 5
11-3011	Administrative services managers	246930	79500
11-9121	Natural sciences managers	43060	123140
13-1032	Insurance appraisers, auto damage	11280	53980
13-1051	Cost estimators	218400	60320
13-1072	Compensation, benefits, and job analysis specialists	116250	57060
&lt;/PRE&gt;</description>
      <pubDate>Tue, 26 Jul 2016 23:19:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Data-Munging-with-Hadoop-DataFu-Sampling-example-getting/m-p/159579#M21016</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-07-26T23:19:32Z</dc:date>
    </item>
  </channel>
</rss>

