<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Big Data Analytics - Approach for Data Quality phase in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158508#M36508</link>
    <description>&lt;P&gt;Paste the error you're getting&lt;/P&gt;</description>
    <pubDate>Tue, 02 Aug 2016 06:53:59 GMT</pubDate>
    <dc:creator>aervits</dc:creator>
    <dc:date>2016-08-02T06:53:59Z</dc:date>
    <item>
      <title>Big Data Analytics - Approach for Data Quality phase</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158505#M36505</link>
      <description>&lt;P&gt;I'm doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to run some ETL jobs in Pig to clean my data, and I put the transformed files into a new directory in HDFS. To ensure that all the files are in the correct form, I want to create some data quality checks in Java or Python. I tried to use Pig UDFs to achieve this, but I couldn't connect the JAR file with Pig. Since I can't use Pig UDFs, I'm planning a new approach for the data quality phase:

1) Run the Pig scripts to clean the data and write the new files into a new directory in HDFS
2) Have a standalone Java/Python program read the new files and perform the data quality checks
3) If the data quality tests pass, load the files into Hive

In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what is a good alternative for performing data quality jobs in this project?

Many thanks for your help!&lt;/P&gt;</description>
      <pubDate>Mon, 01 Aug 2016 02:23:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158505#M36505</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-08-01T02:23:34Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data Analytics - Approach for Data Quality phase</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158506#M36506</link>
      <description>&lt;P&gt;Running UDFs in Pig is what Pig is for. You should fix that problem. Have you registered your JARs?&lt;/P&gt;&lt;P&gt;&lt;A href="http://pig.apache.org/docs/r0.16.0/udf.html#udf-java" target="_blank"&gt;http://pig.apache.org/docs/r0.16.0/udf.html#udf-java&lt;/A&gt;&lt;/P&gt;&lt;P&gt;There are other possibilities as well. Spark comes to mind, especially with Python, where it can be relatively easy to set up (although it has its own problems, such as Python versions). There are also some ETL tools that can utilize Hadoop. But by and large, Pig with Java UDFs is a very straightforward way to do custom data cleaning on data in Hadoop. There is no reason you shouldn't get it to work.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Aug 2016 17:49:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158506#M36506</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-08-01T17:49:27Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data Analytics - Approach for Data Quality phase</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158507#M36507</link>
      <description>&lt;P&gt;Hi Benjamin,&lt;/P&gt;&lt;P&gt;I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Aug 2016 17:50:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158507#M36507</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-08-01T17:50:29Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data Analytics - Approach for Data Quality phase</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158508#M36508</link>
      <description>&lt;P&gt;Paste the error you're getting&lt;/P&gt;</description>
      <pubDate>Tue, 02 Aug 2016 06:53:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158508#M36508</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-08-02T06:53:59Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data Analytics - Approach for Data Quality phase</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158509#M36509</link>
      <description>&lt;P&gt;If I put the Python code in a file.py in my HDFS, I can run Python UDFs, but with Java I'm getting an error... I think I'm not getting all the files.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2016 15:07:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158509#M36509</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-08-03T15:07:24Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data Analytics - Approach for Data Quality phase</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158510#M36510</link>
      <description>&lt;P&gt;Yeah, but without the error we cannot really help. I suppose you mean a ClassNotFoundException? So your UDF uses a lot of exotic imports?&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2016 21:56:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158510#M36510</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-08-03T21:56:53Z</dc:date>
    </item>
    <item>
      <title>Re: Big Data Analytics - Approach for Data Quality phase</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158511#M36511</link>
      <description>&lt;P&gt;I was missing some JAR files &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2016 17:46:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Big-Data-Analytics-Approach-for-Data-Quality-phase/m-p/158511#M36511</guid>
      <dc:creator>prodgers125</dc:creator>
      <dc:date>2016-08-04T17:46:24Z</dc:date>
    </item>
  </channel>
</rss>

