<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How one should handle de-duplication of data? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-one-should-handle-de-duplication-of-data/m-p/120016#M82794</link>
    <description>&lt;P&gt; &lt;A rel="user" href="https://community.cloudera.com/users/9842/mpandit.html" nodeid="9842"&gt;@milind pandit&lt;/A&gt; That's a loaded question. First, you have to define what the unique entity is. Once that is settled, you can use a tool such as Pig to parse through the data and return a single record per entity. You can also do this in Hive by using a GROUP BY statement on your natural key to return a single record from the source. Lastly, you can use tools like Informatica or Talend to do the same.&lt;/P&gt;</description>
    <pubDate>Fri, 26 Aug 2016 03:16:38 GMT</pubDate>
    <dc:creator>sunile_manjee</dc:creator>
    <dc:date>2016-08-26T03:16:38Z</dc:date>
    <item>
      <title>How one should handle de-duplication of data?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-one-should-handle-de-duplication-of-data/m-p/120015#M82793</link>
      <description />
      <pubDate>Fri, 26 Aug 2016 02:13:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-one-should-handle-de-duplication-of-data/m-p/120015#M82793</guid>
      <dc:creator>mpandit</dc:creator>
      <dc:date>2016-08-26T02:13:01Z</dc:date>
    </item>
    <item>
      <title>Re: How one should handle de-duplication of data?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-one-should-handle-de-duplication-of-data/m-p/120016#M82794</link>
      <description>&lt;P&gt; &lt;A rel="user" href="https://community.cloudera.com/users/9842/mpandit.html" nodeid="9842"&gt;@milind pandit&lt;/A&gt; That's a loaded question. First, you have to define what the unique entity is. Once that is settled, you can use a tool such as Pig to parse through the data and return a single record per entity. You can also do this in Hive by using a GROUP BY statement on your natural key to return a single record from the source. Lastly, you can use tools like Informatica or Talend to do the same.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Aug 2016 03:16:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-one-should-handle-de-duplication-of-data/m-p/120016#M82794</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-08-26T03:16:38Z</dc:date>
    </item>
  </channel>
</rss>

