<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to avoid duplicate row insertion in Hive? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-avoid-duplicate-row-insertion-in-Hive/m-p/286392#M212423</link>
    <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/33732"&gt;@Prakashcit&lt;/a&gt;&lt;/P&gt;&lt;P&gt;To discover business insights at a later stage, data from multiple sources is usually ingested wholesale: we dump everything. Validation then consists of comparing the source data with the ingested data to confirm that all of it was pushed, and verifying that the correct data files were generated and loaded into the desired HDFS location.&lt;/P&gt;&lt;P&gt;A &lt;STRONG&gt;smart data lake ingestion tool&lt;/STRONG&gt; or solution like &lt;A href="https://kylo.io/" target="_blank" rel="noopener"&gt;Kylo&lt;/A&gt; should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Datalake1.PNG" style="width: 842px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/25831iB7B20F16E5DB67FC/image-size/large?v=v2&amp;amp;px=999" role="button" title="Datalake1.PNG" alt="Datalake1.PNG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;/landing_Zone/Raw_data/ [&lt;STRONG&gt;corresponding to stage 1&lt;/STRONG&gt;]&lt;/LI&gt;&lt;LI&gt;/landing_Zone/Raw_data/refined [&lt;STRONG&gt;corresponding to stage 2&lt;/STRONG&gt;]&lt;/LI&gt;&lt;LI&gt;/landing_Zone/Raw_data/refined/Trusted Data [&lt;STRONG&gt;corresponding to stage 3&lt;/STRONG&gt;]&lt;/LI&gt;&lt;LI&gt;/landing_Zone/Raw_data/refined/Trusted Data/sandbox [&lt;STRONG&gt;corresponding to stage 4&lt;/STRONG&gt;]&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The data lake can also feed upstream systems such as a real-time monitoring system, or long-term storage such as HDFS or Hive for analytics.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data quality&lt;/STRONG&gt; is often seen as the unglamorous component of working with data. Ironically, it usually makes up the majority of a data engineer's time. Data quality may well be the single most important component of a data pipeline, since without a level of confidence and reliability in your data, the dashboards and analyses generated from it are useless.&lt;/P&gt;&lt;P&gt;The challenge with data quality is that there is no clear, simple formula for determining whether data is correct; it is a continuous data engineering task, growing as more data sources are incorporated into the pipeline.&lt;/P&gt;&lt;P&gt;Typically, &lt;STRONG&gt;Hive is plugged in at stage 3&lt;/STRONG&gt;, and tables are created after the data validation of &lt;STRONG&gt;stage 2&lt;/STRONG&gt;. This ensures that data scientists run their models, and analysts their BI tools, against cleansed data. At least, these have been my tasks across many projects.&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;</description>
    <pubDate>Thu, 26 Dec 2019 20:30:12 GMT</pubDate>
    <dc:creator>Shelton</dc:creator>
    <dc:date>2019-12-26T20:30:12Z</dc:date>
  </channel>
</rss>

