<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to avoid duplicate row insertion in Hive? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-avoid-duplicate-row-insertion-in-Hive/m-p/286392#M212423</link>
    <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/33732"&gt;@Prakashcit&lt;/a&gt;&lt;/P&gt;&lt;P&gt;To discover business insights at a later stage, data from multiple sources is usually ingested wholesale: we dump everything. Validation then consists of comparing the source data with the ingested data to confirm that all of it was pushed, and verifying that the correct data files were generated and loaded into the desired HDFS location.&lt;/P&gt;&lt;P&gt;A &lt;STRONG&gt;smart data lake ingestion tool&lt;/STRONG&gt; or solution like &lt;A href="https://kylo.io/" target="_blank" rel="noopener"&gt;Kylo&lt;/A&gt; should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Datalake1.PNG" style="width: 842px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/25831iB7B20F16E5DB67FC/image-size/large?v=v2&amp;amp;px=999" role="button" title="Datalake1.PNG" alt="Datalake1.PNG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;/landing_Zone/Raw_data/ [&lt;STRONG&gt;corresponding to stage 1&lt;/STRONG&gt;]&lt;/LI&gt;&lt;LI&gt;/landing_Zone/Raw_data/refined [&lt;STRONG&gt;corresponding to stage 2&lt;/STRONG&gt;]&lt;/LI&gt;&lt;LI&gt;/landing_Zone/Raw_data/refined/Trusted Data [&lt;STRONG&gt;corresponding to stage 3&lt;/STRONG&gt;]&lt;/LI&gt;&lt;LI&gt;/landing_Zone/Raw_data/refined/Trusted Data/sandbox [&lt;STRONG&gt;corresponding to stage 4&lt;/STRONG&gt;]&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The data lake can also feed upstream systems such as a real-time monitoring system, or long-term storage such as HDFS or Hive for analytics.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data quality&lt;/STRONG&gt; is often seen as the unglamorous component of working with data. Ironically, it usually makes up the majority of a data engineer's time. Data quality may well be the single most important component of a data pipeline, since without a level of confidence and reliability in your data, the dashboards and analyses generated from it are useless.&lt;/P&gt;&lt;P&gt;The challenge with data quality is that there is no clear, simple formula for determining whether data is correct; it is a continuous data engineering task, growing as more data sources are incorporated into the pipeline.&lt;/P&gt;&lt;P&gt;Typically, &lt;STRONG&gt;Hive is plugged in at stage 3&lt;/STRONG&gt;, and tables are created after the data validation of &lt;STRONG&gt;stage 2&lt;/STRONG&gt;. This ensures that data scientists run their models, and analysts their BI tools, against cleansed data. At least, these have been my tasks across many projects.&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;</description>
    <pubDate>Thu, 26 Dec 2019 20:30:12 GMT</pubDate>
    <dc:creator>Shelton</dc:creator>
    <dc:date>2019-12-26T20:30:12Z</dc:date>
  </channel>
</rss>

