Support Questions
Find answers, ask questions, and share your expertise

How to avoid duplicate row insertion in Hive?

How to avoid duplicate row insertion in Hive?

Contributor

CREATE TABLE Test( recordId String,
Advertiser_id STRING ,
Advertiser_name STRING,
PRIMARY KEY (recordId) DISABLE NOVALIDATE);

 

insert into Test values('1','prakash','vtv');

insert into Test values('1','prakash','vtv');

 

Test.recordid | Test.advertiser_id | Test.advertiser_name |
+-----------------------------+----------------------------------+------------------------------------+
| 1 | prakash | vtv |
| 1 | prakash | vtv |
+-----------------------------+----------------------------------+------------------------------------+

 

As per RDBMS terms Primary key constraint shouldn't allow me to create duplicate record with the same value.

 

How to achieve this in Hive?

 

 

 

4 REPLIES 4
Highlighted

Re: How to avoid duplicate row insertion in Hive?

Mentor

@Prakashcit 

That by design:

A NOVALIDATE constraint is basically a constraint that can be enabled but for which hive will not check the existing data to determine whether there might be data that currently violate the constraint.

This is useful if we know there’s data that violates the constraint but we want to quickly put on a constraint to prevent further violations, with the intention to clean up any possible violations at some future point in time.

It’s also potentially useful if we know the data is clean and so want to prevent the potentially significant overheads of hive having to check all the data to ensure there are indeed no violations.

Highlighted

Re: How to avoid duplicate row insertion in Hive?

Contributor

@Shelton  Thanks for your response and understood your point Hive will not check existing data in case if NOVALIDATE constraint is specified.

 

Is there any other way that i can insert only unique values for a particular column in Hive? 

Highlighted

Re: How to avoid duplicate row insertion in Hive?

Contributor

There are few BI tools which can be used as a layer on top of Hive which can make use of pk constraints. 

AFAIK Hive does not check/validate pk constraints  

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_constraints.html

 

It can be added in table definition so that BI tools can make efficient query decisions and perform checks on key constraints. 

Highlighted

Re: How to avoid duplicate row insertion in Hive?

Mentor

@Prakashcit 

To ensure data from multiple data sources are ingested to discover at a later stage business insights, usually we dump everything. Comparison of source data with data ingested to simply validate that all the data has been pushed and verifying that correct data files are generated and loaded into HDFS correctly into the desired location.

 

A smart data lake ingestion tool or solution like kylo should enable self-service data ingestion, data wrangling, data profiling, data validation, data cleansing/standardization,see attached architecture

Datalake1.PNG

  • /landing_Zone/Raw_data/                                                                     [ Corresponding to stage1]
  • /landing_Zone/Raw_data/refined                                                      [ Corresponding to stage2]
  • /landing_Zone/Raw_data/refined/Trusted Data                       [ Corresponding to stage3]
  • /landing_Zone/Raw_data/refined/Trusted Data/sandbox   [ Corresponding to stage4]

The data lake can be used also to feed upstream systems for a real-time monitoring system or long storage like HDFS or hive for analytics

Data quality is often seen as the unglamorous component of working with data. Ironically, it’s usually the component that makes up the majority of our time of data engineers. Data quality might very well be the single most important component of a data pipeline, since, without a level of confidence and reliability in your data, the dashboard and analysis generated from the data is useless.

 

The challenge with data quality is that there are no clear and simple formulas for determining if data is correct this is a continuous data engineering task as more data sources are incorporated to the data pipeline.

 

Typically hive plugged on stage 3 and tables are created  after the data validation of stage 2 this ensures that data scientists have cleansed data to run their models and analysts using BI tools at least this has been the tasks I have done all through many projects

HTH