Created on 12-20-2019 04:27 AM - last edited on 12-20-2019 06:22 AM by VidyaSargur
CREATE TABLE Test (
  recordId STRING,
  Advertiser_id STRING,
  Advertiser_name STRING,
  PRIMARY KEY (recordId) DISABLE NOVALIDATE
);
insert into Test values('1','prakash','vtv');
insert into Test values('1','prakash','vtv');
+----------------+---------------------+-----------------------+
| Test.recordid  | Test.advertiser_id  | Test.advertiser_name  |
+----------------+---------------------+-----------------------+
| 1              | prakash             | vtv                   |
| 1              | prakash             | vtv                   |
+----------------+---------------------+-----------------------+
In RDBMS terms, a primary key constraint shouldn't allow me to create a duplicate record with the same key value.
How can I achieve this in Hive?
Created on 12-22-2019 10:13 AM - edited 12-22-2019 10:17 AM
That is by design:
A NOVALIDATE constraint is a constraint that can be enabled, but for which Hive will not check the existing data to determine whether there is data that currently violates the constraint.
This is useful if we know there is data that violates the constraint but we want to quickly put a constraint in place to prevent further violations, with the intention to clean up any existing violations at some future point in time.
It's also potentially useful if we know the data is clean and so want to avoid the potentially significant overhead of Hive having to check all the data to ensure there are indeed no violations.
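Since Hive will not validate the existing rows, any cleanup of violations has to be done manually. A minimal sketch against the Test table above to list keys that already violate the declared (but unenforced) primary key:

-- Find recordId values that appear more than once
SELECT recordId, COUNT(*) AS cnt
FROM Test
GROUP BY recordId
HAVING COUNT(*) > 1;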
Created 12-23-2019 02:06 AM
@Shelton Thanks for your response; understood your point that Hive will not check existing data when a NOVALIDATE constraint is specified.
Is there any other way that I can insert only unique values for a particular column in Hive?
Created 12-26-2019 06:54 AM
There are a few BI tools that can be used as a layer on top of Hive and that can make use of PK constraints.
AFAIK Hive does not check/validate PK constraints:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_constraints.html
The constraint can be added in the table definition so that BI tools can make efficient query decisions and perform checks on key constraints.
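Since Hive itself will not enforce uniqueness, one common workaround is to deduplicate at insert time. A minimal sketch, assuming new rows land in a staging table Test_stage (hypothetical name, same schema as Test) and ignoring concurrent writers:

-- Insert only rows whose recordId is not already present in Test (anti-join pattern)
INSERT INTO TABLE Test
SELECT s.recordId, s.Advertiser_id, s.Advertiser_name
FROM Test_stage s
LEFT JOIN Test t
  ON s.recordId = t.recordId
WHERE t.recordId IS NULL;

Note that this does not guard against duplicates within Test_stage itself; those would need a GROUP BY or ROW_NUMBER() step first.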
Created 12-26-2019 12:30 PM
To ensure data from multiple sources is ingested so that business insights can be discovered at a later stage, we usually dump everything. The source data is then compared with the ingested data simply to validate that everything has been pushed, and to verify that the correct data files are generated and loaded into HDFS in the desired location.
A smart data lake ingestion tool or solution like Kylo should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture.
The data lake can also be used to feed upstream systems such as a real-time monitoring system, or long-term storage like HDFS or Hive for analytics.
Data quality is often seen as the unglamorous component of working with data. Ironically, it usually makes up the majority of a data engineer's time. Data quality might very well be the single most important component of a data pipeline, since without a level of confidence and reliability in your data, the dashboards and analysis generated from that data are useless.
The challenge with data quality is that there are no clear and simple formulas for determining whether data is correct; it is a continuous data engineering task as more data sources are incorporated into the pipeline.
Typically Hive is plugged in at stage 3, and tables are created after the data validation of stage 2. This ensures that data scientists have cleansed data to run their models on and that analysts can use BI tools; at least these have been the tasks I have done throughout many projects.
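As part of that cleansing stage, a deduplication step can materialize a curated table with exactly one row per key. A minimal sketch, assuming a curated table name Test_clean (hypothetical):

-- Keep one row per recordId when building the curated table
CREATE TABLE Test_clean AS
SELECT recordId, Advertiser_id, Advertiser_name
FROM (
  SELECT recordId, Advertiser_id, Advertiser_name,
         ROW_NUMBER() OVER (PARTITION BY recordId ORDER BY recordId) AS rn
  FROM Test
) deduped
WHERE rn = 1;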
HTH