
Any work round to avoid duplicate records in impala for Primary key column

Msdhan — Fri, 16 Sep 2022 11:55:54 GMT

I would appreciate any workaround to avoid duplicate records in Impala for a primary key column.

Re: Any work round to avoid duplicate records in impala for Primary key column

csguna — Sat, 15 Jul 2017 06:10:11 GMT

Are you asking about insertion or retrieval of data?

Re: Any work round to avoid duplicate records in impala for Primary key column

Msdhan — Sat, 15 Jul 2017 06:42:17 GMT

I am thinking of avoiding duplicates during insertion, as long as this won't cause a performance issue.

Re: Any work round to avoid duplicate records in impala for Primary key column

saranvisa — Sat, 15 Jul 2017 23:42:25 GMT

@Msdhan

https://www.cloudera.com/documentation/enterprise/5-3-x/topics/impala_porting.html

According to the above link: "Take out any CREATE INDEX, DROP INDEX, and ALTER INDEX statements, and equivalent ALTER TABLE statements. Remove any INDEX, KEY, or PRIMARY KEY clauses from CREATE TABLE and ALTER TABLE statements. Impala is optimized for bulk read operations for data warehouse-style queries, and therefore does not support indexes for its tables."

So in general, you cannot have both performance and indexing. If possible, try to control duplicates in the source (SELECT) portion instead of the target (INSERT) portion.

Ex:

insert into table trg_table
select distinct * from src_table
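To make "control duplicates in the source" more concrete, here is a rough sketch of one way it might look. It assumes a hypothetical key column named id that exists in both tables; that column is not part of the original example, so substitute whatever identifies a row in your data.

```sql
-- Sketch only: "id" is a hypothetical key column, not from the original post.
-- DISTINCT drops rows that are identical in every column within src_table.
-- LEFT ANTI JOIN keeps only source rows whose id is not already in trg_table.
INSERT INTO trg_table
SELECT DISTINCT s.*
FROM src_table s
LEFT ANTI JOIN trg_table t
  ON s.id = t.id;
```

Note that DISTINCT only removes rows that match in every column. If the source can hold two different rows with the same id, both would still be inserted, so a GROUP BY or ROW_NUMBER() filter on the key would be needed as well.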

Re: Any work round to avoid duplicate records in impala for Primary key column

csguna — Sun, 16 Jul 2017 03:03:07 GMT

Impala does not have a concept of a primary key. However, you have two options.

Down the road, if you want to delete single rows, you can't do that on regular Hive / Impala tables. So you can use Impala with the Kudu storage format instead. With Kudu you can create a table with a primary key, and you can also perform single-row deletes.
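A minimal sketch of the Kudu option, assuming an Impala release with Kudu support and the PARTITION BY syntax (older releases used a different clause). The table name sample_kudu, the id column, and the partition count are all illustrative choices, not something from this thread.

```sql
-- Sketch only: table name, id column and partition count are hypothetical.
CREATE TABLE sample_kudu (
  id BIGINT,
  name STRING,
  street STRING,
  PRIMARY KEY (id)          -- primary key columns must be listed first
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- UPSERT inserts the row, or updates it if the key already exists.
UPSERT INTO sample_kudu VALUES (1, 'Alice', 'Main St');

-- Single-row delete by primary key.
DELETE FROM sample_kudu WHERE id = 1;
```

With a Kudu table, a plain INSERT of a row whose key already exists should be discarded with a warning rather than creating a duplicate, so the key stays unique either way.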

Or, the hard way to achieve this is:

STEP 1

CREATE TABLE Sample
(
    name STRING,
    street STRING,
    RD123 TIMESTAMP  -- assume this is unique, since we don't have a PK
)

STEP 2
 
Load the data into Sample with a LOAD DATA statement.

STEP 3 - Create another table 

CREATE TABLE sample_no_dupli AS
SELECT name, street, MAX(RD123) AS createdate
FROM Sample
GROUP BY name, street
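If the table has more columns than the ones in the GROUP BY, a window function may be easier, because it keeps the whole latest row instead of aggregating each column. A rough sketch, assuming a hypothetical extra column phone and a hypothetical target table sample_latest, neither of which appears in the steps above:

```sql
-- Sketch only: "phone" and "sample_latest" are hypothetical names.
-- Keeps the most recent row (by RD123) for each (name, street) pair.
CREATE TABLE sample_latest AS
SELECT name, street, phone, RD123
FROM (
  SELECT name, street, phone, RD123,
         ROW_NUMBER() OVER (PARTITION BY name, street
                            ORDER BY RD123 DESC) AS rn
  FROM Sample
) ranked
WHERE rn = 1;
```

If two rows share the same name, street, and RD123, ROW_NUMBER() picks one of them arbitrarily, so add a tie-breaker to the ORDER BY if that matters.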

Re: Any work round to avoid duplicate records in impala for Primary key column

Msdhan — Sun, 16 Jul 2017 03:58:11 GMT

Thanks, Saranvisa, for this explanation.

Re: Any work round to avoid duplicate records in impala for Primary key column

Msdhan — Sun, 16 Jul 2017 04:01:46 GMT

csguna, I appreciate your input. Will try this.

Re: Any work round to avoid duplicate records in impala for Primary key column

csguna — Sun, 16 Jul 2017 04:38:54 GMT

@Msdhan You're welcome :))