Any workaround to avoid duplicate records in Impala for a primary key column?
- Labels: Apache Impala
Created on ‎07-14-2017 09:06 PM - edited ‎09-16-2022 04:55 AM
I would appreciate any workaround to avoid duplicate records in Impala for a primary key column.
Created ‎07-15-2017 08:03 PM
Impala does not have a concept of a primary key. However, you have two options.
Down the road, if you want to delete single rows, you can't do that on regular Hive / Impala tables. So you can use the Impala-Kudu format instead: with Kudu you can create a table with a primary key, and you can also perform single-row deletes.
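To make the Kudu option more concrete, here is a minimal sketch, assuming a reasonably recent Impala release with Kudu support (one that accepts STORED AS KUDU and TIMESTAMP columns in Kudu tables). The table name sample_kudu, the choice of (name, street) as the key, and the partition count are illustrative assumptions, not something from this thread.

```sql
-- Hypothetical Kudu-backed table; primary key columns must be listed first.
CREATE TABLE sample_kudu (
  name   STRING,
  street STRING,
  rd123  TIMESTAMP,
  PRIMARY KEY (name, street)
)
PARTITION BY HASH (name, street) PARTITIONS 4  -- partition count is an assumption
STORED AS KUDU;

-- UPSERT inserts a new row, or overwrites the existing row with the same key,
-- so the table never holds two rows for one (name, street) pair.
UPSERT INTO sample_kudu VALUES ('alice', '1 main st', now());
UPSERT INTO sample_kudu VALUES ('alice', '1 main st', now());

-- Single-row delete by key.
DELETE FROM sample_kudu WHERE name = 'alice' AND street = '1 main st';
```

As far as I know, a plain INSERT with an already existing key is skipped with a warning in recent versions rather than stored, so UPSERT is the safer choice when the newest values should win.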
Or, the hard way to achieve this is:
STEP 1
CREATE TABLE Sample ( name STRING, street STRING, RD123 TIMESTAMP )
(Assume RD123 is unique, since we don't have a PK.)
STEP 2
Perform the LOAD DATA into Sample
STEP 3 - Create another table
CREATE TABLE sample_no_dupli AS SELECT name, street, MAX(RD123) AS createdate FROM Sample GROUP BY name, street
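If the staging table has more columns than the ones you group by, a GROUP BY with MAX() gets awkward. A possible variant, sketched here with Impala's ROW_NUMBER() analytic function, keeps the whole latest row per (name, street). It assumes the Sample table from step 1; the alias names ranked and rn are made up for illustration.

```sql
-- Keep only the most recent row for each (name, street) pair.
CREATE TABLE sample_no_dupli AS
SELECT name, street, RD123 AS createdate   -- add any other columns you need here
FROM (
  SELECT name, street, RD123,
         ROW_NUMBER() OVER (PARTITION BY name, street
                            ORDER BY RD123 DESC) AS rn
  FROM Sample
) ranked
WHERE rn = 1;
```

If two rows share the same name, street and RD123, which one gets rn = 1 is arbitrary, so this still relies on RD123 being unique within each group.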
Created ‎07-14-2017 11:10 PM
Are you asking about insertion or retrieval of data?
Created ‎07-14-2017 11:42 PM
I'm thinking of avoiding duplicates during insertion, if this won't cause a performance issue.
Created ‎07-15-2017 04:42 PM
https://www.cloudera.com/documentation/enterprise/5-3-x/topics/impala_porting.html
According to the above link: "Take out any CREATE INDEX, DROP INDEX, and ALTER INDEX statements, and equivalent ALTER TABLE statements. Remove any INDEX, KEY, or PRIMARY KEY clauses from CREATE TABLE and ALTER TABLE statements. Impala is optimized for bulk read operations for data warehouse-style queries, and therefore does not support indexes for its tables."
So yes, in general you cannot have both bulk-read performance and indexing. If possible, try to control duplicates in the source (SELECT) portion instead of the target (INSERT) portion.
Ex:
insert into table trg_table
select * from src_table
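One way to do the filtering on the SELECT side is sketched below. It assumes both tables have the hypothetical columns id and name, with id acting as the logical key; neither column name comes from this thread. The inner DISTINCT removes exact repeats inside the batch, and the LEFT ANTI JOIN skips keys that are already in the target.

```sql
-- Insert only rows whose key is not in trg_table yet.
INSERT INTO trg_table
SELECT s.id, s.name
FROM (
  SELECT DISTINCT id, name   -- drops fully identical rows within the batch
  FROM src_table
) s
LEFT ANTI JOIN trg_table t   -- keeps rows from s that have no match in t
  ON s.id = t.id;
```

DISTINCT only removes rows that are identical in every column, so two source rows with the same id but different name would both get through. The join also scans the target on every load, which is the performance cost to weigh for large tables.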
Created ‎07-15-2017 08:58 PM
Created ‎07-15-2017 09:01 PM
csguna, I appreciate your input. Will try this.
Created ‎07-15-2017 09:38 PM
@Msdhan You're welcome :))
