Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to delete duplicate data in hive table with partition?

avatar
Contributor

Hello,

 

The duplicate data from 2023-03-26 to 2023-07-10 must be removed.

I'm trying to remove duplicates from the table using this command, but I'm getting an error.


Commands:

set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE db.table_name PARTITION(dt) select distinct * from db.table_name where dt >= '2023-03-26' AND dt >= '2023-07-10';

Error: 
23/07/26 16:07:46 [LocalJobRunner Map Task Executor #0]: WARN io.CombineHiveRecordReader: Multiple partitions found; not going to pass a part spec to LLAP IO: {{dt=2023-07-10}} and {{dt=2023-07-11}} 2023-07-26 16:07:47,952 Stage-1 map = 0%, reduce = 0% 23/07/26 16:07:47 [aabca681-0714-44f6-bc8d-9be6d7fca9fc main]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead.

Note: The partition of this table is just the dates.

Example: 

show partitions db.table_name;

dt=2023-07-04
dt=2023-07-05
dt=2023-07-06
dt=2023-07-07
dt=2023-07-08
dt=2023-07-09
dt=2023-07-10

 

$ hive --version
Hive 2.3.3

 

Hope you can suggest on this one.

 

Thank you!

 

1 REPLY 1

avatar
Expert Contributor

Hi @Noel_0317,

The error indicates that there are multiple partitions in the where condition. Can you try the below query:

INSERT OVERWRITE TABLE db.table_name PARTITION(dt='2023-03-26') select distinct * from db.table_name where dt = '2023-03-26';

 

Let us know how it goes.

 

Cheers!