- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
How to delete duplicate data in hive table with partition?
- Labels:
-
Apache Hive
Created ‎07-26-2023 08:20 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
The duplicate data from 2023-03-26 to 2023-07-10 must be removed.
I'm trying to remove duplicates from the table using this command, but I'm getting an error.
Commands:
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE db.table_name PARTITION(dt) select distinct * from db.table_name where dt >= '2023-03-26' AND dt >= '2023-07-10';
Error:
23/07/26 16:07:46 [LocalJobRunner Map Task Executor #0]: WARN io.CombineHiveRecordReader: Multiple partitions found; not going to pass a part spec to LLAP IO: {{dt=2023-07-10}} and {{dt=2023-07-11}} 2023-07-26 16:07:47,952 Stage-1 map = 0%, reduce = 0% 23/07/26 16:07:47 [aabca681-0714-44f6-bc8d-9be6d7fca9fc main]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead.
Note: The partition of this table is just the dates.
Example:
show partitions db.table_name;
dt=2023-07-04
dt=2023-07-05
dt=2023-07-06
dt=2023-07-07
dt=2023-07-08
dt=2023-07-09
dt=2023-07-10
$ hive --version
Hive 2.3.3
Hope you can suggest on this one.
Thank you!
Created ‎07-31-2023 11:56 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Noel_0317,
The error indicates that there are multiple partitions in the where condition. Can you try the below query:
INSERT OVERWRITE TABLE db.table_name PARTITION(dt='2023-03-26') select distinct * from db.table_name where dt = '2023-03-26';
Let us know how it goes.
Cheers!
