Support Questions

Find answers, ask questions, and share your expertise

Can't delete rows in Hive table + complex hive request

avatar
Expert Contributor

Hi,

I would like to get rid of some duplicates in my Hive table. The duplicates differs by the variable DateExtraction.

DateExtraction is a date formated as "yyyy/MM/dd". In case of duplicate i would like to keep the most recent observation.

Any clues are most welcome 🙂

Here is the query i used to create my table:

DROP TABLE IF EXISTS netodb.oneTable;
CREATE EXTERNAL TABLE IF NOT EXISTS netodb.oneTable(
  Entreprise string,Industrie string,Source string,Type string,Auteur string,
  Id string,Texte string,DateCreation string,Likes bigint,Partages bigint,
  ObjetParent string,Users_mentionned string,Hashtags_uses string,View_count string,
  Dislike_count string,Comment_count string,DateExtraction date) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "|"
LOCATION "/tmp/NetoDB/OneTable" 
TBLPROPERTIES("transactional"="true");
1 ACCEPTED SOLUTION

avatar
Guru

You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.

Query for view/new table creation would use the below select.

SELECT
    <columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;

View solution in original post

2 REPLIES 2

avatar
Guru

You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.

Query for view/new table creation would use the below select.

SELECT
    <columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;

avatar
Expert Contributor

Works great. Thanks @Ravi Mutyala