Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Can't delete rows in Hive table + complex hive request

avatar
Expert Contributor

Hi,

I would like to get rid of some duplicates in my Hive table. The duplicates differs by the variable DateExtraction.

DateExtraction is a date formated as "yyyy/MM/dd". In case of duplicate i would like to keep the most recent observation.

Any clues are most welcome 🙂

Here is the query i used to create my table:

DROP TABLE IF EXISTS netodb.oneTable;
CREATE EXTERNAL TABLE IF NOT EXISTS netodb.oneTable(
  Entreprise string,Industrie string,Source string,Type string,Auteur string,
  Id string,Texte string,DateCreation string,Likes bigint,Partages bigint,
  ObjetParent string,Users_mentionned string,Hashtags_uses string,View_count string,
  Dislike_count string,Comment_count string,DateExtraction date) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "|"
LOCATION "/tmp/NetoDB/OneTable" 
TBLPROPERTIES("transactional"="true");
1 ACCEPTED SOLUTION

avatar
Guru

You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.

Query for view/new table creation would use the below select.

SELECT
    <columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;

View solution in original post

2 REPLIES 2

avatar
Guru

You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.

Query for view/new table creation would use the below select.

SELECT
    <columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;

avatar
Expert Contributor

Works great. Thanks @Ravi Mutyala