Support Questions
Find answers, ask questions, and share your expertise

Can't delete rows in Hive table + complex hive request

Solved Go to solution

Can't delete rows in Hive table + complex hive request

Rising Star

Hi,

I would like to get rid of some duplicates in my Hive table. The duplicates differs by the variable DateExtraction.

DateExtraction is a date formated as "yyyy/MM/dd". In case of duplicate i would like to keep the most recent observation.

Any clues are most welcome :)

Here is the query i used to create my table:

DROP TABLE IF EXISTS netodb.oneTable;
CREATE EXTERNAL TABLE IF NOT EXISTS netodb.oneTable(
  Entreprise string,Industrie string,Source string,Type string,Auteur string,
  Id string,Texte string,DateCreation string,Likes bigint,Partages bigint,
  ObjetParent string,Users_mentionned string,Hashtags_uses string,View_count string,
  Dislike_count string,Comment_count string,DateExtraction date) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "|"
LOCATION "/tmp/NetoDB/OneTable" 
TBLPROPERTIES("transactional"="true");
1 ACCEPTED SOLUTION

Accepted Solutions

Re: Can't delete rows in Hive table + complex hive request

Guru

You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.

Query for view/new table creation would use the below select.

SELECT
    <columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;

View solution in original post

2 REPLIES 2

Re: Can't delete rows in Hive table + complex hive request

Guru

You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.

Query for view/new table creation would use the below select.

SELECT
    <columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;

View solution in original post

Re: Can't delete rows in Hive table + complex hive request

Rising Star

Works great. Thanks @Ravi Mutyala