Created 05-18-2016 03:16 PM
Hi,
I would like to get rid of some duplicates in my Hive table. The duplicates differs by the variable DateExtraction.
DateExtraction is a date formated as "yyyy/MM/dd". In case of duplicate i would like to keep the most recent observation.
Any clues are most welcome 🙂
Here is the query i used to create my table:
DROP TABLE IF EXISTS netodb.oneTable;
CREATE EXTERNAL TABLE IF NOT EXISTS netodb.oneTable(
Entreprise string,Industrie string,Source string,Type string,Auteur string,
Id string,Texte string,DateCreation string,Likes bigint,Partages bigint,
ObjetParent string,Users_mentionned string,Hashtags_uses string,View_count string,
Dislike_count string,Comment_count string,DateExtraction date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "|"
LOCATION "/tmp/NetoDB/OneTable"
TBLPROPERTIES("transactional"="true");
Created 05-18-2016 03:41 PM
You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.
Query for view/new table creation would use the below select.
SELECT
<columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;
Created 05-18-2016 03:41 PM
You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.
Query for view/new table creation would use the below select.
SELECT
<columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;
Created 05-19-2016 08:51 AM
Works great. Thanks @Ravi Mutyala