Created 05-18-2016 03:16 PM
Hi,
I would like to get rid of some duplicates in my Hive table. The duplicates differs by the variable DateExtraction.
DateExtraction is a date formated as "yyyy/MM/dd". In case of duplicate i would like to keep the most recent observation.
Any clues are most welcome 🙂
Here is the query i used to create my table:
DROP TABLE IF EXISTS netodb.oneTable; CREATE EXTERNAL TABLE IF NOT EXISTS netodb.oneTable( Entreprise string,Industrie string,Source string,Type string,Auteur string, Id string,Texte string,DateCreation string,Likes bigint,Partages bigint, ObjetParent string,Users_mentionned string,Hashtags_uses string,View_count string, Dislike_count string,Comment_count string,DateExtraction date) ROW FORMAT DELIMITED FIELDS TERMINATED BY "|" LOCATION "/tmp/NetoDB/OneTable" TBLPROPERTIES("transactional"="true");
Created 05-18-2016 03:41 PM
You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.
Query for view/new table creation would use the below select.
SELECT <columns> FROM (SELECT *, RANK() over (partition by xyz, order by DateExtraction desc) as rank FROM onetable) ranked_data WHERE ranked_data.rank=1;
Created 05-18-2016 03:41 PM
You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.
Query for view/new table creation would use the below select.
SELECT <columns> FROM (SELECT *, RANK() over (partition by xyz, order by DateExtraction desc) as rank FROM onetable) ranked_data WHERE ranked_data.rank=1;
Created 05-19-2016 08:51 AM
Works great. Thanks @Ravi Mutyala