Support Questions

lubinlemarchand · ‎05-18-2016

Hi,

I would like to get rid of some duplicates in my Hive table. The duplicates differs by the variable DateExtraction.

DateExtraction is a date formated as "yyyy/MM/dd". In case of duplicate i would like to keep the most recent observation.

Any clues are most welcome 🙂

Here is the query i used to create my table:

DROP TABLE IF EXISTS netodb.oneTable;
CREATE EXTERNAL TABLE IF NOT EXISTS netodb.oneTable(
  Entreprise string,Industrie string,Source string,Type string,Auteur string,
  Id string,Texte string,DateCreation string,Likes bigint,Partages bigint,
  ObjetParent string,Users_mentionned string,Hashtags_uses string,View_count string,
  Dislike_count string,Comment_count string,DateExtraction date) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "|"
LOCATION "/tmp/NetoDB/OneTable" 
TBLPROPERTIES("transactional"="true");

ravi1 · ‎05-18-2016

You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.

Query for view/new table creation would use the below select.

SELECT
    <columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;

View solution in original post

ravi1 · ‎05-18-2016

You can use either a selfjoin or rank to get only the latest extraction date. This can then either go into a view on top of your table or a new table that does not have duplicates.

Query for view/new table creation would use the below select.

SELECT
    <columns>
FROM
(SELECT *, RANK() over (partition by xyz,
order by DateExtraction desc) as rank
FROM onetable) ranked_data
WHERE ranked_data.rank=1;

lubinlemarchand · ‎05-19-2016

Works great. Thanks @Ravi Mutyala

Cloudera Community

Support Questions

Can't delete rows in Hive table + complex hive request