Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

How should one handle de-duplication of data?

1 ACCEPTED SOLUTION

Master Guru

@milind pandit That's a loaded question. First, you have to define what the unique entity is. Once that is settled, you can use a tool such as Pig to parse the data and produce a single record per entity. You can also do this in Hive with a GROUP BY statement on your natural key, which returns a single record per key from the source. Lastly, you can use tools such as Informatica or Talend to do the same.
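
To make the idea concrete, here is a minimal Python sketch of de-duplicating on a natural key. It is not Pig or Hive code, just an illustration of the same logic a GROUP BY on the natural key performs. The field names (`customer_id`, `email`, `updated`), the sample rows, and the "keep the first record seen" rule are all assumptions made for the example; your own definition of the unique entity decides which fields form the key and which record wins.

```python
def dedupe_by_key(records, key_fields):
    """Return one record per natural key, keeping the first one seen.

    `key_fields` is the (hypothetical) list of columns that together
    identify a unique entity.
    """
    seen = {}
    for rec in records:
        # Build the natural key as a tuple of the chosen field values.
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen[key] = rec
    # Dicts preserve insertion order, so output follows first appearance.
    return list(seen.values())


# Hypothetical sample data: two rows describe the same customer.
records = [
    {"customer_id": 1, "email": "a@example.com", "updated": "2016-01-01"},
    {"customer_id": 1, "email": "a@example.com", "updated": "2016-02-01"},
    {"customer_id": 2, "email": "b@example.com", "updated": "2016-01-15"},
]

unique = dedupe_by_key(records, ["customer_id", "email"])
```

If you would rather keep the most recent record instead of the first, sort the input by the timestamp in descending order before calling the function.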

