Support Questions

How should one handle de-duplication of data?

1 ACCEPTED SOLUTION

Master Guru

@milind pandit That's a loaded question. First you have to define what the unique entity is. Once that is settled, you can use a tool like Pig to parse through the data and produce a single record per entity. You can also do this in Hive with a GROUP BY statement on your natural key, which gives you a single record per key from the source. Lastly, you can use tools like Informatica or Talend to do the same.
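
To make the "group by natural key" idea concrete, here is a rough sketch in plain Python rather than Pig or Hive. The field names (customer_id, email, updated_at) and the rule of keeping the most recently updated record are assumptions for illustration only; your own definition of the unique entity decides both the key and which record wins.

```python
def deduplicate(records, key_fields, order_field):
    """Keep one record per natural key: the one with the largest order_field.

    key_fields and order_field are hypothetical names chosen by the caller.
    """
    best = {}
    for rec in records:
        # Build the natural key from the chosen fields.
        key = tuple(rec[f] for f in key_fields)
        current = best.get(key)
        # Assumed tie-break rule: a strictly newer record replaces the kept one.
        if current is None or rec[order_field] > current[order_field]:
            best[key] = rec
    return list(best.values())


# Made-up sample data: customer 1 appears twice.
rows = [
    {"customer_id": 1, "email": "a@example.com", "updated_at": "2024-01-01"},
    {"customer_id": 1, "email": "a.new@example.com", "updated_at": "2024-03-01"},
    {"customer_id": 2, "email": "b@example.com", "updated_at": "2024-02-01"},
]

unique_rows = deduplicate(rows, ["customer_id"], "updated_at")
for row in unique_rows:
    print(row)
```

The same shape carries over to Hive or Pig: group on the natural key, then pick one row per group by whatever rule fits your data.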

