How should one handle de-duplication of data?


Re: How should one handle de-duplication of data?


@milind pandit This is a loaded question. First you have to define what the unique entity is, that is, which fields make up its natural key. Once that is settled, you can use a tool like Pig to parse through the data and produce a single record per entity. You can also do this in Hive by using a GROUP BY statement on your natural key to get a single record per key from the source. Lastly, you can use ETL tools like Informatica or Talend to do the same.
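
To make the "group on the natural key, keep one record" idea concrete, here is a minimal sketch in plain Python rather than Pig or Hive. Everything in it is an assumption for illustration: the function name `deduplicate`, the field names `customer_id`, `email` and `updated_at`, and the rule that the most recently updated record wins. Your own definition of the unique entity decides both the key and the tie-break rule.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Record = Dict[str, object]


def deduplicate(
    records: Iterable[Record],
    key_fields: Tuple[str, ...],
    order_field: str,
) -> List[Record]:
    """Keep one record per natural key: the one with the largest order_field.

    key_fields and order_field are hypothetical names chosen by the caller.
    """
    best: Dict[Tuple[Hashable, ...], Record] = {}
    for rec in records:
        # The natural key is the tuple of values that identifies the entity.
        key = tuple(rec[f] for f in key_fields)
        current = best.get(key)
        # Assumed survivorship rule: the record with the highest order value wins.
        if current is None or rec[order_field] > current[order_field]:
            best[key] = rec
    return list(best.values())


# Illustrative data only; not taken from the original question.
rows = [
    {"customer_id": 1, "email": "a@example.com", "updated_at": "2016-01-01"},
    {"customer_id": 1, "email": "a.new@example.com", "updated_at": "2016-03-01"},
    {"customer_id": 2, "email": "b@example.com", "updated_at": "2016-02-01"},
]

unique_rows = deduplicate(rows, key_fields=("customer_id",), order_field="updated_at")
```

The same shape carries over to Hive or Pig: group on the natural key, then apply whatever rule picks the surviving record within each group.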
