Removing duplicate and similar data from Hadoop

Good day,

Here is what I want to do.

Let's say I follow the example of importing data from Twitter, and person 1 and person 2 write the same text 100 times.

I want to keep that text as a single record, so that I do not analyze the same pattern 100 times.

How can I do that?

Please give detailed steps; this is my first time working with big data.

Thanks


Re: removing duplicate and similar data from hadoop

This can be done, provided person 1 and person 2 write exactly the same text, for example 'good day'.

If you emit the text as the key and NullWritable as the value, the reducer is called once per distinct key, so a MapReduce job whose reducer writes the key once per call will output each text only one time.
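To make the idea concrete, here is a minimal sketch in plain Python that imitates the map, shuffle, and reduce steps locally. It is not Hadoop API code: the function names are made up for illustration, `None` stands in for NullWritable, and sorting stands in for the shuffle that groups equal keys.

```python
from itertools import groupby


def map_phase(records):
    # Emit (text, None) for every record; None plays the role of NullWritable.
    for text in records:
        yield (text, None)


def reduce_phase(pairs):
    # Sorting stands in for the shuffle step, which brings equal keys together.
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for key, _values in groupby(ordered, key=lambda kv: kv[0]):
        # One call per distinct key: write the key once and ignore the values.
        yield key


def dedupe_exact(records):
    # Hypothetical helper that chains the two phases for a local test run.
    return list(reduce_phase(map_phase(records)))


if __name__ == "__main__":
    tweets = ["good day"] * 100 + ["hello"]
    print(dedupe_exact(tweets))
```

In a real job the same logic would live in a Mapper and a Reducer class, with the framework doing the grouping between them.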

One caveat: the text needs to be exactly the same, because MapReduce will treat 'Good Day', 'good Day', and 'good day' as separate keys.
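If you do want those variants to count as the same text, one possible workaround (my assumption, not something the job does for you) is to normalize the text before using it as the key. The sketch below is plain Python with hypothetical function names; it lowercases the text and collapses whitespace, then keeps the first record seen for each normalized key. Truly "similar" text, such as typos or rewording, would need a different technique.

```python
def normalize(text):
    # Lowercase and collapse runs of whitespace so case and spacing
    # differences no longer produce separate keys.
    return " ".join(text.lower().split())


def dedupe_normalized(records):
    # Keep the first original record for each normalized key.
    seen = set()
    unique = []
    for text in records:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


if __name__ == "__main__":
    print(dedupe_normalized(["Good Day", "good Day", "good day"]))
```

In a MapReduce job, the equivalent step would be to apply the normalization in the mapper before emitting the key.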