09-22-2016
01:55 PM
I am trying to find the top 3 words in a given document using Hadoop, in order to gain more experience with it. Instead of just the top 3 words I am looking for, my job produces unexpected output. The input document is a .txt file.

#!/usr/bin/env python
# Mapper: emits "word<TAB>1" for every whitespace-separated word on stdin.
import sys, time

def parseRecords():
    for line in sys.stdin:
        line = line.strip('\n')
        yield line.split()

def mapper():
    for words in parseRecords():
        for w in words:
            print '%s\t%s' % (w,1)

if __name__=='__main__':
    mapper()

#!/usr/bin/env python
# Reducer: sums the counts per word, then prints the 3 most common words.
import itertools, operator, sys
from collections import Counter

cnt = Counter()

def parsePairs():
    for line in sys.stdin:
        yield tuple(line.strip('\n').split('\t'))

def reducer():
    for key, pairs in itertools.groupby(parsePairs(),
                                        operator.itemgetter(0)):
        count = sum(int(i[1]) for i in pairs)
        cnt[key] += count
    for x, y in cnt.most_common(3):
        print '%s\t%s' % (x, y)

if __name__=='__main__':
    reducer()
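
For comparison, here is a minimal sketch of the reducer logic written as a testable function. It is not a confirmed fix for the job above. It assumes Python 3 (the scripts above are Python 2), and the names `parse_pairs` and `top_words` as well as the sample lines are hypothetical, made up for illustration. The sample list stands in for the sorted `word<TAB>count` lines that Hadoop Streaming would feed the reducer on stdin.

```python
#!/usr/bin/env python3
import itertools
import operator
from collections import Counter


def parse_pairs(lines):
    """Yield (word, count) tuples from 'word<TAB>count' lines."""
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        word, _, count = line.partition('\t')
        yield word, int(count)


def top_words(lines, n=3):
    """Return the n most common words as a list of (word, total) pairs."""
    totals = Counter()
    # groupby only merges adjacent identical keys; accumulating into a
    # Counter keeps the totals correct even if the input is not sorted.
    pairs = parse_pairs(lines)
    for word, group in itertools.groupby(pairs, operator.itemgetter(0)):
        totals[word] += sum(count for _, count in group)
    # Rank only after all input has been consumed.
    return totals.most_common(n)


if __name__ == '__main__':
    # Hypothetical sample input, standing in for sys.stdin.
    sample = [
        "a\t1\n", "a\t1\n", "a\t1\n",
        "b\t1\n", "b\t1\n",
        "c\t1\n",
        "d\t1\n", "d\t1\n", "d\t1\n", "d\t1\n",
    ]
    for word, total in top_words(sample):
        print('%s\t%d' % (word, total))
```

In a streaming job, `top_words(sys.stdin)` would take the place of the sample list. A reducer like this only yields a global top 3 when the job runs a single reduce task (for example with `mapreduce.job.reduces=1`); with several reduce tasks, each one sees only part of the keys and writes its own top 3, which may be one source of unexpected output.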