Created on 05-16-2016 08:39 PM - edited 08-18-2019 05:57 AM
Hi,
I would like to implement a MapReduce job to identify the top-N tweets from a large number of tweets presumably stored in HDFS. As you know, a tweet can have multiple hashtags, so this needs to be considered.
I am using the simple word count example pseudocode to get started as I am new to programming.
At a high level my Map stage reads all tweets in from HDFS and tokenises each tweet, placing a 1 beside each separate word in the tweet.
So the output from my Map would be the following key value pairs:
The 1, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, The 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1
We then have the shuffle and sort phase, which groups the values by key so the Reducer can sum them, giving:
The 2, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1
Before I send my Map output to the Reducer, how could I specify that I am only interested in strings beginning with a '#'? Can I drop strings that don't begin with '#' to speed up the algorithm?
From the sample pseudocode below, could I replace 'docid' with 'tweetid', since this is the unique identifier of the tweet, and 'doc' with 'tweet' to represent the content of the tweet?
I'd appreciate it if you could help me with the pseudocode so that I can get my head around the basics.
Created 05-16-2016 11:00 PM
You can drop non-hashtag strings in your Mapper by emitting only hashtag terms (beginning with "#"). And yes, you can use the tweet identifier as docid, and the tweet text as doc. However, if you don't emit docid (tweet-id) you will lose the connection between tweets and hashtags. You can emit (hashtag, tweet-id), and in the Reducer phase count tweets for each hashtag. In this way, you can detect the most popular hashtags, and the tweets using them.
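That filtering idea can be sketched in Python (illustrative only, not a real Hadoop Mapper; the function name and the assumption of plain-text tweets are mine):

```python
# Sketch of the Map logic described above: emit (hashtag, tweet_id)
# only for tokens beginning with '#', dropping everything else early.
def map_tweet(tweet_id, text):
    pairs = []
    for token in text.split():
        if token.startswith("#"):
            pairs.append((token, tweet_id))
    return pairs
```

In a real job the framework would call this once per input record; emitting the tweet-id as the value is what preserves the hashtag-to-tweet link.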
Created 05-18-2016 11:44 AM
Thanks Predrag,
So would the following pseudocode be correct, I am only learning to program so excuse my ignorance:
Class Mapper
    Method Map(tweet-id a, tweet-text b)
        for all term t ∈ tweet-text b do
            if term t begins with '#' then
                Emit(term t, count 1)

Class Reducer
    Method Reduce(term t, counts [c1, c2, ...])
        sum ← 0
        for all count c ∈ counts [c1, c2, ...] do
            sum ← sum + c
        Emit(term t, count sum)
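The pseudocode above, sketched as runnable Python (the shuffle/sort grouping is simulated here with a dict; in a real Hadoop job the framework does that for you, and the function names are illustrative):

```python
from collections import defaultdict

def mapper(tweet_id, text):
    # Emit (term, 1) only for terms beginning with '#'
    for term in text.split():
        if term.startswith("#"):
            yield (term, 1)

def reducer(term, counts):
    # Sum the partial counts for one hashtag
    total = 0
    for c in counts:
        total += c
    return (term, total)

def run_job(tweets):
    # Simulated shuffle/sort: group mapper output values by key
    grouped = defaultdict(list)
    for tweet_id, text in tweets:
        for term, one in mapper(tweet_id, text):
            grouped[term].append(one)
    return dict(reducer(t, cs) for t, cs in grouped.items())
```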
Created 05-18-2016 01:16 PM
@John Garrigan, with Emit(hashtag, 1) you will be counting hashtags. With Emit(hashtag, tweet-id) you will still be counting hashtags, but you will also have a list of tweets for each of them. And if you restrict the job to a single reducer, you will have all the hashtags together, so they can be sorted by popularity (count).
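A minimal sketch of that variant (Python standing in for the real job; the shuffle is again simulated and all names are assumptions):

```python
from collections import defaultdict

def mapper(tweet_id, text):
    # Emit (hashtag, tweet_id) so the reducer keeps the link
    # between each hashtag and the tweets that used it.
    for term in text.split():
        if term.startswith("#"):
            yield (term, tweet_id)

def reducer(hashtag, tweet_ids):
    # The count is just the number of grouped tweet ids; the
    # ids themselves list the tweets for this hashtag.
    return (hashtag, len(tweet_ids), tweet_ids)

def run_job(tweets):
    grouped = defaultdict(list)
    for tweet_id, text in tweets:
        for tag, tid in mapper(tweet_id, text):
            grouped[tag].append(tid)
    return {tag: reducer(tag, tids) for tag, tids in grouped.items()}
```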
Created 05-18-2016 02:46 PM
Thanks Predrag, I thought you could only send a key-value pair from your mapper? If I emit hashtag and tweet-id, what value does the reducer use for the count?
Created 05-18-2016 03:13 PM
You are emitting (key, value) pairs, but the "value" doesn't have to be a count. If you emit (hashtag, tweet-id), reducers will receive as input (hashtag, [tweetid-1, tweetid-2, ..., tweetid-n]). You can still count the number of tweets by counting the number of entries in the input value array. But now you also have a list of tweets talking about each particular hashtag. That's one way of preprocessing tweets to get ready for search, so when someone searches for a certain hashtag, you can quickly show all related tweets. In the next iteration you can emit from mappers (hashtag, (tweet-id, retweet-count)), and in the reducer sort the tweets for each hashtag by their retweet count.
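That next iteration could look like this (a sketch in Python; the dict-shaped tweet record and the function names are assumptions, not a real Hadoop API):

```python
def mapper(tweet):
    # tweet is assumed to be a dict: {"id": ..., "text": ..., "retweets": ...}
    for term in tweet["text"].split():
        if term.startswith("#"):
            yield (term, (tweet["id"], tweet["retweets"]))

def reducer(hashtag, values):
    # Rank this hashtag's tweets by retweet count, descending
    ranked = sorted(values, key=lambda v: v[1], reverse=True)
    return (hashtag, ranked)
```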
Created 05-18-2016 05:53 PM
If I wanted to list the top 10 tweets, could I include a where clause in the Emit of the reducer method, something along the lines of: Emit(term t, LIMIT(DESC(count sum), 10))?
Created 05-18-2016 03:25 PM
Long ago, when it was available for free on the Internet as a manuscript, I found this book helpful in learning about MapReduce: Data-Intensive Text Processing with MapReduce.
Created 05-18-2016 12:32 PM
@John Garrigan You can filter out the non-relevant terms in the Mapper phase. Doing this will also improve your program's performance, as you are dropping unwanted data at an early stage.
You can use a regex to identify all words that start with a hashtag.
Regex: ^#
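For example, with Python's `re` module (an illustrative sketch; in a Java Mapper you would use `java.util.regex` the same way):

```python
import re

# '^#' matches only at the start of the string, so it keeps
# "#Dog" but not "Lazy#Dog".
HASHTAG = re.compile(r"^#")

words = ["The", "#Lazy", "Dog", "#Dog", "Lazy#Dog"]
hashtags = [w for w in words if HASHTAG.match(w)]
```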
The pseudocode in the above comment looks good. For further optimization, you can introduce a Combiner class, which will reduce network traffic between the Mapper and Reducer phases.
Combiner example code : https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduc...
Combiner concept : http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm
Hope this helps you.
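To illustrate the combiner idea mentioned above (a sketch, not real Hadoop code): a combiner runs the reduce logic on each mapper's local output, so far fewer (key, value) pairs cross the network.

```python
from collections import defaultdict

def combine(pairs):
    # Local, per-mapper aggregation: two ("#Lazy", 1) pairs become
    # one ("#Lazy", 2) before anything is sent to the reducers.
    partial = defaultdict(int)
    for term, count in pairs:
        partial[term] += count
    return sorted(partial.items())
```

For a pure counting job like this one, the reducer logic itself can usually double as the combiner, since summing partial sums gives the same total.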