Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

MapReduce for Twitter Hashtags

avatar
Contributor

Hi,

I would like to implement a MapReduce job to identify the top-N tweets from a large number of tweets presumably stored in HDFS. As you know a tweet can have multiple hashtags so this needs to be considered.

I am using the simple word count example pseudocode to get started as I am new to programming.

At a high level my Map stage reads all tweets in from HDFS and tokenises each tweet, placing a 1 beside each separate word in the tweet.

So the output from my Map would be the following key value pairs:

The 1, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, The 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1

We then have the shuffle and sort phase which performs a count of the values from the pairs.

The 2, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1

Before I send my Maps to the Reducer how could I specify that I am only interested in strings beginning with a '#', can I drop strings that don't begin with '#' to speed up the algorithm?

From the sample pseudocode below, could I replace 'docid' with 'tweetid' since this is the unique identifier of the tweet and 'doc' with tweet to represent the content of the tweet?

I'd appreciate it if you could help me with the pseudocode so that I can get my head around the basics.

4246-word-count.png

1 ACCEPTED SOLUTION

avatar
Master Guru

You can drop non-hashtag strings in your Mapper by emitting only hashtag terms (beginning with "#"). And yes, you can use the tweet identifier as docid, and tweet text as doc. However, if you don't emit docid (tweet-id) you will lose connecction between tweets and hashtags. You can emit (hastag, tweet-id), and in the Reducer phase count tweets for each hash-tag. In this way, you can detect most popular hastags, and the tweets using them.

View solution in original post

8 REPLIES 8

avatar
Master Guru

You can drop non-hashtag strings in your Mapper by emitting only hashtag terms (beginning with "#"). And yes, you can use the tweet identifier as docid, and tweet text as doc. However, if you don't emit docid (tweet-id) you will lose connecction between tweets and hashtags. You can emit (hastag, tweet-id), and in the Reducer phase count tweets for each hash-tag. In this way, you can detect most popular hastags, and the tweets using them.

avatar
Contributor

Thanks Predrag,

So would the following pseudocode be correct, I am only learning to program so excuse my ignorance:

Class Mapper

Method Map(tweet-id a, words b)

for all term t E words b do

Emit(term t where term begins with #, count 1)

Class Reducer

Method Reduce(term t, counts [c1,c2,....])

sum <- 0

for all count c E counts [c1,c2,c3,....] do

sum <- sum + c

Emit (term t, count sum)

avatar
Master Guru

@John Garrigan, with emit(hashtag, 1) you will be counting hashtags. With emit(hastag, twit-id) you will be also counting hash tags and also have a list of twits for each of them. And if you restrict the reducer count to 1 you will have hashtags sorted by popularity (count).

avatar
Contributor

Thanks Predrag, I though you could only send from your mapper a key value pair? If I emit hashtag and twit-id what value does the reducer use for the count?

avatar
Master Guru

You are emitting (key,value) pairs, but "value" doesn't have to be a count. If you emit (hashtag, twit-id), reducers will receive on input (hashtag, [twitid-1, twitid-2, ..., twitid-n]). You can still count the number of tweets by counting the number of entries in the input value array. But now you also have a list of twits talking about each particular hashtag. That's one way of preprocessing twits, getting ready for search, so when someone searches for a certain hashtag, you can quickly show all related twits. In the next iteration you can emit from mappers (hashtag, (twitid, retwit-count)), and in the reducer sort twits for each hashatg by their rewtit count.

avatar
Contributor

If I wanted to list the top 10 tweets could I include a where clause in the Emit of the reducer method, something along the lines of: Emit(terms t, LIMIT(DESC(count sum),10)?

avatar
Master Guru

Long ago, when it was available free on Internet as a manuscript, I found this book helpful in learning about MapReduce: Data-Intensive Text Processing with MapReduce.

avatar
Super Collaborator

@John Garrigan You can filter out the non-relevant terms in Mapper phase. Doing this will also increase your program performance as you are dropping unwanted data at early stage.

You can use Regex to identify all words start with hashtag.

Regex : ^#

Psuedocode in above comment looks good .To do more optimization , you can introduce combiner class which will reduce network traffic between mapper and reducer phase.

Combiner example code : https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduc...

Combiner concept : http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm

Hope this helps you