Support Questions
Find answers, ask questions, and share your expertise

MapReduce for Twitter Hashtags

Solved Go to solution

MapReduce for Twitter Hashtags

Explorer

Hi,

I would like to implement a MapReduce job to identify the top-N tweets from a large number of tweets presumably stored in HDFS. As you know a tweet can have multiple hashtags so this needs to be considered.

I am using the simple word count example pseudocode to get started as I am new to programming.

At a high level my Map stage reads all tweets in from HDFS and tokenises each tweet, placing a 1 beside each separate word in the tweet.

So the output from my Map would be the following key value pairs:

The 1, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, The 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1

We then have the shuffle and sort phase which performs a count of the values from the pairs.

The 2, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1

Before I send my Maps to the Reducer how could I specify that I am only interested in strings beginning with a '#', can I drop strings that don't begin with '#' to speed up the algorithm?

From the sample pseudocode below, could I replace 'docid' with 'tweetid' since this is the unique identifier of the tweet and 'doc' with tweet to represent the content of the tweet?

I'd appreciate it if you could help me with the pseudocode so that I can get my head around the basics.

4246-word-count.png

1 ACCEPTED SOLUTION

Accepted Solutions

Re: MapReduce for Twitter Hashtags

You can drop non-hashtag strings in your Mapper by emitting only hashtag terms (beginning with "#"). And yes, you can use the tweet identifier as docid, and tweet text as doc. However, if you don't emit docid (tweet-id) you will lose connecction between tweets and hashtags. You can emit (hastag, tweet-id), and in the Reducer phase count tweets for each hash-tag. In this way, you can detect most popular hastags, and the tweets using them.

View solution in original post

8 REPLIES 8

Re: MapReduce for Twitter Hashtags

You can drop non-hashtag strings in your Mapper by emitting only hashtag terms (beginning with "#"). And yes, you can use the tweet identifier as docid, and tweet text as doc. However, if you don't emit docid (tweet-id) you will lose connecction between tweets and hashtags. You can emit (hastag, tweet-id), and in the Reducer phase count tweets for each hash-tag. In this way, you can detect most popular hastags, and the tweets using them.

View solution in original post

Re: MapReduce for Twitter Hashtags

Explorer

Thanks Predrag,

So would the following pseudocode be correct, I am only learning to program so excuse my ignorance:

Class Mapper

Method Map(tweet-id a, words b)

for all term t E words b do

Emit(term t where term begins with #, count 1)

Class Reducer

Method Reduce(term t, counts [c1,c2,....])

sum <- 0

for all count c E counts [c1,c2,c3,....] do

sum <- sum + c

Emit (term t, count sum)

Re: MapReduce for Twitter Hashtags

@John Garrigan, with emit(hashtag, 1) you will be counting hashtags. With emit(hastag, twit-id) you will be also counting hash tags and also have a list of twits for each of them. And if you restrict the reducer count to 1 you will have hashtags sorted by popularity (count).

Re: MapReduce for Twitter Hashtags

Explorer

Thanks Predrag, I though you could only send from your mapper a key value pair? If I emit hashtag and twit-id what value does the reducer use for the count?

Re: MapReduce for Twitter Hashtags

You are emitting (key,value) pairs, but "value" doesn't have to be a count. If you emit (hashtag, twit-id), reducers will receive on input (hashtag, [twitid-1, twitid-2, ..., twitid-n]). You can still count the number of tweets by counting the number of entries in the input value array. But now you also have a list of twits talking about each particular hashtag. That's one way of preprocessing twits, getting ready for search, so when someone searches for a certain hashtag, you can quickly show all related twits. In the next iteration you can emit from mappers (hashtag, (twitid, retwit-count)), and in the reducer sort twits for each hashatg by their rewtit count.

Re: MapReduce for Twitter Hashtags

Explorer

If I wanted to list the top 10 tweets could I include a where clause in the Emit of the reducer method, something along the lines of: Emit(terms t, LIMIT(DESC(count sum),10)?

Re: MapReduce for Twitter Hashtags

Long ago, when it was available free on Internet as a manuscript, I found this book helpful in learning about MapReduce: Data-Intensive Text Processing with MapReduce.

Re: MapReduce for Twitter Hashtags

Expert Contributor

@John Garrigan You can filter out the non-relevant terms in Mapper phase. Doing this will also increase your program performance as you are dropping unwanted data at early stage.

You can use Regex to identify all words start with hashtag.

Regex : ^#

Psuedocode in above comment looks good .To do more optimization , you can introduce combiner class which will reduce network traffic between mapper and reducer phase.

Combiner example code : https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduc...

Combiner concept : http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm

Hope this helps you