Member since: 03-20-2016
Posts: 19
Kudos Received: 11
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2176 | 03-26-2016 11:57 AM
05-18-2016 05:53 PM
If I wanted to list the top 10 hashtags, could I include a where clause in the Emit of the reducer method, something along the lines of: Emit(term t, LIMIT(DESC(count sum), 10))?
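For what it's worth, nothing like LIMIT or DESC exists inside Emit; the usual approach is a top-N pattern in which the reducer keeps a small sorted structure in memory and only emits in its cleanup step. A minimal sketch under that assumption (class name invented, and it assumes a single reducer so the top 10 is global):

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int N = 10;
        // Sorted by total count; capped at N entries. Note: two terms with the
        // same count collide here; a real job would use a composite key.
        private final TreeMap<Integer, String> topN = new TreeMap<>();

        @Override
        protected void reduce(Text term, Iterable<IntWritable> counts, Context ctx) {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            topN.put(sum, term.toString());
            if (topN.size() > N) topN.remove(topN.firstKey()); // drop the smallest
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            // Emit the surviving entries, highest count first.
            for (Map.Entry<Integer, String> e : topN.descendingMap().entrySet()) {
                ctx.write(new Text(e.getValue()), new IntWritable(e.getKey()));
            }
        }
    }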
05-18-2016 02:46 PM
Thanks Predrag, I thought you could only send a key-value pair from your mapper? If I emit hashtag and tweet-id, what value does the reducer use for the count?
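The (hashtag, tweet-id) pair is itself the key-value pair, so the reducer does not need a 1 to count: it can simply count how many tweet-ids arrive for each hashtag key. A hedged sketch (class name invented; the Set also stops a hashtag repeated inside one tweet from being counted twice):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class HashtagByTweetReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text hashtag, Iterable<Text> tweetIds, Context ctx)
                throws IOException, InterruptedException {
            // Count distinct tweet ids, so one tweet contributes at most once
            // per hashtag, however many times it repeats it.
            Set<String> distinct = new HashSet<>();
            for (Text id : tweetIds) distinct.add(id.toString());
            ctx.write(hashtag, new IntWritable(distinct.size()));
        }
    }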
05-18-2016 11:44 AM
Thanks Predrag,
So would the following pseudocode be correct? I am only learning to program, so excuse my ignorance:

    class Mapper
        method Map(tweet-id a, words b)
            for all term t ∈ words b do
                if t begins with '#' then
                    Emit(term t, count 1)

    class Reducer
        method Reduce(term t, counts [c1, c2, ...])
            sum <- 0
            for all count c ∈ counts [c1, c2, ...] do
                sum <- sum + c
            Emit(term t, count sum)
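For reference, a minimal runnable translation of that pseudocode into Java against Hadoop's org.apache.hadoop.mapreduce API, adapted from the stock word-count example (class and job names are invented):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HashtagCount {

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text hashtag = new Text();

            @Override
            public void map(Object key, Text tweet, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(tweet.toString());
                while (itr.hasMoreTokens()) {
                    String term = itr.nextToken();
                    if (term.startsWith("#")) {      // keep hashtags only
                        hashtag.set(term);
                        ctx.write(hashtag, ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text term, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(term, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "hashtag count");
            job.setJarByClass(HashtagCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);   // safe: summing is associative
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }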
05-16-2016 08:39 PM
Hi, I would like to implement a MapReduce job to identify the top-N hashtags from a large number of tweets, presumably stored in HDFS. As you know, a tweet can have multiple hashtags, so this needs to be considered. I am using the simple word-count example pseudocode to get started, as I am new to programming.

At a high level, my Map stage reads all tweets in from HDFS and tokenises each tweet, placing a 1 beside each separate word in the tweet. So the output from my Map would be the following key-value pairs:

The 1, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, The 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1

We then have the shuffle and sort phase, which groups identical keys so their counts can be summed:

The 2, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1

Before I send my Maps to the Reducer, how could I specify that I am only interested in strings beginning with a '#'? Can I drop strings that don't begin with '#' to speed up the algorithm? From the sample pseudocode below, could I replace 'docid' with 'tweetid', since this is the unique identifier of the tweet, and 'doc' with 'tweet' to represent the content of the tweet? I'd appreciate it if you could help me with the pseudocode so that I can get my head around the basics.
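Filtering in the mapper is exactly the right instinct: tokens that don't begin with '#' are dropped before the shuffle, so they are never sorted or sent across the network. A sketch of just that map function (assuming one tweet's text per input value; names invented):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HashtagMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object tweetId, Text tweet, Context ctx)
                throws IOException, InterruptedException {
            for (String term : tweet.toString().split("\\s+")) {
                if (term.startsWith("#")) {   // keep hashtags, drop everything else
                    ctx.write(new Text(term), ONE);
                }
            }
        }
    }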
Labels:
- Apache Hadoop
05-15-2016 11:25 AM
Hi, I am having difficulty understanding the concept of buckets/clusters in Hive. My understanding so far is that partitioning a table optimises query performance: rather than scanning the entire table, Hive scans only the partition of interest, e.g. find employee details where state = 'NYC'. It will just read the NYC partition and return those employee details, correct? These partitions are stored as separate directories/files in HDFS.

What is a bucket, and why would one use buckets rather than partitions? I take it a bucket and a cluster are the same beast, just that you use "CLUSTERED BY" to create the buckets?
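Roughly: a partition maps each distinct value of a column to its own HDFS directory, while bucketing hashes a column into a fixed number of files within each partition, which suits high-cardinality columns, sampling, and map-side joins. An illustrative HiveQL sketch with invented table and column names:

    -- Illustrative only: names are invented.
    CREATE TABLE employees (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    PARTITIONED BY (state STRING)      -- one HDFS directory per state
    CLUSTERED BY (id) INTO 32 BUCKETS  -- hash(id) % 32 picks one of 32 files
    STORED AS ORC;

    -- A query that names the partition column only touches that directory:
    SELECT name FROM employees WHERE state = 'NYC';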
Labels:
- Apache Hive
05-14-2016 10:32 AM
2 Kudos
Hi, can anyone elaborate on why Pig and Hive are said to be better suited to unstructured and structured data respectively? My understanding of structured data is data that follows a particular schema, and beyond that I have very little knowledge. Is there a limitation with CSV files and variable-length fields that Pig can handle more easily?
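Part of the usual answer is that Pig applies schema on load, and only optionally: you can load ragged records with no declared schema at all and impose types only on the fields you touch. A hypothetical Pig Latin sketch (path and field position invented):

    -- Load CSV lines with no declared schema; each record is a tuple of
    -- untyped fields, so rows with differing field counts still load.
    raw = LOAD '/data/tweets.csv' USING PigStorage(',');

    -- Impose a type on only the fields we actually use, when we use them.
    users = FOREACH raw GENERATE (chararray)$0 AS user;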
Labels:
- Apache Hive
- Apache Pig
03-27-2016 12:25 AM
1 Kudo
I have a Hive table with a number of columns, where column X contains a large string of text with many spaces between the words; all delimiters have been removed and all that remains are 0-9, a-z and A-Z characters.
I would like to query column X for a keyword Y and a related keyword Y′ (e.g. Java and Javascript) and count the number of unique users from column Z that have mentioned these words. Either word need only be mentioned once in column X for the row to count. I would then like the total number of unique users that mentioned either of the two words.
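One hedged way to express this in HiveQL, with invented table and column names (tweets, x, z): RLIKE takes a Java regular expression, and the optional (script)? group matches either word in a single pass.

    -- Illustrative names only. Counts each user once, however many rows
    -- or mentions they have.
    SELECT COUNT(DISTINCT z) AS users_mentioning_either
    FROM tweets
    WHERE lower(x) RLIKE '\\bjava(script)?\\b';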
Labels:
- Apache Hive
03-26-2016 11:57 AM
So I finally figured out my problem; it had to do with my own impatience! Although Hive said the query had run successfully and returned no results, it was still working away in the background and eventually spat back some results. The results were not correct, but that's a problem for another day. I changed the settings as per the linked article above, which required an Admin profile to do, and it eventually worked.
03-25-2016 09:51 PM
I came across this article and changed the recommended settings, but when I now run the above query I receive no errors and no results, even though the status is "completed". Does anyone have any other suggestions, please?