Member since: 03-20-2016
Posts: 19
Kudos Received: 11
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2176 | 03-26-2016 11:57 AM
05-18-2016 05:53 PM
If I wanted to list the top 10 hashtags, could I include a where clause in the Emit of the reducer method, something along the lines of: Emit(term t, LIMIT(DESC(count sum), 10))?
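For what it's worth, nothing like LIMIT or DESC exists inside Emit; the usual approach is a top-N pattern in which the reducer keeps a small sorted structure in memory and only emits in its cleanup step. A minimal sketch under that assumption (class name invented, and it assumes a single reducer so the top 10 is global):

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int N = 10;
        // Sorted by total count; capped at N entries. Note: two terms with the
        // same count collide here; a real job would use a composite key.
        private final TreeMap<Integer, String> topN = new TreeMap<>();

        @Override
        protected void reduce(Text term, Iterable<IntWritable> counts, Context ctx) {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            topN.put(sum, term.toString());
            if (topN.size() > N) topN.remove(topN.firstKey()); // drop the smallest
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            // Emit the surviving entries, highest count first.
            for (Map.Entry<Integer, String> e : topN.descendingMap().entrySet()) {
                ctx.write(new Text(e.getValue()), new IntWritable(e.getKey()));
            }
        }
    }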
05-18-2016 02:46 PM
Thanks Predrag, I thought you could only send a key-value pair from your mapper? If I emit hashtag and tweet-id, what value does the reducer use for the count?
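The (hashtag, tweet-id) pair is itself the key-value pair, so the reducer does not need a 1 to count: it can simply count how many tweet-ids arrive for each hashtag key. A hedged sketch (class name invented; the Set also stops a hashtag repeated inside one tweet from being counted twice):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class HashtagByTweetReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text hashtag, Iterable<Text> tweetIds, Context ctx)
                throws IOException, InterruptedException {
            // Count distinct tweet ids, so one tweet contributes at most once
            // per hashtag, however many times it repeats it.
            Set<String> distinct = new HashSet<>();
            for (Text id : tweetIds) distinct.add(id.toString());
            ctx.write(hashtag, new IntWritable(distinct.size()));
        }
    }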
05-18-2016 11:44 AM
Thanks Predrag,
So would the following pseudocode be correct? I am only learning to program, so excuse my ignorance:

    class Mapper
        method Map(tweet-id a, words b)
            for all term t ∈ words b do
                if t begins with '#' then
                    Emit(term t, count 1)

    class Reducer
        method Reduce(term t, counts [c1, c2, ...])
            sum <- 0
            for all count c ∈ counts [c1, c2, ...] do
                sum <- sum + c
            Emit(term t, count sum)
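For reference, a minimal runnable translation of that pseudocode into Java against Hadoop's org.apache.hadoop.mapreduce API, adapted from the stock word-count example (class and job names are invented):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HashtagCount {

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text hashtag = new Text();

            @Override
            public void map(Object key, Text tweet, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(tweet.toString());
                while (itr.hasMoreTokens()) {
                    String term = itr.nextToken();
                    if (term.startsWith("#")) {      // keep hashtags only
                        hashtag.set(term);
                        ctx.write(hashtag, ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text term, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(term, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "hashtag count");
            job.setJarByClass(HashtagCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);   // safe: summing is associative
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }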
05-16-2016 08:39 PM
Hi, I would like to implement a MapReduce job to identify the top-N hashtags from a large number of tweets, presumably stored in HDFS. As you know, a tweet can have multiple hashtags, so this needs to be considered. I am using the simple word-count example pseudocode to get started, as I am new to programming.

At a high level, my Map stage reads all tweets in from HDFS and tokenises each tweet, placing a 1 beside each separate word in the tweet. So the output from my Map would be the following key-value pairs:

The 1, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, The 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1

We then have the shuffle and sort phase, which groups identical keys so their counts can be summed:

The 2, Quick 1, Brown 1, Fox 1, Jumps 1, Over 1, Lazy 1, Dog 1, #Lazy 1, #Dog 1

Before I send my Maps to the Reducer, how could I specify that I am only interested in strings beginning with a '#'? Can I drop strings that don't begin with '#' to speed up the algorithm? From the sample pseudocode below, could I replace 'docid' with 'tweetid', since this is the unique identifier of the tweet, and 'doc' with 'tweet' to represent the content of the tweet? I'd appreciate it if you could help me with the pseudocode so that I can get my head around the basics.
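Filtering in the mapper is exactly the right instinct: tokens that don't begin with '#' are dropped before the shuffle, so they are never sorted or sent across the network. A sketch of just that map function (assuming one tweet's text per input value; names invented):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HashtagMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object tweetId, Text tweet, Context ctx)
                throws IOException, InterruptedException {
            for (String term : tweet.toString().split("\\s+")) {
                if (term.startsWith("#")) {   // keep hashtags, drop everything else
                    ctx.write(new Text(term), ONE);
                }
            }
        }
    }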
Labels:
- Apache Hadoop
05-15-2016 11:25 AM
Hi, I am having difficulty understanding the concept of buckets/clusters in Hive. My understanding so far is that partitioning a table optimises query performance: rather than scanning the entire table, Hive scans only the partition of interest, e.g. find employee details where state = 'NYC'. It will just read the NYC partition and return those employee details, correct? These partitions are stored as separate directories/files in HDFS.

What is a bucket, and why would one use buckets rather than partitions? I take it a bucket and a cluster are the same beast, just that you use "CLUSTERED BY" to create the buckets?
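Roughly: a partition maps each distinct value of a column to its own HDFS directory, while bucketing hashes a column into a fixed number of files within each partition, which suits high-cardinality columns, sampling, and map-side joins. An illustrative HiveQL sketch with invented table and column names:

    -- Illustrative only: names are invented.
    CREATE TABLE employees (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    PARTITIONED BY (state STRING)      -- one HDFS directory per state
    CLUSTERED BY (id) INTO 32 BUCKETS  -- hash(id) % 32 picks one of 32 files
    STORED AS ORC;

    -- A query that names the partition column only touches that directory:
    SELECT name FROM employees WHERE state = 'NYC';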
Labels:
- Apache Hive
05-14-2016 10:32 AM
2 Kudos
Hi, can anyone elaborate on why Pig and Hive are said to be better suited to unstructured and structured data respectively? My understanding of structured data is data that follows a particular schema, and beyond that I have very little knowledge. Is there a limitation with CSV files and variable-length fields that Pig can handle more easily?
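Part of the usual answer is that Pig applies schema on load, and only optionally: you can load ragged records with no declared schema at all and impose types only on the fields you touch. A hypothetical Pig Latin sketch (path and field position invented):

    -- Load CSV lines with no declared schema; each record is a tuple of
    -- untyped fields, so rows with differing field counts still load.
    raw = LOAD '/data/tweets.csv' USING PigStorage(',');

    -- Impose a type on only the fields we actually use, when we use them.
    users = FOREACH raw GENERATE (chararray)$0 AS user;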
Labels:
- Apache Hive
- Apache Pig
03-27-2016 12:25 AM
1 Kudo
I have a Hive table with a number of columns, where column X contains a large string of text with many spaces between the words; all delimiters have been removed and all that remains are 0-9, a-z and A-Z characters.
I would like to query column X for a keyword Y and a related keyword Y′ (e.g. Java and Javascript) and count the number of unique users from column Z that have mentioned these words. Either word need only be mentioned once in column X for the row to count. I would then like the total number of unique users that mentioned either of the two words.
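One hedged way to express this in HiveQL, with invented table and column names (tweets, x, z): RLIKE takes a Java regular expression, and the optional (script)? group matches either word in a single pass.

    -- Illustrative names only. Counts each user once, however many rows
    -- or mentions they have.
    SELECT COUNT(DISTINCT z) AS users_mentioning_either
    FROM tweets
    WHERE lower(x) RLIKE '\\bjava(script)?\\b';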
Labels:
- Apache Hive
03-26-2016 11:57 AM
So I finally figured out my problem; it had to do with my own impatience! Although Hive said the query had run successfully and returned no results, it was still working away in the background and eventually spat back some results. The results were not correct, but that's a problem for another day. I changed the settings as per the linked article above, which required an Admin profile to do, and it eventually worked.
03-25-2016 09:51 PM
I came across this article and changed the recommended settings, but when I now run the above query I receive no errors and no results, even though the status is "completed". Does anyone have any other suggestions, please?