Member since: 03-20-2016
Posts: 19
Kudos Received: 11
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2162 | 03-26-2016 11:57 AM
05-18-2016
05:53 PM
If I wanted to list the top 10 tweets, could I include a where clause in the Emit of the reducer method, something along the lines of: Emit(term t, LIMIT(DESC(count sum), 10))?
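For what it's worth, the Emit in classic MapReduce pseudocode has no built-in LIMIT; a top-N is normally taken after the counts have been aggregated. A minimal HiveQL sketch of the same idea, assuming a hypothetical table term_counts(term, cnt) that holds the per-term counts:

```sql
-- Top 10 terms by total count; term_counts(term STRING, cnt BIGINT)
-- is a hypothetical table holding the (possibly partial) reducer output.
SELECT term, SUM(cnt) AS total
FROM term_counts
GROUP BY term
ORDER BY total DESC
LIMIT 10;
```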
05-15-2016
12:34 PM
2 Kudos
"My understanding so far is that partitioning a table optimises the performance of queries such that rather than performing the query on the entire table it performs the query only on the partition of interest e.g. find employee details where state = NYC. It will just query the NYC partition and return the employee details, correct? These partitions are stored in separate directories/files in HDFS." Correct "What is a bucket and why would one use them rather than partitions? I take it a bucket and cluster are the same beast just that you use "clusteredby" to create the buckets?" You are correct and buckets are essentially files in these partition folders. Every bucket = one file. You can find the reasoning and the uses for them here: https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html
05-01-2019
03:18 PM
Can someone tell me a scenario where Pig is the only option, and a scenario where Hive is the only option?
04-16-2016
02:15 PM
How do you log in with the admin profile? I am using Maria_dev, and everything is greyed out in config; I could not change anything.
03-21-2016
10:33 PM
4 Kudos
You've mentioned Python to implement TF-IDF, but unless you absolutely have to use Python for some other reason, you could consider implementing the same algorithm in Hive SQL instead. That way, it will run in parallel without any extra work. Take a look at the Wikipedia article on TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Here's one sample SQL implementation of TF-IDF that you could build Hive SQL from by ignoring all the index-related stuff: https://gist.github.com/sumanthprabhu/8067221
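To make the parallel-by-default point concrete, here is a minimal HiveQL sketch of TF-IDF, assuming a hypothetical table tokens(doc_id, word) with one row per word occurrence:

```sql
WITH tf AS (       -- term frequency: how often a word appears in a document
  SELECT doc_id, word, COUNT(*) AS term_count
  FROM tokens GROUP BY doc_id, word
),
doc_len AS (       -- total words per document
  SELECT doc_id, COUNT(*) AS total_terms
  FROM tokens GROUP BY doc_id
),
df AS (            -- document frequency: in how many documents a word appears
  SELECT word, COUNT(DISTINCT doc_id) AS doc_freq
  FROM tokens GROUP BY word
),
corpus AS (        -- total number of documents
  SELECT COUNT(DISTINCT doc_id) AS n_docs FROM tokens
)
SELECT tf.doc_id, tf.word,
       (tf.term_count / dl.total_terms) * LN(c.n_docs / df.doc_freq) AS tf_idf
FROM tf
JOIN doc_len dl ON tf.doc_id = dl.doc_id
JOIN df ON tf.word = df.word
CROSS JOIN corpus c
ORDER BY tf_idf DESC;
```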
03-21-2016
10:33 AM
Hello John, I think there has been some confusion: the JARs need to be on the client/HiveServer nodes of the cluster, on the local Linux file system, in /usr/hdp/<version>/hive/auxlib. If you put a JAR in there, you don't need to do another ADD JAR. If you do an ADD JAR, you also need the JAR on the local file system, and which machine that is depends on what you use: with the Hive client, it is your client machine; with Beeline or JDBC, it is the machine running HiveServer2.
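A minimal sketch of the ADD JAR route; the JAR path and UDF class name are hypothetical:

```sql
-- The path is local to wherever the session runs: the machine with the
-- hive CLI, or the HiveServer2 host when connecting via Beeline/JDBC.
ADD JAR /tmp/my-udfs.jar;                                 -- hypothetical path
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf';  -- hypothetical class
```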