Member since: 03-20-2016
Posts: 19
Kudos Received: 11
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2162 | 03-26-2016 11:57 AM
05-18-2016
05:53 PM
If I wanted to list the top 10 tweets, could I include a where clause in the Emit of the reducer method, something along the lines of: Emit(term t, LIMIT(DESC(count sum), 10))?
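For what it's worth, the Emit in classic MapReduce pseudocode has no built-in LIMIT; a top-N is normally taken after the counts have been aggregated. A minimal HiveQL sketch of the same idea, assuming a hypothetical table term_counts(term, cnt) that holds the per-term counts:

```sql
-- Top 10 terms by total count; term_counts(term STRING, cnt BIGINT)
-- is a hypothetical table holding the (possibly partial) reducer output.
SELECT term, SUM(cnt) AS total
FROM term_counts
GROUP BY term
ORDER BY total DESC
LIMIT 10;
```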
05-15-2016
12:34 PM
2 Kudos
"My understanding so far is that partitioning a table optimises the performance of queries such that rather than performing the query on the entire table it performs the query only on the partition of interest e.g. find employee details where state = NYC. It will just query the NYC partition and return the employee details, correct? These partitions are stored in separate directories/files in HDFS." Correct "What is a bucket and why would one use them rather than partitions? I take it a bucket and cluster are the same beast just that you use "clusteredby" to create the buckets?" You are correct and buckets are essentially files in these partition folders. Every bucket = one file. You can find the reasoning and the uses for them here: https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html
05-01-2019
03:18 PM
Can someone tell me a scenario where Pig is the only option, and a scenario where Hive is the only option?
04-16-2016
02:15 PM
How do you log in with the admin profile? I am using Maria_dev, and everything is greyed out in config; I could not change anything.
03-21-2016
10:33 PM
4 Kudos
You've mentioned Python to implement TF-IDF, but unless you absolutely have to use Python for some other reason, you could consider implementing the same algorithm in Hive SQL instead. That way, it will run in parallel without any extra work. Take a look at the Wikipedia article on TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Here's one sample SQL implementation of TF-IDF that you could build Hive SQL from by ignoring all the index-related stuff: https://gist.github.com/sumanthprabhu/8067221
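To make the parallel-by-default point concrete, here is a minimal HiveQL sketch of TF-IDF, assuming a hypothetical table tokens(doc_id, word) with one row per word occurrence:

```sql
WITH tf AS (       -- term frequency: how often a word appears in a document
  SELECT doc_id, word, COUNT(*) AS term_count
  FROM tokens GROUP BY doc_id, word
),
doc_len AS (       -- total words per document
  SELECT doc_id, COUNT(*) AS total_terms
  FROM tokens GROUP BY doc_id
),
df AS (            -- document frequency: in how many documents a word appears
  SELECT word, COUNT(DISTINCT doc_id) AS doc_freq
  FROM tokens GROUP BY word
),
corpus AS (        -- total number of documents
  SELECT COUNT(DISTINCT doc_id) AS n_docs FROM tokens
)
SELECT tf.doc_id, tf.word,
       (tf.term_count / dl.total_terms) * LN(c.n_docs / df.doc_freq) AS tf_idf
FROM tf
JOIN doc_len dl ON tf.doc_id = dl.doc_id
JOIN df ON tf.word = df.word
CROSS JOIN corpus c
ORDER BY tf_idf DESC;
```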
03-21-2016
10:33 AM
Hello John, I think there has been some confusion: the JARs need to be on the client/HiveServer nodes of the cluster, on the local Linux file system, in /usr/hdp/<version>/hive/auxlib. If you put a JAR in there, you don't need to do another ADD JAR. If you do an ADD JAR, you also need the JAR on the local file system, and which machine that is depends on what you use: with the Hive client, it is your client machine; with Beeline or JDBC, it is the machine running HiveServer2.
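A minimal sketch of the ADD JAR route; the JAR path and UDF class name are hypothetical:

```sql
-- The path is local to wherever the session runs: the machine with the
-- hive CLI, or the HiveServer2 host when connecting via Beeline/JDBC.
ADD JAR /tmp/my-udfs.jar;                                 -- hypothetical path
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf';  -- hypothetical class
```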