Hive Clusters/Buckets

Hi,

I am having difficulty understanding the concept of buckets/clusters in Hive.

My understanding so far is that partitioning a table improves query performance: rather than running the query against the entire table, Hive runs it only against the partition of interest. For example, to find employee details where state = NYC, it will query just the NYC partition and return the employee details, correct? These partitions are stored in separate directories/files in HDFS.

What is a bucket, and why would one use buckets rather than partitions? I take it a bucket and a cluster are the same beast, just that you use "CLUSTERED BY" to create the buckets?

Accepted Solutions

Re: Hive Clusters/Buckets

"My understanding so far is that partitioning a table optimises the performance of queries such that rather than performing the query on the entire table it performs the query only on the partition of interest e.g. find employee details where state = NYC. It will just query the NYC partition and return the employee details, correct? These partitions are stored in separate directories/files in HDFS."

Correct
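To make the partition point concrete, here is a minimal Python sketch of partition pruning. It is only a model of the idea, not Hive code: the `TABLE` dictionary, the `state=...` directory names, the sample rows, and the `query_by_state` helper are all made up for illustration.

```python
# Hypothetical layout: one directory per partition value, e.g.
# .../employees/state=NY/. Names and rows are invented for this sketch.
TABLE = {
    "state=NY": [("alice", "NY"), ("bob", "NY")],
    "state=CA": [("carol", "CA")],
    "state=TX": [("dave", "TX")],
}


def query_by_state(table, state):
    """Model of: SELECT * FROM employees WHERE state = <state>.

    Returns the matching rows and the list of partitions that were read.
    """
    wanted = f"state={state}"
    # Partition pruning: only the matching directory is opened at all;
    # the other partitions are never scanned.
    partitions_read = [p for p in table if p == wanted]
    rows = [row for p in partitions_read for row in table[p]]
    return rows, partitions_read


rows, partitions_read = query_by_state(TABLE, "NY")
print(rows)
print(partitions_read)
```

The filter on the partition column decides which directories are touched, so the cost of the query depends on the size of one partition rather than the whole table.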

"What is a bucket and why would one use them rather than partitions? I take it a bucket and cluster are the same beast just that you use "clusteredby" to create the buckets?"

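You are correct: "cluster" and "bucket" refer to the same thing, and CLUSTERED BY is how you declare the buckets. Buckets are essentially files inside these partition folders, with every bucket stored as one file. You can find the reasoning and the uses for them here: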

https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html
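As a rough illustration of "every bucket = one file", the sketch below assigns rows inside a single partition to a fixed number of buckets by hashing the clustering column. This is an assumption-laden model: the bucket count, the `employee_id` column, the `bucket_for` and `bucket_rows` helpers, and the printed file names are hypothetical, and the hash used here (the integer key modulo the bucket count) is a stand-in. The hash Hive actually applies depends on the column type and the Hive version.

```python
NUM_BUCKETS = 4  # hypothetical; in Hive this is the N in "INTO N BUCKETS"


def bucket_for(employee_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Pick a bucket for a row from its clustering column value.

    Stand-in hash: the integer key itself, forced non-negative,
    taken modulo the number of buckets.
    """
    return (employee_id & 0x7FFFFFFF) % num_buckets


def bucket_rows(rows, num_buckets: int = NUM_BUCKETS):
    """Group rows of one partition into buckets (one file per bucket)."""
    buckets = {b: [] for b in range(num_buckets)}
    for employee_id, name in rows:
        buckets[bucket_for(employee_id, num_buckets)].append((employee_id, name))
    return buckets


# Rows that all live in the same (hypothetical) partition folder state=NY.
rows = [(101, "alice"), (102, "bob"), (105, "carol"), (108, "dave")]
files = bucket_rows(rows)
for bucket, contents in files.items():
    print(f"state=NY/bucket_{bucket}: {contents}")
```

Because the same key always lands in the same bucket, rows with equal `employee_id` values end up in the same file, which is what makes bucketed tables useful for sampling and for joins on the clustering column.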
