Hive Clusters/Buckets

Hi,

I am having difficulty understanding the concept of buckets/clusters in Hive.

My understanding so far is that partitioning a table improves query performance: rather than scanning the entire table, Hive reads only the partition of interest. For example, a query for employee details where state = NYC reads just the NYC partition and returns the matching rows, correct? These partitions are stored in separate directories in HDFS.

What is a bucket, and why would one use buckets rather than partitions? I take it a bucket and a cluster are the same beast, and that you use CLUSTERED BY to create the buckets?

1 ACCEPTED SOLUTION

"My understanding so far is that partitioning a table improves query performance: rather than scanning the entire table, Hive reads only the partition of interest. For example, a query for employee details where state = NYC reads just the NYC partition and returns the matching rows, correct? These partitions are stored in separate directories in HDFS."

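Correct. Each partition value gets its own directory, so Hive can skip the directories that do not match the filter.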
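
To make the pruning idea concrete, here is a rough Python sketch of a table partitioned by state. It is a toy model, not Hive code: the warehouse paths, the employee rows, and the query_by_state helper are all made up for illustration. Only the state=<value> directory naming mirrors how Hive lays out partitions.

```python
# Toy model of a table partitioned by `state`: one directory per partition value.
# Paths and rows are hypothetical; this only imitates the directory layout.
partitions = {
    "/warehouse/employees/state=NYC": [("alice", "NYC"), ("bob", "NYC")],
    "/warehouse/employees/state=CA": [("carol", "CA")],
    "/warehouse/employees/state=TX": [("dave", "TX")],
}


def query_by_state(state):
    """Return (rows, directories_read) for a filter like WHERE state = <state>."""
    wanted = "/state=" + state
    # Partition pruning: choose directories by name, never opening the others.
    dirs = [path for path in partitions if path.endswith(wanted)]
    rows = [row for path in dirs for row in partitions[path]]
    return rows, dirs


rows, dirs_read = query_by_state("NYC")
print(rows)
print(dirs_read)
```

Only one of the three directories is read, which is where the speed-up comes from.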

"What is a bucket, and why would one use buckets rather than partitions? I take it a bucket and a cluster are the same beast, and that you use CLUSTERED BY to create the buckets?"

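You are correct: buckets are essentially files inside these partition folders, and every bucket = one file. Rows are assigned to a bucket by hashing the column named in CLUSTERED BY, so the same key always ends up in the same bucket. You can find the reasoning behind buckets and the uses for them here: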

https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html
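
As a rough illustration of how rows end up in bucket files, the Python sketch below groups rows by a hash of the clustering key modulo the bucket count. It rests on several assumptions: CRC32 is only a stand-in (Hive uses its own hash function), NUM_BUCKETS is an arbitrary value playing the role of the n in CLUSTERED BY (col) INTO n BUCKETS, and the employee rows and helper names are hypothetical.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical; stands in for the n in INTO n BUCKETS


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a clustering key to a bucket number via hash modulo bucket count."""
    # CRC32 is a stand-in so results are stable across runs.
    # Hive uses a different hash; only the "hash mod n" idea carries over.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_rows(rows, num_buckets=NUM_BUCKETS):
    """Group (employee_id, name) rows into buckets keyed on employee_id."""
    buckets = {b: [] for b in range(num_buckets)}
    for employee_id, name in rows:
        buckets[bucket_for(employee_id, num_buckets)].append((employee_id, name))
    return buckets


rows = [("e1", "alice"), ("e2", "bob"), ("e3", "carol"), ("e1", "alice-dup")]
buckets = bucket_rows(rows)
for b, contents in buckets.items():
    print(f"bucket file {b}: {contents}")
```

Both rows with key e1 land in the same bucket, which is the property that lets joins and sampling on the clustering column work bucket by bucket instead of over the whole partition.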
