Created 05-15-2016 11:25 AM
Hi,
I am having difficulty understanding the concept of buckets/clusters in Hive.
My understanding so far is that partitioning a table optimises the performance of queries such that rather than performing the query on the entire table it performs the query only on the partition of interest e.g. find employee details where state = NYC. It will just query the NYC partition and return the employee details, correct? These partitions are stored in separate directories/files in HDFS.
What is a bucket, and why would one use buckets rather than partitions? I take it a bucket and a cluster are the same beast, just that you use "CLUSTERED BY" to create the buckets?
Created 05-15-2016 12:34 PM
"My understanding so far is that partitioning a table optimises the performance of queries such that rather than performing the query on the entire table it performs the query only on the partition of interest e.g. find employee details where state = NYC. It will just query the NYC partition and return the employee details, correct? These partitions are stored in separate directories/files in HDFS."
Correct
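To make the pruning idea concrete, here is a rough Python sketch (a toy model, not Hive code). It assumes one directory per value of the partition column; the table path, the extra state values, and the rows are made up for illustration.

```python
# Toy model of partition pruning; this is not how Hive is implemented.
# Assumed layout: one directory per value of the partition column `state`.
# The warehouse path and all rows below are hypothetical.
partitions = {
    "/warehouse/employees/state=NYC": [("alice", "NYC"), ("bob", "NYC")],
    "/warehouse/employees/state=CA": [("carol", "CA")],
    "/warehouse/employees/state=TX": [("dave", "TX")],
}


def query_by_state(state):
    """Return (rows, directories_read) for a filter like WHERE state = <state>."""
    wanted = f"/warehouse/employees/state={state}"
    # Pruning: only the directory matching the filter is read; the rest are skipped.
    dirs_read = [d for d in partitions if d == wanted]
    rows = [row for d in dirs_read for row in partitions[d]]
    return rows, dirs_read


rows, dirs_read = query_by_state("NYC")
print(rows)       # rows from the NYC partition only
print(dirs_read)  # the single directory that was read
```

The point is that the filter value picks the directory up front, so the CA and TX data is never touched.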
"What is a bucket, and why would one use buckets rather than partitions? I take it a bucket and a cluster are the same beast, just that you use "CLUSTERED BY" to create the buckets?"
You are correct: buckets and clusters are the same thing, and CLUSTERED BY is how you define them. Buckets are essentially files inside the partition folders, with each bucket stored as one file. You can find the reasoning and the uses for them here:
https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html
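As a rough illustration of how rows end up in bucket files, here is a simplified Python sketch. It assumes a table declared with something like CLUSTERED BY (emp_id) INTO 4 BUCKETS, an integer key whose hash is the value itself, and a hash-modulo assignment. The column name, bucket count, and rows are hypothetical, and the actual hash function depends on the column type and the Hive version.

```python
# Simplified model of bucket assignment; not Hive's actual implementation.
NUM_BUCKETS = 4  # hypothetical, as in CLUSTERED BY (emp_id) INTO 4 BUCKETS


def bucket_for(emp_id, num_buckets=NUM_BUCKETS):
    """Pick a bucket for an integer key (assumption: hash of an int is the int)."""
    # The mask keeps the hash non-negative before taking the modulo.
    return (emp_id & 0x7FFFFFFF) % num_buckets


def bucket_rows(rows, num_buckets=NUM_BUCKETS):
    """Group (emp_id, name) rows into one list per bucket, i.e. one file each."""
    buckets = {b: [] for b in range(num_buckets)}
    for emp_id, name in rows:
        buckets[bucket_for(emp_id, num_buckets)].append((emp_id, name))
    return buckets


# Hypothetical rows that all live in the same partition folder.
rows = [(101, "alice"), (102, "bob"), (105, "carol"), (108, "dave")]
files = bucket_rows(rows)
for b, contents in files.items():
    print(f"bucket file {b}: {contents}")
```

Unlike a partition, which gets its own directory per distinct value, the number of buckets is fixed when the table is created, so a column with many distinct values still produces only that many files per partition.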