Hi,
I am having difficulty understanding the concept of buckets/clusters in Hive.
My understanding so far is that partitioning a table improves query performance: rather than scanning the entire table, Hive reads only the partition of interest. For example, to find employee details where state = 'NYC', it will query just the NYC partition and return the employee details, correct? Each partition is stored in its own directory (holding that partition's data files) in HDFS.
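To make my mental model of partitioning concrete, here is a rough Python sketch. It is not Hive code, and the table contents and function names are made up purely for illustration; it only shows the idea of reading one partition instead of scanning every row.

```python
# Toy model of a partitioned table: one "directory" per value of the
# partition column (state). All names and rows here are hypothetical.
partitions = {
    "NYC": [{"name": "alice", "dept": "sales"}, {"name": "bob", "dept": "hr"}],
    "CA": [{"name": "carol", "dept": "sales"}],
    "TX": [{"name": "dave", "dept": "ops"}],
}


def query_by_state(table, state):
    """Read only the partition for `state`; return (rows, rows_scanned)."""
    rows = table.get(state, [])
    return list(rows), len(rows)


def full_scan_by_state(table, state):
    """Simulate an unpartitioned table: every row is examined."""
    scanned = 0
    matches = []
    for part_state, rows in table.items():
        for row in rows:
            scanned += 1
            if part_state == state:
                matches.append(row)
    return matches, scanned


pruned_rows, pruned_scanned = query_by_state(partitions, "NYC")
full_rows, full_scanned = full_scan_by_state(partitions, "NYC")
print(pruned_scanned, full_scanned)
```

Both calls return the same two rows, but the first one only looks at the NYC rows while the second looks at all of them, which is the saving I assume partitioning gives.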
What is a bucket, and why would one use buckets rather than partitions? I take it that a bucket and a cluster are the same beast, and that you just use CLUSTERED BY to create the buckets?