Support Questions

Find answers, ask questions, and share your expertise

question about clustered only table

avatar
Super Collaborator

I understand a PARTITIONED table with CLUSTERED BY clause will create groups of data files inside each partition but what if the table is created with just CLUSTERED BY into BUCKETS clause ? isnt it the same as a partitioned table since it will group the files together based on the clustered column ?

1 ACCEPTED SOLUTION

avatar
Expert Contributor

Assume below sample data

name,age,dept

'naresh',25,'IT'

'sami',24,'ECE'

Creating a table with partitioned by(name) clustered by(age,dept) into 1 bucket, 2 partition folders inside table path

<table_path>/name=naresh/000000_0

<table_path>/name=sami/000000_0

Whereas incase of clustered by(name,age,dept) into 1 bucket, hashcode(name,age,dept) % Number of buckets

Clustered by scatters the data within buckets based on hashcode, whereas incase of partition by, u will have explicit folder per unique value of a column which gives better performance if filters applied on partition column.

If my response helped your query, accept the answer. It might help others in the community.

View solution in original post

4 REPLIES 4

avatar
Super Collaborator

can anyone address this question ?

avatar
Expert Contributor

Assume below sample data

name,age,dept

'naresh',25,'IT'

'sami',24,'ECE'

Creating a table with partitioned by(name) clustered by(age,dept) into 1 bucket, 2 partition folders inside table path

<table_path>/name=naresh/000000_0

<table_path>/name=sami/000000_0

Whereas incase of clustered by(name,age,dept) into 1 bucket, hashcode(name,age,dept) % Number of buckets

Clustered by scatters the data within buckets based on hashcode, whereas incase of partition by, u will have explicit folder per unique value of a column which gives better performance if filters applied on partition column.

If my response helped your query, accept the answer. It might help others in the community.

avatar
Super Collaborator
you mentioned that "clustered by" scatters the data within bucket but you didn't mention whats the advantage ? and how can it be exploited ?

like partitioning we can give filters in the query . what about clustering ? how can the query benefit from it ?

avatar
Expert Contributor

Bucketed tables allows much more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a section of data for testing and debugging purpose when the original data sets are very huge. False positive is possible on clustered by columns, whereas applying filters on partition column is better in pruning the exact files.

Bucketing tables also helps in map-join, assuming the data is equally distributed. As it eliminates the cost of partitioning the data while joining.

Bucketing is useful with high cardinality columns, as it reduces number of small files & clusters them within specified buckets.

Earlier to HIVE-11525, bucketing wasn't able to filter files if multiple columns are clustered by.