Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

question about clustered only table

SOLVED Go to solution
Highlighted

question about clustered only table

Master Collaborator

I understand a PARTITIONED table with CLUSTERED BY clause will create groups of data files inside each partition but what if the table is created with just CLUSTERED BY into BUCKETS clause ? isnt it the same as a partitioned table since it will group the files together based on the clustered column ?

1 ACCEPTED SOLUTION

Accepted Solutions

Re: question about clustered only table

Expert Contributor

Assume below sample data

name,age,dept

'naresh',25,'IT'

'sami',24,'ECE'

Creating a table with partitioned by(name) clustered by(age,dept) into 1 bucket, 2 partition folders inside table path

<table_path>/name=naresh/000000_0

<table_path>/name=sami/000000_0

Whereas incase of clustered by(name,age,dept) into 1 bucket, hashcode(name,age,dept) % Number of buckets

Clustered by scatters the data within buckets based on hashcode, whereas incase of partition by, u will have explicit folder per unique value of a column which gives better performance if filters applied on partition column.

If my response helped your query, accept the answer. It might help others in the community.

4 REPLIES 4

Re: question about clustered only table

Master Collaborator

can anyone address this question ?

Re: question about clustered only table

Expert Contributor

Assume below sample data

name,age,dept

'naresh',25,'IT'

'sami',24,'ECE'

Creating a table with partitioned by(name) clustered by(age,dept) into 1 bucket, 2 partition folders inside table path

<table_path>/name=naresh/000000_0

<table_path>/name=sami/000000_0

Whereas incase of clustered by(name,age,dept) into 1 bucket, hashcode(name,age,dept) % Number of buckets

Clustered by scatters the data within buckets based on hashcode, whereas incase of partition by, u will have explicit folder per unique value of a column which gives better performance if filters applied on partition column.

If my response helped your query, accept the answer. It might help others in the community.

Re: question about clustered only table

Master Collaborator
you mentioned that "clustered by" scatters the data within bucket but you didn't mention whats the advantage ? and how can it be exploited ?

like partitioning we can give filters in the query . what about clustering ? how can the query benefit from it ?

Re: question about clustered only table

Expert Contributor

Bucketed tables allows much more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a section of data for testing and debugging purpose when the original data sets are very huge. False positive is possible on clustered by columns, whereas applying filters on partition column is better in pruning the exact files.

Bucketing tables also helps in map-join, assuming the data is equally distributed. As it eliminates the cost of partitioning the data while joining.

Bucketing is useful with high cardinality columns, as it reduces number of small files & clusters them within specified buckets.

Earlier to HIVE-11525, bucketing wasn't able to filter files if multiple columns are clustered by.