Support Questions

aliyesami · ‎10-11-2018

I understand a PARTITIONED table with CLUSTERED BY clause will create groups of data files inside each partition but what if the table is created with just CLUSTERED BY into BUCKETS clause ? isnt it the same as a partitioned table since it will group the files together based on the clustered column ?

nramanaiah · ‎10-13-2018

Assume below sample data

name,age,dept

'naresh',25,'IT'

'sami',24,'ECE'

Creating a table with partitioned by(name) clustered by(age,dept) into 1 bucket, 2 partition folders inside table path

<table_path>/name=naresh/000000_0

<table_path>/name=sami/000000_0

Whereas incase of clustered by(name,age,dept) into 1 bucket, hashcode(name,age,dept) % Number of buckets

Clustered by scatters the data within buckets based on hashcode, whereas incase of partition by, u will have explicit folder per unique value of a column which gives better performance if filters applied on partition column.

If my response helped your query, accept the answer. It might help others in the community.

View solution in original post

aliyesami · ‎10-12-2018

can anyone address this question ?

nramanaiah · ‎10-13-2018

Assume below sample data

name,age,dept

'naresh',25,'IT'

'sami',24,'ECE'

Creating a table with partitioned by(name) clustered by(age,dept) into 1 bucket, 2 partition folders inside table path

<table_path>/name=naresh/000000_0

<table_path>/name=sami/000000_0

Whereas incase of clustered by(name,age,dept) into 1 bucket, hashcode(name,age,dept) % Number of buckets

Clustered by scatters the data within buckets based on hashcode, whereas incase of partition by, u will have explicit folder per unique value of a column which gives better performance if filters applied on partition column.

If my response helped your query, accept the answer. It might help others in the community.

aliyesami · ‎10-14-2018

you mentioned that "clustered by" scatters the data within bucket but you didn't mention whats the advantage ? and how can it be exploited ?

like partitioning we can give filters in the query . what about clustering ? how can the query benefit from it ?

nramanaiah · ‎10-15-2018

Bucketed tables allows much more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a section of data for testing and debugging purpose when the original data sets are very huge. False positive is possible on clustered by columns, whereas applying filters on partition column is better in pruning the exact files.

Bucketing tables also helps in map-join, assuming the data is equally distributed. As it eliminates the cost of partitioning the data while joining.

Bucketing is useful with high cardinality columns, as it reduces number of small files & clusters them within specified buckets.

Earlier to HIVE-11525, bucketing wasn't able to filter files if multiple columns are clustered by.

Cloudera Community

Support Questions

question about clustered only table

HBASE truncate table question

NiFi cluster sandbox on Docker

Hive Temporary Tables.

Apache Ranger Admin UI Login Question

How to Create an Iceberg Table with PySpark in Clo...

Cloudera Data Platform (CDP) - Questions & Answers

Common LLAP questions answered

Cloudera DataFlow (CDF) - Questions & Answers

New Cluster Install (Design Question)

Recommendations for Using Dask Cuda Clusters in Cl...