Created 10-11-2018 09:32 PM
I understand a PARTITIONED table with CLUSTERED BY clause will create groups of data files inside each partition but what if the table is created with just CLUSTERED BY into BUCKETS clause ? isnt it the same as a partitioned table since it will group the files together based on the clustered column ?
Created 10-13-2018 04:40 AM
Assume below sample data
name,age,dept
'naresh',25,'IT'
'sami',24,'ECE'
Creating a table with partitioned by(name) clustered by(age,dept) into 1 bucket, 2 partition folders inside table path
<table_path>/name=naresh/000000_0
<table_path>/name=sami/000000_0
Whereas incase of clustered by(name,age,dept) into 1 bucket, hashcode(name,age,dept) % Number of buckets
Clustered by scatters the data within buckets based on hashcode, whereas incase of partition by, u will have explicit folder per unique value of a column which gives better performance if filters applied on partition column.
If my response helped your query, accept the answer. It might help others in the community.
Created 10-12-2018 01:36 PM
can anyone address this question ?
Created 10-13-2018 04:40 AM
Assume below sample data
name,age,dept
'naresh',25,'IT'
'sami',24,'ECE'
Creating a table with partitioned by(name) clustered by(age,dept) into 1 bucket, 2 partition folders inside table path
<table_path>/name=naresh/000000_0
<table_path>/name=sami/000000_0
Whereas incase of clustered by(name,age,dept) into 1 bucket, hashcode(name,age,dept) % Number of buckets
Clustered by scatters the data within buckets based on hashcode, whereas incase of partition by, u will have explicit folder per unique value of a column which gives better performance if filters applied on partition column.
If my response helped your query, accept the answer. It might help others in the community.
Created 10-14-2018 03:06 AM
like partitioning we can give filters in the query . what about clustering ? how can the query benefit from it ?
Created 10-15-2018 06:26 AM
Bucketed tables allows much more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a section of data for testing and debugging purpose when the original data sets are very huge. False positive is possible on clustered by columns, whereas applying filters on partition column is better in pruning the exact files.
Bucketing tables also helps in map-join, assuming the data is equally distributed. As it eliminates the cost of partitioning the data while joining.
Bucketing is useful with high cardinality columns, as it reduces number of small files & clusters them within specified buckets.
Earlier to HIVE-11525, bucketing wasn't able to filter files if multiple columns are clustered by.