
bucketing table with just one bucket vs partionned table

New Contributor

I would like to know whether there are any adverse effects of using partitioned tables with a single-bucket clause, so that only one file is generated per partition. I use this because the major compaction process does not work very well: it does not always leave a single file in the partition. I am using this approach to avoid having many small files in each partition.

1 ACCEPTED SOLUTION

Contributor

If your partitions are not very big, say no more than a couple of million rows each (which seems to be your case: 1 billion rows across 10,000 partitions is roughly 100,000 rows per partition on average), then it is fine to create a single bucket. Also, as long as each file is larger than the HDFS block size, having multiple files does not degrade performance; too many small files below the block size is the real concern.

You should still use compaction, since it makes it easier for Hive to skip a partition altogether.
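As a sketch, on a transactional table a major compaction can be requested per partition like this (the table and partition names here are placeholders):

```sql
-- Request a major compaction for one partition of a transactional table
-- (table and partition names are illustrative placeholders).
ALTER TABLE sales PARTITION (dt = '2023-01-01') COMPACT 'major';

-- Check the state of queued/running compactions.
SHOW COMPACTIONS;
```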

As I said earlier, there is no single best solution. You need to understand how the ad hoc queries are fired and what the common use case is. Only then can you take a specific path, and you might want to run a small POC to do some statistical analysis.

Hope this helps.


3 REPLIES

Contributor

Partitioning and bucketing are both ways to improve Hive performance. Neither is mandatory, but both are good to have. How to partition and bucket depends a lot on what the table looks like: if the table has millions or billions of rows, or is very wide with hundreds of columns, query performance is greatly affected.

To answer your question, the only effect I see is performance degradation. That said, if the table is small (my assumption: ~10-15 million rows), then one bucket versus several will not bring a significant improvement. With many millions of rows, however, it is always good to bucket, so that a query is evaluated only against the rows in one or two buckets, which improves performance.
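For illustration, the effect of bucketing can be seen with Hive's TABLESAMPLE clause, which reads only the requested bucket instead of scanning the whole table (the table and column names below are hypothetical):

```sql
-- Hypothetical table bucketed on user_id into 32 buckets.
-- Only the files belonging to bucket 1 are read.
SELECT COUNT(*)
FROM events TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```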

When the table has billions of rows and is wide as well, it should ideally be both partitioned and bucketed. There is no perfect solution; it always differs depending on the scenario.

Hope this helps. If the comment helps you to find a solution or move forward, please accept it as a solution for other community members.

New Contributor

My table is huge: we have billions of rows and approximately 10,000 partitions. My exact question is about forcing a single bucket in each partition, with a partitioning clause and a CLUSTERED BY (col) INTO 1 BUCKETS clause when creating the table, so that we always end up with only one file per partition.
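For example, the table creation would look something like this (table, column, and partition names are placeholders):

```sql
-- One bucket per partition: each partition is written as a single file.
-- Table, column, and partition names are illustrative placeholders.
CREATE TABLE events (
  user_id BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 1 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```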

 

The other solution is to let the partitions fill up without a bucketing clause at creation, but to compact the table so the partitions do not fill up with files. I don't know which is the better solution. Like you, I think we get degraded performance during loads because we no longer parallelize the writes (only 1 bucket).
