Created 08-20-2020 05:31 AM
I would like to know whether there are any adverse effects of using partitioned tables with a single-bucket clause. The goal is to generate only one file per partition. I use this approach because the major compaction process doesn't work very well, and I want to avoid having many small files in each partition. In practice, major compaction does not always leave only one file in the partition.
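The setup being asked about can be sketched as follows; the table and column names are hypothetical, and the key part is the `INTO 1 BUCKETS` clause that forces a single file per partition:

```sql
-- Minimal sketch of a partitioned, single-bucket transactional table.
-- CLUSTERED BY ... INTO 1 BUCKETS makes each partition write one bucket file.
CREATE TABLE events (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (id) INTO 1 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```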
Created 08-21-2020 12:37 AM
Partitioning and bucketing are both ways to improve Hive performance. Neither is mandatory, but both are good to have. The right partitioning and bucketing scheme depends a lot on what the table looks like: if the table has millions or billions of rows, or is very wide with hundreds of columns, query performance is affected significantly.
To answer your question, the only effect I see is performance degradation. That said, if the table is small (my assumption: ~10-15 million rows), then one bucket versus several will not make a significant difference. But with millions of rows it is always good to bucket, so that a query is evaluated only against the rows in one or two buckets, which improves performance.
When the table has billions of rows and is wide as well, it should ideally be both partitioned and bucketed. There is no perfect solution; it always differs depending on the scenario.
Hope this helps. If the comment helps you find a solution or move forward, please accept it as a solution for other community members.
Created 08-21-2020 08:02 AM
My table is very large: we have billions of rows and approximately 10,000 partitions. My exact question is about forcing a single bucket in each partition, i.e. creating the table with a partitioning clause plus CLUSTERED BY (col) INTO 1 BUCKETS, so that we always end up with only one file per partition.
The other solution is to let the partitions fill up without a bucketing clause at creation, and instead compact the table to avoid accumulating many files in each partition. I don't know which is the best solution. Like you, I think we get degraded performance during loads, because we no longer parallelize (only 1 bucket).
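For reference, the compaction alternative mentioned above can be triggered manually on a transactional table; the table and partition names here are hypothetical:

```sql
-- Request a major compaction for one partition of an ACID table;
-- this merges the partition's base and delta files into a single new base.
ALTER TABLE events PARTITION (event_date = '2020-08-21') COMPACT 'major';

-- Check the status of queued, running, and completed compactions.
SHOW COMPACTIONS;
```

The compaction request is queued and executed asynchronously by the metastore's compactor, which is one reason a partition may temporarily still hold multiple files after the statement returns.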
Created 08-24-2020 02:40 AM
If your partitions are not very big (with 10,000 partitions over roughly 1 billion rows, that works out to about 100,000 rows per partition on average), then it's OK to create a single bucket. Also, as long as each file is larger than the block size, having multiple files doesn't degrade performance; too many small files below the block size is the real concern.
You should use compaction, since it makes it easier for Hive to skip a partition altogether.
As I said earlier, there is no single best solution. You need to understand how the ad hoc queries are fired and what the common use case is. Only then can you take a specific path, and you might want to run a small POC to do some statistical analysis.
Hope this helps.
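To check whether a partition actually suffers from the small-files problem discussed above, its file statistics can be inspected directly (table and partition names are hypothetical, and the `numFiles`/`totalSize` parameters appear only when statistics have been gathered):

```sql
-- The "Partition Parameters" section of the output includes numFiles and
-- totalSize; comparing totalSize / numFiles against the HDFS block size
-- shows whether the partition holds many undersized files.
DESCRIBE FORMATTED events PARTITION (event_date = '2020-08-21');
```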