Created 03-15-2016 03:23 PM
I have been searching on this for a while now but no clear answer yet.
What are the factors to be considered while deciding the number of buckets?
One factor could be the block size itself, since each bucket is a separate file in HDFS and the file size should be at least as large as the block size. The other factor could be the volume of data; in fact, these two factors go together.
At the time of table creation, the data volume may not be known. In such a case, what is the recommended number of buckets? And as more data gets ingested, the number of buckets chosen at the time of table creation may have to be increased. Is it possible to increase the number of buckets periodically so that the data stays balanced?
Overall, what is the best practice?
Created 03-15-2016 03:53 PM
It depends on what you are trying to do.
- One thing buckets are used for is to increase load performance
Essentially, when you load data you often do not want one file per mapper ( especially for partitioned loads, because this results in small files ), and buckets are a good way to define the number of reducers running. So if your cluster has 40 task slots and you want the fastest ORC creation performance possible, you would want 40 buckets ( or DISTRIBUTE BY your data into 40 groups and set the number of reducers to 40; a minimal sketch follows the link below ).
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
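A minimal sketch of the two options above, with made-up table and column names and settings as they were in pre-2.0 Hive:

SET hive.exec.dynamic.partition.mode = nonstrict;  -- both variants load all partitions dynamically

-- Option 1: let bucketing drive the reducer count ( 40 buckets -> 40 reducers )
SET hive.enforce.bucketing = true;
CREATE TABLE sales_orc (id BIGINT, amount DOUBLE)
PARTITIONED BY (load_date STRING)
CLUSTERED BY (id) INTO 40 BUCKETS
STORED AS ORC;
INSERT INTO TABLE sales_orc PARTITION (load_date)
SELECT id, amount, load_date FROM sales_staging;

-- Option 2: no buckets, just force 40 reducers and spread the rows across them
-- ( sales_plain_orc: same DDL as above but without the CLUSTERED BY clause )
SET mapred.reduce.tasks = 40;
INSERT INTO TABLE sales_plain_orc PARTITION (load_date)
SELECT id, amount, load_date FROM sales_staging
DISTRIBUTE BY id;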
- SELECT performance ( predicate pushdown )
Buckets can help with predicate pushdown, since every row belonging to one value of the bucketing column ends up in the same bucket. So if you bucket by 31 days and filter for one day, Hive can more or less disregard the other 30 buckets. Obviously this is not necessarily a good thing, since you often WANT parallel execution, for example for aggregations. So whether it helps depends on your query. It might be better to sort by day and bucket by something like customer id, if you have to have buckets for one of the other reasons.
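For illustration, a hedged sketch of that layout ( hypothetical table and column names ): bucket by customer id so customer lookups can prune buckets, and sort each bucket by day so the sorted files still help date filters.

CREATE TABLE events (customer_id BIGINT, event_day STRING, payload STRING)
CLUSTERED BY (customer_id) SORTED BY (event_day) INTO 32 BUCKETS
STORED AS ORC;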
- Join Performance ( bucket join )
Buckets can lead to efficient joins if both joined tables are bucketed on the join key, since Hive then only needs to join bucket with bucket. This was a big deal in the old days but is not that applicable anymore with cost-based optimization in newer Hive versions ( the optimizer is already very good at choosing map-side vs. shuffle joins, and a bucket join can actually stop it from using the better one ).
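For reference, a rough sketch of the classic setup ( made-up names; the setting is relevant mainly for older Hive versions ): both tables are bucketed on the join key, with bucket counts that are multiples of each other.

CREATE TABLE orders (customer_id BIGINT, total DOUBLE)
CLUSTERED BY (customer_id) INTO 32 BUCKETS STORED AS ORC;
CREATE TABLE customers (customer_id BIGINT, name STRING)
CLUSTERED BY (customer_id) INTO 16 BUCKETS STORED AS ORC;

SET hive.optimize.bucketmapjoin = true;  -- allow the bucket map join
SELECT o.customer_id, c.name, o.total
FROM orders o JOIN customers c ON o.customer_id = c.customer_id;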
- Sampling performance
Some sample operations can get faster with buckets.
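For example ( reusing the hypothetical events table from the sketch above, which is bucketed into 32 buckets on customer_id ), sampling by bucket only has to read one bucket file instead of the whole table:

SELECT COUNT(*) FROM events TABLESAMPLE (BUCKET 1 OUT OF 32 ON customer_id);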
So to summarize: buckets are a bit of an older concept and I wouldn't use them unless I have a clear case for it. The join argument is not that applicable anymore; the increased load performance is also not always relevant, since you normally load single partitions, where a map-only load is often best. Select pushdown can be enhanced but also hindered depending on how you do it, and a SORT BY during load is normally better ( see the document ). And I think sampling is a bit of a niche case.
So all in all, avoid buckets if you don't know well what you are doing. And regarding size? 1-10 blocks is a pretty good size for Hadoop files, and buckets should not be much smaller ( unless you want very fast answer rates on small amounts of data; no rule without exceptions ).
Created 03-30-2016 03:03 PM
Yes, that is the point. You still get 30 files per partition, which should be fine. But loading a couple of terabytes with 30 reducers will take forever ( in contrast to loading a single partition, where 30 writers would be plenty ).
Created 12-05-2017 11:06 PM
Hello @Benjamin,
What would be the best way to convert a very large non-bucketed table into a bucketed table ( 8 buckets, partitioned by date, with about 1700 partitions currently )?
Every time I run my insert query:
INSERT INTO NEW_BUCKETED_TABLE PARTITION (DATE_KEY) SELECT ALL_COLUMNS, PARTITION_COLUMN FROM NON_BUCKETED_TABLE;
it always creates number of reducers = number of buckets, and it starts failing after some time because each reducer's capacity is limited and the data is very large.
How can I tackle this problem?
Thanks,
Manu
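( Not from the thread, just a hedged sketch of one common workaround: split the conversion into smaller date ranges so each run pushes less data through the 8 reducers. Table and column names are taken from the query above; the range is illustrative. )

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO NEW_BUCKETED_TABLE PARTITION (DATE_KEY)
SELECT ALL_COLUMNS, PARTITION_COLUMN
FROM NON_BUCKETED_TABLE
WHERE PARTITION_COLUMN >= '2017-01-01' AND PARTITION_COLUMN < '2017-02-01';
-- repeat with the next date range until all ~1700 partitions are converted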
Created 07-23-2018 08:47 AM
@Benjamin Leonhardi @Manu Batham
We are also facing the same problem; is there any solution to this?
Created 08-09-2018 09:33 AM
The optimal bucket number is ( B * HashTableSize of the table ) / Total memory per node, where B = 1.01, so that each bucket's share of the hash table fits into one node's memory.
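As a hedged worked example with illustrative numbers: if the table's hash table is about 32 GB and each node has 8 GB of memory available for it, the formula gives ( 1.01 * 32 GB ) / 8 GB ≈ 4.04, so round up to 5 buckets, leaving each bucket's hash table a bit under 8 GB.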