Created 03-15-2016 03:23 PM
I have been searching on this for a while now, but have not found a clear answer yet.
What are the factors to be considered while deciding the number of buckets?
One factor could be the block size itself, since each bucket is a separate file in HDFS and the file size should be at least the block size. The other factor could be the volume of data. In fact, these two factors go together.
At the time of table creation, the data volume may not be known. In such a case, what is the recommended number of buckets? And as more data gets ingested, the number of buckets chosen at table creation may have to be increased. Is it possible to increase the number of buckets periodically so that the data stays balanced?
Overall, what is the best practice?
Created 03-15-2016 03:53 PM
It depends on what you are trying to do.
- One thing buckets are used for is to increase load performance
Essentially, when you load data you often do not want a map-only load where each mapper writes its own files (especially for partitioned loads, because this results in lots of small files). Buckets are a good way to define the number of reducers running. So if your cluster has 40 task slots and you want the fastest ORC creation performance possible, you would want 40 buckets (or DISTRIBUTE BY a suitable key and set the number of reducers to 40 yourself).
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
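As a rough sketch of both variants (table and column names here are hypothetical, and the exact settings depend on your Hive version):

-- Option A: bucketed target table, one reducer per bucket
SET hive.enforce.bucketing = true;            -- needed on older Hive releases
CREATE TABLE sales_orc (id BIGINT, amount DOUBLE)
CLUSTERED BY (id) INTO 40 BUCKETS
STORED AS ORC;
INSERT OVERWRITE TABLE sales_orc
SELECT id, amount FROM sales_staging;

-- Option B: no buckets, force a shuffle into 40 reducers yourself
SET mapred.reduce.tasks = 40;
INSERT OVERWRITE TABLE sales_plain
SELECT id, amount FROM sales_staging
DISTRIBUTE BY id;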
- SELECT performance (predicate pushdown)
Buckets can help with predicate pushdown, since every row belonging to one key value ends up in one bucket. So if you bucket by 31 days and filter for one day, Hive will be able to more or less disregard 30 buckets. This is not necessarily a good thing, since you often WANT parallel execution, for example for aggregations. So whether it helps depends on your query. It might be better to sort by day and bucket by something like customer id if you need buckets for one of the other reasons.
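A minimal sketch of that last layout (hypothetical table and column names; 32 buckets is just an example):

CREATE TABLE events (
  customer_id BIGINT,
  event_day   STRING,
  payload     STRING
)
CLUSTERED BY (customer_id)       -- bucket on a high-cardinality key
SORTED BY (event_day)            -- sort inside each bucket by the filter column
INTO 32 BUCKETS
STORED AS ORC;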
- Join performance (bucket join)
Buckets can lead to efficient joins if both joined tables are bucketed on the join key, since Hive then only needs to join bucket with bucket. This was a big deal in the old days, but it is not that relevant anymore with cost-based optimization in newer Hive versions (the optimizer is already very good at choosing between map-side and shuffle joins, and a bucket join can actually stop it from picking the better one).
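If you do want to try it, a bucket map join is usually enabled with settings like the following (a sketch with hypothetical table names; both tables must be bucketed on the join key with compatible bucket counts):

SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;   -- only if both tables are also sorted on the key
SELECT f.*
FROM fact_bucketed f
JOIN dim_bucketed d ON f.cust_id = d.cust_id;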
- Sampling performance
Some sampling operations can be faster on bucketed tables, because TABLESAMPLE on the bucket column only has to read a subset of the buckets.
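For example, with the hypothetical events table above (bucketed by customer_id into 32 buckets), a bucket sample only needs to scan roughly 1/32 of the data:

SELECT COUNT(DISTINCT customer_id)
FROM events TABLESAMPLE (BUCKET 1 OUT OF 32 ON customer_id);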
So to summarize: buckets are a bit of an older concept, and I wouldn't use them unless I had a clear case for it. The join argument is not that relevant anymore, and the increased load performance is not always relevant either, since you normally load single partitions, where a map-only load is often best. Select pushdown can be enhanced but also hindered depending on how you do it, and a SORT BY during load is normally better (see the document). And I think sampling is a bit of a niche case.
So all in all, avoid buckets if you don't know very well what you are doing. Regarding size: 1-10 blocks is a pretty good size for Hadoop files, and buckets should not be much smaller (unless you want very fast answer times on small amounts of data; no rule without exceptions).
Created 03-15-2016 09:44 PM
@Benjamin Leonhardi, on SELECT performance, which version of Hive are you referring to? I believe you are talking about data pruning (I posted a question related to that). On the number of buckets, I am not sure I understood it well. Going by your customer id example for bucketing, how would the number of buckets be decided?
Created 03-16-2016 11:08 AM
Not sure what you mean by data pruning. Hive does partition pruning, which is not this. It currently doesn't do bucket pruning, even though I have seen attempts to implement it. What I mean is predicate pushdown, i.e. the ability of ORC files to skip stripes of data based on min/max values and bloom filters. The difference between pruning (which happens in the optimizer) and predicate pushdown (which happens in the task) is that pushdown still needs to at least schedule the task. If you bucket by day, you know that most buckets can be pretty much skipped completely, even if their tasks finish immediately. However, that often doesn't help, since you want parallel processing (the slowest task defines the query time). So a SORT BY during insert is most of the time better. See the presentation I linked to for details.
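To make that concrete, ORC-level skipping is controlled by reader settings and table properties, roughly like this (a sketch with hypothetical table and column names; orc.bloom.filter.columns requires Hive 1.2 or later):

SET hive.optimize.ppd = true;            -- push predicates into the plan
SET hive.optimize.index.filter = true;   -- let the ORC reader skip row groups via min/max indexes
CREATE TABLE tx_orc (cust_id BIGINT, tx_day STRING, amount DOUBLE)
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns" = "cust_id");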
Regarding the number of buckets, it again depends on what you want. People always want simple rules, but there aren't any; it depends on your data characteristics. For example, a customer id just distributes more or less equally between buckets. You only get advantages during load, since you can decide the number of reducers: if you have 50 slots in YARN, 50 buckets would result in the fastest load. However, if the data volume is too small, that might be bad. Trade-off. Or if you want to sample by customer id, i.e. only query some customers, buckets might help, depending on how much you want to sample. Or if you want a bucket join, they might help. But again, there is no real rule for the number of buckets; it depends on how many map tasks you want doing the join.
Buckets are something you should use for a concrete problem, not just because you think you should have them. Normally I would not use them.
Created 03-25-2016 07:38 PM
@Benjamin Leonhardi I went through your slides and have a few questions around them. Since they are related to bucketing and partitioning, I think it makes sense to continue in the same thread.
Created 03-25-2016 08:09 PM
Hi
Standard load: the daily, hourly, or weekly load you normally do during operations. This is different from an initial load, since it normally loads into one partition.
Mappers/small files: that is correct. If you have lots of target partitions in your data, you get lots of small files (for example, 30 country partitions and 100 blocks of input data will result in 100 x 30 = 3000 target files, each roughly 128 MB / 30, i.e. about 4 MB). That is the reason you should use reducers in that case.
"What is the default key used for distribution if you don’t use the distribute by clause?"
If you don't use that clause, you either have just mappers (and no distribution) or some other part of the query defines the reducer key, like a GROUP BY, a join, or the sort setting discussed.
"Slide 13"
That is just an overview of the three approaches that follow (bucketed, optimized, or manual). The trick is to distribute by a key that fits your partitions, to make sure each reducer only writes to one partition. A hash conflict means that two partition keys may randomly end up in the same reducer. DISTRIBUTE BY essentially distributes the distribute key randomly (hashed) among reducers.
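A minimal sketch of that manual approach, with hypothetical names (dynamic partitioning has to be enabled):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- distribute by the partition key so each reducer writes only its own partition
INSERT OVERWRITE TABLE sales PARTITION (country)
SELECT id, amount, country
FROM sales_staging
DISTRIBUTE BY country;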
"Slide14"
You have one reducer for each bucket in all partitions. This means if you have 3000 partitions the ORC writer needs to keep 3000 buffers. Thats the reason for optimized.
"Optimized"
That is the reason the optimized load is different from the normal bucketed load. You want many reducers (otherwise the load is slow), but not too many (otherwise you get small files). Also, if each reducer writes to only one partition, you get the optimal performance/file-size ratio and can tune the file size through the number of reducers.
"Sort" Not true when you sort you make sure that all values belonging together will end up besides each other and you can skip all blocks but the one containing the value.
Created 03-25-2016 09:47 PM
@Benjamin Leonhardi, Thanks. Just one last set of questions
1. SORT BY only sorts within a reducer. So if you have 10 reducers, the data ends up in 10 different ORC files. If you apply SORT BY on column C1, the same C1 value may still appear in all 10 files unless you also DISTRIBUTE BY C1. But within each of those files, sorting may help to skip blocks. Am I right?
2. Does ORC maintain its index at the block level or the stripe level? (As per slide 6 it looks like the block level, but as per slide 4 it is at the stripe level.) If it is at the stripe level, it can skip a stripe, but if a stripe has to be read, does it have to read the entire stripe?
3. And on "Optimized": I understand it in terms of performance, but it still has more reducers than the normal load, so how does it fix the small-file problem?
4. Maybe PPD is only for the ORC format, but do the other concepts of partitioning, bucketing, and the optimized load apply to other formats as well?
Created 03-26-2016 12:23 AM
"But within each of those files, sorting may help to skip blocks.Am i right?
Yes. Often this is even wanted. For example you have 10 countries and want to filter by them. If you had distributed by the country all rows for that country end up in one file. And the mappers in that file will take a long time because there would be full 256MB blocks of that country. However if you have ten files the data would be aggregated in parallel because each file would have a couple stripes with the country but could skip most of the data in the block.
"Does ORC maintain index at the block level or the stripe level?"
It actually does so sub stripe ( 10000 rows at a time ). It has to read the Stripe footer though.
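The stride is configurable per table if needed (a sketch with hypothetical names):

CREATE TABLE tx_orc_fine (cust_id BIGINT, tx_day STRING, amount DOUBLE)
STORED AS ORC
TBLPROPERTIES ("orc.row.index.stride" = "10000");   -- 10,000 is the default; lower it for finer-grained skipping at the cost of larger indexes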
"3.And on "Optimized", I understand in terms of the performance but still it has more reducers than the normal load, so how does it fix the small file problem?
Which one is normal? You mean bucketed or just in mapper?
Bucketed has one reducer for each bucket. So if you have 30 buckets and 40 partitions you have 1200 files in the end. However you wrote that with 30 reducers which would be slow for large amounts of data.
In optimized load with 30 buckets and 40 partitions you would have 1200 files as well. However you wrote that with 1200 reducers which on a big cluster is 40x faster than before. Also optimized sorts the data so only needs to keep one target at a time in memory. So it uses much less RAM than map or bucketed load. ( Too much memory use can result in OOM errors or in files with small stripes. )
"4. May be PPD is only for ORC format, but the other concepts of partitioning, bucketing, optimized apply to other formats as well?"
True, PPD is only for ORC and Parquet but partitioning bucketing etc. are valid for every format. The only thing is that delimited or Sequence writers do not need to keep a memory buffer for open files so they do not have the same problems with OOM or small stripes.
Created 03-28-2016 01:21 PM
@Benjamin Leonhardi On 3, by normal load I referred to your slides 14 and 15. As per those, if you have 30 buckets and 40 partitions, you would have 30 reducers in total (one reducer per bucket across all partitions). So it's only 30 files versus 1200 files in the optimized case.
That's why I still wonder how it fixes the small-file problem (as per slide 16). At the same time, I understand the points about performance and memory; it is really optimized in those two respects.
Created 03-30-2016 01:24 PM
@Benjamin Leonhardi I think I got it. It's still the same number of files, but with more reducers. In my mind it was always just the buckets, not the partitions. So I thought it's 30 files (30 buckets and 40 partitions), but in fact it's 1200 files in both cases; the optimized load just uses more reducers.