I have read many times that bucketing helps for some queries, especially for map joins, but I found nothing clear about the WHERE clause.
I have a classic use case: a big fact table stored as ORC. The table contains a customer_id with high cardinality (~10 million distinct values). I want to quickly retrieve the sales (facts) for a given customer_id. My data model is:
Table Sales: partitioned by year/month/day; sorted by customer_id; bucketed by customer_id INTO N buckets.
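For reference, this layout could be created with DDL along the following lines (the column names, types, and the bucket count are illustrative, not from my actual schema):

```sql
-- Hypothetical DDL matching the layout described above (N = 32 as an example)
CREATE TABLE sales (
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT, day INT)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;
```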
I have tested with no buckets and with different values of N.
Some queries with "group by customer_id" run faster with bucketing, but a query like "select * from sales where customer_id = XXX" does not seem to benefit from it.
As a user, this seems curious: I thought bucketing would help prune files and therefore speed up the scan.
Any advice? Am I right to conclude that bucketing does not help for such queries?
Conceptually, your query should run faster with bucketing.
You have run set hive.enforce.bucketing=true; right?
Bucketing on customer_id means a query like select * ... where customer_id='xx' should search only 1/N of the buckets (based on the hash of customer_id).
However, this would still search all partitions for that bucket, since you're doing a select *.
Performance for a query like select year from sales where customer_id='xx' would be the best you'd get, as it eliminates source data both by partition and by bucket (it looks into the target partition and, within that partition, into the target bucket).
Can you try checking performance for the above kind of query?
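To compare, you can look at the plans of the two queries with EXPLAIN ('xx' is a placeholder id, not a real value):

```sql
-- Full select: no partition or projection pruning expected
EXPLAIN
SELECT * FROM sales WHERE customer_id = 'xx';

-- Narrow select: check whether the plan shows less data being read
EXPLAIN
SELECT year FROM sales WHERE customer_id = 'xx';
```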
Also, depending on your data, it may not be evenly distributed between buckets. In that case you won't see much improvement.
Please read this once. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
You have to understand what bucketing really does when you are writing new records. For a high-cardinality column, such as customer_id in your case, you can have thousands of customers bucketed into a few files (say, 256 buckets). Now when you run a query, your customer_id is hashed and Hive searches only the bucket where your customer_id is expected to exist, rather than doing a complete table scan. That being said, it still reads the one file which contains your customer_id, and that file can be large.
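If it helps to see the mechanism, you can approximate the bucket assignment yourself with Hive's hash() and pmod() built-ins; this is a sketch of the idea (bucket = hash of the key modulo N), assuming 256 buckets:

```sql
-- Which bucket would a given customer_id land in, for N = 256?
-- (Illustrative only; the exact hash Hive uses depends on the column type.)
SELECT pmod(hash(customer_id), 256) AS bucket_no
FROM sales
WHERE customer_id = 12345;
```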
When you say bucketing doesn't help, I think you are testing at a small scale, or your buckets are not big compared to your cluster size. A 4-node cluster with 50 buckets, but data small enough for four nodes, will run the query in parallel over all the data even without bucketing, and you won't see the benefit. Once you do more complex operations, such as joins between bucketed tables, you'll start seeing the benefits.
So, bucketing definitely helps, but for a simple select, that benefit might not be very visible.
Bucketing is a tricky feature. Users sometimes bucket their tables believing it will solve all performance problems and are subsequently disappointed. Users should be confident they understand all implications of bucketing before they use it, and only use it if other approaches do not work. There are two major reasons to use bucketing:
1) The bucket join. If two large tables are "compatible" in terms of identical partition keys and compatible numbers of buckets, an optimization can kick in to accelerate the join. Note that all other joins will proceed as normal. Since bucketing creates additional files this can harm performance. Consider this only when you have an extremely expensive join and the problem cannot be addressed any other way.
2) Bucket pruning. Bucket pruning was added in Hive 2.X. If you are using Hive 1.X there is no bucket pruning. You should first understand that with ORCFile, there are inline indexes that allow large portions of the file to be skipped if it is known the records cannot match (e.g. in a point lookup). If your table is ordered by customer ID and your point lookup is based on customer ID, minimal I/O is needed to find matching records in ORC and bucketing is not helpful. Bucketing is appropriate if you partition by one key and sort by another, commonly you will sort by a timestamp. If your sort and partition keys do not match, bucket pruning (in Hive 2.X) can help point lookup queries.
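As a hypothetical DDL sketch of that pattern, sorting on a timestamp while bucketing on the lookup key (the names and bucket count here are assumptions, not from the question):

```sql
-- Sorted by event_time so ORC inline indexes help time-range scans,
-- bucketed by customer_id so Hive 2.x bucket pruning can serve
-- point lookups on customer_id.
CREATE TABLE sales (
  customer_id BIGINT,
  event_time  TIMESTAMP,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT, day INT)
CLUSTERED BY (customer_id) SORTED BY (event_time) INTO 64 BUCKETS
STORED AS ORC;
```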
Generally speaking I tell users that bucketing is an optimization of last resort. Usually there are better ways to get what you want.
Thanks all for your answers.
Actually, I have the data both sorted and bucketed by customer_id. The fact is, Predicate Push Down already yields a very good improvement, and my data is (for now) not too big. I also expect some side effects, in particular more, smaller files.
However, in order to use PPD, I need to sort the data within partitions. A global sort ("order by") would definitely be too slow, so I use bucketing to control the number of reducers and find a good compromise between a global sort and writing speed. Would you recommend another approach?
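For what it's worth, the loading approach I describe can be sketched like this: distribute rows across reducers by key and sort within each reducer, avoiding a global ORDER BY (staging_sales and the column list are illustrative):

```sql
-- Per-reducer sort instead of a single-reducer global sort
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE sales PARTITION (year, month, day)
SELECT customer_id, amount, year, month, day
FROM staging_sales
DISTRIBUTE BY customer_id
SORT BY customer_id;
```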
The strange (and perhaps not relevant) fact is that I have another column correlated with customer_id, say customer_name: in my experiments, customer_id and customer_name have the same order, so Predicate Push Down applies to both columns, but bucketing applies only to customer_id. For a large number of buckets, I would expect a lookup by customer_name to scan every bucket and therefore be slower than a lookup on customer_id. This is not what I observed.