Thanks all for your answers. Actually, I have both sorted and bucketed by customer_id. The fact is, Predicate Push Down already yields a very good improvement, and my data is (for now) not too big.I also expect some secondary effects - more smaller files in particular. However, in order to use PPD, I need to sort data by partitions. A global sort ("order by") would be definitly to slow, and I use bucketing to control the numbers of reducers and find a good compromises between a global sort and the writing speed. Would you recommand another approach? The strange (and not relevant) fact is that I have another columns linked to customer_id, say customer_name : for my experiments, customer_id and customer_name have the same order. So Predicate Push down apply to both of the columns, but bucketing only apply to customer_id. For large number of buckets, I would expect that a lookup by customer_name would scan every buckets, and then is lower that a lookup on customer_id. This is not what I observed.
... View more
I read many times that bucketing helps for some queries - especially for map join, but nothing clear about the where clause. I have a classical usecase with a big fact table stored as ORC. The table contains a customer_id with a hight cardinality (~10 millions). I want to quickly retrieve the sales (facts) from a customer_id. My data model is : Table Sales : Partition by year/month/day. Sort by customer_id ; Bucketed by customer_id INTO N buckets I have tested with no buckets, and differents values of N. Some queries with "group by customer_id" run more quickly with bucketing, but a query like "select * from sales where customer_id = XXX" seems to not benefit from bucketing. As a user, this looks a bit curious: I thought bucketing could help pruning files and therefore accelerate scan. Any advice? Am I right to conclude that bucketing do not help for such queries?
... View more