Created 08-14-2017 02:24 PM
Best practices for storing and retrieving unique keys using Hive / ORC.
I have a large data set in the low billions, all data is uniquely identifiable. I will only be retrieving / joining on the unique key.
I would like to know the best way to store this in Hive / ORC. I would assume this is a case for bucketing. However, from reading Hortonwork's documentation at https://community.hortonworks.com/content/supportkb/49637/hive-bucketing-and-partitioning.html I am concerned with the disclaimer "We strongly recommend not to use Bucketing."
Is this an appropriate case for bucketing? Should I be concerned about the disclaimer?
Created 08-14-2017 03:56 PM
I guess the disclaimer in the article is incomplete.
What they would have meant to say is to avoid using Bucketing as a generic solution to all problems.
Bucketing/Partition does not apply to all use cases. Since you will be joining datasets with millions of records, I would say that you should go for bucketing but choose the bucketing criteria and the number of buckets wisely, if needed, even try multiple options.
When we tried it for one of our clients, we tried by bucketing into 11, 13, 17, 23, 29 buckets. Then we picked the one which gave the best result, i think it was 29. Again, it depends on your data.
Once you decide to use bucketing, turn on Sort-Merge-Bucket join to take additional advantage.
Created 08-15-2017 11:19 AM
As a general rule of thumb, use bucketing for high cardinality keys and partitioning for low cardinality. In the case of a unique key you would want to use bucketing.
Created 08-15-2017 01:04 PM
@Scott Shaw Yes, that was blindingly obvious from the quoted article. They even printed it in bold text.
But why the disclaimer "We strongly recommend not to use Bucketing"? That is an unusually direct statement that is not open for interpretation that undermines the entire article.
I am assuming @Dinesh Chitlangia is on the right track when he says it is probably due to poor proofing.
Bucketing ordered unique keys still results in a search. Logarithmic, but a such nonetheless. Is that the best we can do here?
Created 08-15-2017 02:22 PM
@Joe Olson I didn't review the article before posting. My guess is the advice is to avoid bucketing because you will have much better control over performance with partitioning. Most user queries never use a table's unique id in a WHERE clause. Just be careful not to have partitions too small. A GB or more per partition is a good rule of thumb.
Created 08-15-2017 02:05 PM
@
Here you are trying to join large datasets, one of which is low billions