Support Questions

Find answers, ask questions, and share your expertise

Storing (and retrieving by) unique keys in Hive


Best practices for storing and retrieving unique keys using Hive / ORC.

I have a large data set in the low billions, all data is uniquely identifiable. I will only be retrieving / joining on the unique key.

I would like to know the best way to store this in Hive / ORC. I would assume this is a case for bucketing. However, from reading Hortonwork's documentation at I am concerned with the disclaimer "We strongly recommend not to use Bucketing."

Is this an appropriate case for bucketing? Should I be concerned about the disclaimer?


@Joe Olson

I guess the disclaimer in the article is incomplete.

What they would have meant to say is to avoid using Bucketing as a generic solution to all problems.

Bucketing/Partition does not apply to all use cases. Since you will be joining datasets with millions of records, I would say that you should go for bucketing but choose the bucketing criteria and the number of buckets wisely, if needed, even try multiple options.

When we tried it for one of our clients, we tried by bucketing into 11, 13, 17, 23, 29 buckets. Then we picked the one which gave the best result, i think it was 29. Again, it depends on your data.

Once you decide to use bucketing, turn on Sort-Merge-Bucket join to take additional advantage.

@Joe Olson

As a general rule of thumb, use bucketing for high cardinality keys and partitioning for low cardinality. In the case of a unique key you would want to use bucketing.


@Scott Shaw Yes, that was blindingly obvious from the quoted article. They even printed it in bold text.

But why the disclaimer "We strongly recommend not to use Bucketing"? That is an unusually direct statement that is not open for interpretation that undermines the entire article.

I am assuming @Dinesh Chitlangia is on the right track when he says it is probably due to poor proofing.

Bucketing ordered unique keys still results in a search. Logarithmic, but a such nonetheless. Is that the best we can do here?

@Joe Olson I didn't review the article before posting. My guess is the advice is to avoid bucketing because you will have much better control over performance with partitioning. Most user queries never use a table's unique id in a WHERE clause. Just be careful not to have partitions too small. A GB or more per partition is a good rule of thumb.

Expert Contributor


Joe Olson

Here you are trying to join large datasets, one of which is low billions
lets assume table1: record count 800 Million , 20 columns, data size 4GB
table2: record count 4 Million, 10 columns 500 MB

When you are trying to Inner/outer join on these 2 tables
select A.*, B.Col2
from table1 A, table2 B
where A.Col1 = B.Col1;

when we execute such query YARN will join both datasets and generate huge internal dataset which would be greater than 10 - 15 GB

to avoid above scenario, we will split query

get key columns from both tables
Table1 - Column 1
Table2 - Column 2

and we are interseted in getting column2 from table2. This would require to create few temporary tables

Create Table1_temp as
Select Col1 from Table1

Create Table2_temp as
Select Col2 from Table2

Crete Table_columnsfromTable2
Select A.Col1, B.Col2
from table1 A, table2 B
where A.Col1 = B.Col1;

this would certainly generate lesses data set while performing join

Now join Table_columnsfromTable2 temporary table with Table1, to get your desired output
select A.*, B.Col2
from table1 A, Table_columnsfromTable2 B
where A.Col1 = B.Col1;

Tuning of HQL varies for each scenario
In some cases we can partition and have buckets.
Some cases may need to look into if incoming data is sorted, number of files.
If possible create Temporary table.