hive - Deciding if bucketing is needed or not

New Contributor

Hi Geeks,
I need your input on the scenario below.
We have never used bucketing for our Hive tables. We have a table with the structure below, partitioned on transaction_dt, and we are thinking of bucketing it on the shop_id column.
We'll have other tables in the data lake that share the last two columns (transaction_dt and shop_id) and will be joined on shop_id. Each table will be loaded daily with a variable amount of data, and we'll do an INSERT OVERWRITE for each transaction day.

colA,
colB,
colC,
transaction_dt string (this is the partition column)
shop_id string (thinking of bucketing on this; we cannot partition on it because the count is more than 10K)
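
For reference, the layout described above could be declared roughly as follows. This is only a sketch: the table names `shop_transactions` and `staging_transactions`, the STRING types for colA to colC, the example date, and the bucket count of 32 are all placeholders I made up for illustration, not values we have settled on.

```sql
-- Hypothetical table name; colA..colC types are assumed to be STRING.
-- The partition column is declared in PARTITIONED BY, not in the column list.
CREATE TABLE shop_transactions (
  colA    STRING,
  colB    STRING,
  colC    STRING,
  shop_id STRING
)
PARTITIONED BY (transaction_dt STRING)
CLUSTERED BY (shop_id) INTO 32 BUCKETS  -- 32 is a placeholder, not a recommendation
STORED AS PARQUET;

-- Daily load: overwrite a single transaction day.
-- Note: Hive releases before 2.0 need hive.enforce.bucketing=true for the
-- insert to honor the bucket definition.
INSERT OVERWRITE TABLE shop_transactions PARTITION (transaction_dt = '2020-01-01')
SELECT colA, colB, colC, shop_id
FROM staging_transactions               -- hypothetical source table
WHERE transaction_dt = '2020-01-01';
```

The other tables joined on shop_id would presumably need the same CLUSTERED BY column, with the same bucket count or a multiple of it, for a bucketed join to be possible.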

Below are the questions where I need your input.
1. If we bucket on shop_id, will that help performance? We'll be using the transaction_dt and shop_id columns in our WHERE clause, but will bucketing on shop_id actually give us faster retrieval?
2. If we do add bucketing on shop_id, what is the right number of buckets to use?

Note: The tables are created in Parquet format. ORC is not an option because we need to stay consistent with the other tables in the data lake.

1 REPLY

Re: hive - Deciding if bucketing is needed or not

New Contributor

Fellas,
Any input here?
