Support Questions

Find answers, ask questions, and share your expertise

hive - Deciding if bucketing is needed or not

New Contributor

Hi Geeks,
Need yours inputs on below scenario.
We never used bucketing for our hive tables, we have table with below structure where transaction_dt is partitioned and shop_id column we are thinking to have bucket.
We'll be having other tables in data lake where last two columns ( transaction_dt & shop_id) will be same and will be used for join on shop_id. Each table will be loaded daily & will have variable data size on daily basis, we'll be doing insert overwrite on each transaction day.

transaction_dt string ( this is partitioned column)
shop_id string ( thinking to have bucket on this & can not have partition on this as count is more than 10K)

Below are my questions where i need your inputs.
1. If we do bucketing on shop_id , will that help in performance ? We'll be using transaction_date and shop_id columns in our where condition but does bucketing on shop_id will help us for faster retrieval.
2. Lets say if we add bucketing on shop_id , what will be correct no of buckets to use?

Note:- Tables are created in parquet format, ORC is not possible because of consistency with data lake other tables.


New Contributor

Any input here?