Hi Geeks,
Need your inputs on the scenario below.
We have never used bucketing for our Hive tables. We have a table with the structure below, where transaction_dt is the partition column and we are considering bucketing on the shop_id column.
We'll have other tables in the data lake where the last two columns (transaction_dt & shop_id) are the same and will be used for joins on shop_id. Each table will be loaded daily with a variable data volume per day, and we'll be doing an insert overwrite for each transaction day (a rough DDL and daily-load sketch follows the column list below).
colA,
colB,
colC,
transaction_dt string (partition column)
shop_id string (candidate bucketing column; we can't partition on it because it has more than 10K distinct values)
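For reference, this is roughly what we have in mind; it's only a sketch, and the table/staging names, the bucket count of 32, and the example date are placeholders, not decisions we've made:

```sql
-- Sketch of the proposed table: partitioned by transaction_dt, bucketed by shop_id.
-- Table names, the bucket count (32) and the date below are placeholders.
CREATE TABLE IF NOT EXISTS sales_txn (
  colA    STRING,
  colB    STRING,
  colC    STRING,
  shop_id STRING
)
PARTITIONED BY (transaction_dt STRING)
CLUSTERED BY (shop_id) INTO 32 BUCKETS
STORED AS PARQUET;

-- Needed on Hive 1.x so inserts actually honor the bucket spec;
-- Hive 2.x+ always enforces bucketing, so this may be a no-op there.
SET hive.enforce.bucketing = true;

-- Daily load: insert overwrite one transaction_dt partition at a time
-- from a placeholder staging table.
INSERT OVERWRITE TABLE sales_txn PARTITION (transaction_dt = '2024-01-15')
SELECT colA, colB, colC, shop_id
FROM   sales_txn_staging
WHERE  transaction_dt = '2024-01-15';
```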
Below are the questions where I need your inputs.
1. If we bucket on shop_id, will that help performance? We'll be using transaction_dt and shop_id in our where conditions, but will bucketing on shop_id actually give us faster retrieval? (Example query shapes are at the end of this post.)
2. If we do add bucketing on shop_id, what is the correct number of buckets to use?
Note: Tables are created in Parquet format; ORC is not an option because we need consistency with the other tables in the data lake.
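For context on question 1, these are roughly the query shapes we have in mind (sketches only; other_table, the date and the shop id are placeholders):

```sql
-- Point lookup: partition pruning on transaction_dt, then filter on shop_id.
SELECT colA, colB
FROM   sales_txn
WHERE  transaction_dt = '2024-01-15'
  AND  shop_id = 'S123';

-- Join on shop_id with another table that has the same transaction_dt/shop_id columns.
SELECT a.colA, b.colB
FROM   sales_txn a
JOIN   other_table b
  ON   a.shop_id = b.shop_id
WHERE  a.transaction_dt = '2024-01-15'
  AND  b.transaction_dt = '2024-01-15';
```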