In the Hive MapReduce job log, the line "Time taken for load dynamic partitions" shows the time the job spent on the HDFS write. For one of my jobs it shows "Time taken for load dynamic partitions: 54191", and I am not sure whether this value is in milliseconds or seconds. The cumulative time taken by this job is around 3 hours; for performance tuning I want to work out how much of those 3 hours is actually spent on the HDFS write.
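To make the arithmetic concrete, here is a minimal sketch of how the logged value would compare with the total job time. It assumes the figure is in milliseconds, which the log line itself does not state, and it treats the roughly 3 hours as 10800 seconds; the variable names are only illustrative.

```python
# Assumption: the logged value is an elapsed time in milliseconds.
load_dynamic_partitions_ms = 54191  # value copied from the job log
cumulative_job_hours = 3            # approximate total quoted in the post

load_seconds = load_dynamic_partitions_ms / 1000
job_seconds = cumulative_job_hours * 3600
share = load_seconds / job_seconds  # fraction of the job spent in this step

print(f"{load_seconds:.1f} s of {job_seconds} s ({share:.2%})")
# prints: 54.2 s of 10800 s (0.50%)
```

If the value were in seconds instead, it would be about 15 hours, which is longer than the whole job, so the two readings are easy to tell apart with this kind of check.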
Hi Geeks, I need your inputs on the scenario below. We have never used bucketing for our Hive tables. We have a table with the structure below, partitioned on transaction_dt, and we are thinking of bucketing it on the shop_id column. Other tables in the data lake will have the same last two columns (transaction_dt and shop_id) and will be joined on shop_id. Each table will be loaded daily with a variable data size, and we'll do an insert overwrite for each transaction day.
colA,
colB,
colC,
transaction_dt string (this is the partition column)
shop_id string (we are thinking of bucketing on this; we cannot partition on it because there are more than 10K distinct values)

Below are the questions where I need your inputs.
1. If we bucket on shop_id, will that help performance? We'll be using the transaction_dt and shop_id columns in our WHERE conditions, but will bucketing on shop_id make retrieval faster?
2. If we do add bucketing on shop_id, what is the right number of buckets to use?
Note: The tables are created in Parquet format; ORC is not an option because we need to stay consistent with the other tables in the data lake.
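For the bucket-count question, one common rule of thumb (not something confirmed for this workload) is to aim for bucket files near a target size and round the count up to a power of two, so that bucket counts across the joined tables stay multiples of each other. A minimal sketch, where `estimate_bucket_count`, the 256 MiB target, and the 5 GiB example partition are all hypothetical:

```python
import math

def estimate_bucket_count(partition_bytes, target_file_bytes=256 * 1024**2):
    """Rule-of-thumb estimate: partition size / target file size,
    rounded up to the next power of two (at least 1)."""
    raw = max(1, math.ceil(partition_bytes / target_file_bytes))
    # (raw - 1).bit_length() gives the exponent of the next power of two >= raw
    return 1 << (raw - 1).bit_length()

# Hypothetical daily partition of about 5 GiB
print(estimate_bucket_count(5 * 1024**3))  # prints: 32
```

Because the bucket count is fixed for the table while the daily data size varies, the input would have to be a typical or worst-case partition size rather than any single day's.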