We have a Hive table into which data is inserted daily. The incoming data file is usually around 150 GB. By default the job used Tez as the execution engine, and loading the file took around 7-8 hours.
However, when we switched the execution engine to MR and also enabled parallel execution (hive.exec.parallel), the load became much faster: the same data now loads in around 30-45 minutes.
Can anybody help us understand why?
These are the configurations used:
Configuration before, i.e. using Tez:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.auto.convert.join.noconditionaltask.size=340000000;
set hive.tez.container.size=1024;
set tez.runtime.io.sort.mb=410;
set hive.execution.engine=tez;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=160000000;
Configuration after, i.e. using MR (MapReduce):
SET hive.execution.engine=mr;
set hive.exec.parallel=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.merge.tezfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=102400000000;
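To make the two setups easier to compare, here is a minimal Python sketch that diffs the two lists of settings. It is not part of our load job; the helper name `parse_settings` and the variable names are only illustrative, and the values are copied from the two configurations in this post.

```python
def parse_settings(text):
    """Parse 'set key=value;' statements into a {key: value} dict."""
    settings = {}
    for stmt in text.split(";"):
        stmt = stmt.strip()
        if not stmt:
            continue
        # Drop the leading SET/set keyword, then split the key from the value.
        body = stmt.split(None, 1)[1]
        key, value = body.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Settings copied from the Tez configuration in this post.
TEZ_CONF = """
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.auto.convert.join.noconditionaltask.size=340000000;
set hive.tez.container.size=1024;
set tez.runtime.io.sort.mb=410;
set hive.execution.engine=tez;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=160000000;
"""

# Settings copied from the MR configuration in this post.
MR_CONF = """
SET hive.execution.engine=mr;
set hive.exec.parallel=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.merge.tezfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=102400000000;
"""

tez = parse_settings(TEZ_CONF)
mr = parse_settings(MR_CONF)

# Keys that appear in only one of the two configurations.
only_tez = sorted(set(tez) - set(mr))
only_mr = sorted(set(mr) - set(tez))

# Keys present in both configurations but with different values: (tez, mr).
changed = {k: (tez[k], mr[k]) for k in sorted(set(tez) & set(mr)) if tez[k] != mr[k]}

print("Only in Tez config:", only_tez)
print("Only in MR config: ", only_mr)
print("Changed (tez, mr): ", changed)
```

Besides the engine itself, the settings that differ between the two runs are the Tez container and sort-buffer sizes, the map-join size threshold, hive.merge.mapfiles, hive.exec.parallel, and the value of hive.merge.smallfiles.avgsize.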