
Container size calculation for processing a data set ranging from 1 TB to 2 TB


I have set up a 5-node cluster with 64 virtual cores and 156 GB RAM each. I need to run a batch processing job on a data set of up to 2 TB. Is it recommended to use the Avro or Parquet file format to store the data? And should I compress the files with Snappy or LZO?

How many map and reduce tasks are recommended? And how should I choose the container size given the available resources?

Do I need to allocate a very large amount of resources to each container, say 12 virtual cores and 36 GB RAM, or something bigger or smaller than that? I should also consider the amount of time required to complete both the map and reduce tasks with the allocated resources.

Can anyone give me a better suggestion for my use case?

Best Regards,

Radhakrishnan Rk


Expert Contributor
@Radhakrishnan Rk

The yarn-utils script gives you more details about this. Consider the following:

1) Consider giving 80-85% of system resources (vcores and node memory) to YARN.

2) As you are looking to process 1-2 TB of data, you can consider setting the minimum container size to 2 GB (2048 MB) and the maximum container size to 125 GB (128000 MB).

3) It is always recommended to go with ORC and Zlib/Snappy compression.

4) If you are looking to process data with Hive, always consider using the Tez engine, with CBO and vectorization enabled, along with partitioning and bucketing.

5) You don't have to set the container resources to a very high number, as YARN is elastic (it will acquire the resources it needs). For Tez, consider the following properties:

set hive.tez.container.size=18000; 
set tez.runtime.unordered.output.buffer.size-mb=3276;
set hive.optimize.sort.dynamic.partition=true; 
set hive.exec.parallel=true;
set hive.join.cache.size=50000; 
set hive.join.emit.interval=25000;
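Points 1 and 2 can be turned into a quick back-of-the-envelope calculation for this 5-node cluster. The sketch below is illustrative only and is not the actual yarn-utils script; the 85% factor, the 256 MB split size, and the variable names are my assumptions:

```python
# Rough YARN sizing sketch for a 5-node, 156 GB / 64-vcore cluster
# (illustrative assumptions; not the yarn-utils script itself)

NODES = 5
RAM_GB_PER_NODE = 156
VCORES_PER_NODE = 64

# Point 1: give ~85% of each node's resources to YARN
yarn_mem_mb_per_node = int(RAM_GB_PER_NODE * 1024 * 0.85)  # yarn.nodemanager.resource.memory-mb
yarn_vcores_per_node = int(VCORES_PER_NODE * 0.85)         # yarn.nodemanager.resource.cpu-vcores

# Point 2: container size bounds
min_container_mb = 2048     # yarn.scheduler.minimum-allocation-mb
max_container_mb = 128000   # yarn.scheduler.maximum-allocation-mb

# With the suggested hive.tez.container.size of 18000 MB, how many
# such containers fit on each node and across the cluster?
tez_container_mb = 18000
containers_per_node = yarn_mem_mb_per_node // tez_container_mb
cluster_containers = containers_per_node * NODES

# Rough map-task count for a 2 TB input, assuming a 256 MB split size
input_tb = 2
split_mb = 256
map_tasks = (input_tb * 1024 * 1024) // split_mb

print(f"YARN memory per node: {yarn_mem_mb_per_node} MB")
print(f"18 GB Tez containers: {containers_per_node} per node, {cluster_containers} cluster-wide")
print(f"Approx. map tasks for 2 TB at {split_mb} MB splits: {map_tasks}")
```

Under these assumptions you get roughly 7 concurrent 18 GB Tez containers per node (35 cluster-wide) and on the order of 8,000 map tasks for a 2 TB input, which is why a huge per-container allocation like 36 GB / 12 vcores is usually unnecessary.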

Hope this helps.