I have set up a 5-node cluster with 64 virtual cores and 156 GB of RAM per node. I need to run a batch processing job on a data file of up to 2 TB. Is Avro or Parquet the recommended file format for storing the data? And should the file be compressed with Snappy or LZO?
How many map and reduce tasks are recommended? How should I choose the container size given the available resources?
Should I allocate a large amount of resources to each container, say 12 virtual cores and 36 GB of RAM, or should the containers be bigger or smaller than that? I also need to take into account how long the map and reduce tasks take to complete with the allocated resources.
Can anyone give me a better suggestion for my use case?
The yarn-utility script gives you more details about this. Consider the following:
1) Consider giving 80-85% of each node's resources (vcores and memory) to YARN.
2) As you are looking to process 1-2 TB of data, you can set the minimum container size to 2 GB (2048 MB) and the maximum container size to 125 GB (128000 MB).
3) For the file format, it is generally recommended to go with ORC and Zlib or Snappy compression.
4) If you are going to process the data with Hive, always consider using the Tez engine with CBO and vectorization enabled, along with partitioning and bucketing.
5) We don't have to set the container resources to a very high number, because YARN is elastic (it will get the resources required). For Tez, consider the properties below:
set hive.tez.container.size=18000;
set tez.runtime.unordered.output.buffer.size-mb=3276;
set hive.tez.java.opts=-Xmx15000m;
set hive.optimize.sort.dynamic.partition=true;
set tez.runtime.io.sort.mb=13107;
set hive.auto.convert.join=false;
set hive.exec.parallel=true;
set hive.join.cache.size=50000;
set hive.join.emit.interval=25000;
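To make points 1, 2 and 5 a bit more concrete, here is a rough Python sketch of the sizing arithmetic for the cluster in the question. It is only a back-of-the-envelope estimate, not output from the yarn-utility script. The function names are made up for illustration, the 80% share is the low end of the range in point 1, and the rounding step assumes the scheduler rounds each request up to a multiple of the minimum container size (check how your scheduler is configured).

```python
# Rough YARN/Tez sizing estimate for one worker node.
# Node figures come from the question; container sizes come from the answer.
# All names here are hypothetical helpers, not part of any Hadoop tool.

NODES = 5
NODE_VCORES = 64
NODE_MEMORY_MB = 156 * 1024      # 156 GB expressed in MB
YARN_SHARE_PERCENT = 80          # low end of the 80-85% guideline
MIN_CONTAINER_MB = 2048          # minimum container size from point 2
TEZ_CONTAINER_MB = 18000         # hive.tez.container.size
TEZ_HEAP_MB = 15000              # -Xmx value in hive.tez.java.opts


def yarn_share(total, percent):
    """Portion of a node resource handed to YARN, rounded down."""
    return total * percent // 100


def round_up_to_multiple(value, step):
    """Round value up to the next multiple of step.

    Assumption: the scheduler normalizes memory requests this way.
    """
    return -(-value // step) * step


yarn_vcores = yarn_share(NODE_VCORES, YARN_SHARE_PERCENT)
yarn_memory_mb = yarn_share(NODE_MEMORY_MB, YARN_SHARE_PERCENT)

# Memory actually reserved per Tez container after rounding.
granted_mb = round_up_to_multiple(TEZ_CONTAINER_MB, MIN_CONTAINER_MB)

# How many such containers fit on one node and on the whole cluster.
containers_per_node = yarn_memory_mb // granted_mb
containers_in_cluster = containers_per_node * NODES

# Share of the requested container that the JVM heap takes up.
heap_ratio = TEZ_HEAP_MB / TEZ_CONTAINER_MB

print("YARN vcores per node:", yarn_vcores)
print("YARN memory per node (MB):", yarn_memory_mb)
print("Memory granted per Tez container (MB):", granted_mb)
print("Tez containers per node:", containers_per_node)
print("Tez containers in cluster:", containers_in_cluster)
print("Heap / container ratio:", round(heap_ratio, 2))
```

Under these assumptions memory, not vcores, is what limits the number of concurrent Tez containers per node, so changing hive.tez.container.size is the main lever for trading per-task memory against parallelism. The estimate ignores the application master and any other jobs sharing the queue.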
Hope this helps.