Team - I am still trying to grasp the idea of performance tuning YARN in CDH. I have a few doubts that I want to clear up before I actually start looking into the details more. Please keep in mind that I have very minimal knowledge of Java, and I apologize if I'm asking very basic questions. My questions on how to tune YARN are below:
1. What is the difference between an HDFS block and an input split? Is an input split equivalent to a block in MapReduce? I know the concept of 1 block = 1 mapper container, but what exactly distinguishes a block from a split? Please explain in detail. (Snippet 1 below shows the settings I am comparing.)
2. How much spilling in MR jobs is bad? At what number of spilled records does it become a problem? (Snippet 2 below shows how I am checking the counters.)
3. If I have a 128 MB file and my HDFS block size is 64 MB, then I have 2 blocks, which means two mapper containers, and each of those containers has 1 GB allocated to it with 800 MB of heap for the mapper (snippet 3 below shows the settings I mean). Why does a mapper need a 1 GB container to process only 64 MB of data? And why does each 1 GB mapper container need 800 MB of heap?
4. What does it mean when I receive "java.lang.OutOfMemoryError: GC overhead limit exceeded" for a MapReduce job in the map or reduce phase? (Snippet 4 below shows what I would try.)
5. What is the block count issue when I use small files? I know the block storage itself isn't being wasted, but how do small files affect the performance of MR jobs? (Snippet 5 below is how I am counting blocks.)
6. Spark is very confusing as to how it works with YARN. Where do we even begin to fix Spark jobs on YARN? There seems to be no proper book or documentation explaining the Spark-on-YARN architecture. (Snippet 6 below lists the knobs I have found so far.)
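
To make the questions concrete, here are the commands and settings I am referring to. The jar names, class names, job IDs, and paths are all placeholders, and the values are just illustrative - please correct me if I am pointing at the wrong knobs.

Snippet 1 (for question 1): my rough understanding is that the block size is a storage-level setting while the split size is a per-job MapReduce setting. These are the properties I am comparing (the -D flags assume the driver uses ToolRunner/GenericOptionsParser):

    # print the cluster's default HDFS block size, in bytes
    hdfs getconf -confKey dfs.blocksize

    # per-job bounds on input split size for FileInputFormat (values illustrative)
    hadoop jar myjob.jar MyDriver \
        -Dmapreduce.input.fileinputformat.split.minsize=67108864 \
        -Dmapreduce.input.fileinputformat.split.maxsize=134217728 \
        /input /output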
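Snippet 2 (for question 2): this is how I have been reading the spill counters for a finished job. My understanding is that SPILLED_RECORDS being much larger than MAP_OUTPUT_RECORDS means records were written to disk more than once, but I don't know where the line is:

    # read the spill-related counters for a finished job (job id is a placeholder)
    mapred job -counter job_1498765432100_0042 \
        org.apache.hadoop.mapreduce.TaskCounter SPILLED_RECORDS
    mapred job -counter job_1498765432100_0042 \
        org.apache.hadoop.mapreduce.TaskCounter MAP_OUTPUT_RECORDS

    # the sort buffer knobs I believe control spilling (MR2 defaults shown):
    #   mapreduce.task.io.sort.mb=100
    #   mapreduce.task.io.sort.spill.percent=0.80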
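Snippet 3 (for question 3): these are the settings that produce the 1 GB container and 800 MB heap I described:

    # mapreduce.map.memory.mb = YARN container size for each map task, in MB
    # mapreduce.map.java.opts = JVM heap inside that container
    hadoop jar myjob.jar MyDriver \
        -Dmapreduce.map.memory.mb=1024 \
        -Dmapreduce.map.java.opts=-Xmx800m \
        /input /output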
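Snippet 4 (for question 4): this is what I assume I should try when the GC overhead error shows up in the reduce phase - grow the container and keep the heap at roughly 80% of it. Is that ratio right?

    # grow the reduce container and its heap together (values illustrative)
    hadoop jar myjob.jar MyDriver \
        -Dmapreduce.reduce.memory.mb=2048 \
        -Dmapreduce.reduce.java.opts=-Xmx1638m \
        /input /output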
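Snippet 5 (for question 5): this is how I am counting files and blocks to see the small-file ratio (the path is a placeholder):

    # list files and blocks under a directory; the summary at the end
    # reports the total number of files and blocks
    hdfs fsck /user/myuser/data -files -blocks | tail -n 25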
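Snippet 6 (for question 6): these are the only Spark-on-YARN resource knobs I have found so far (class and jar are placeholders, values illustrative) - is this where tuning starts?

    # basic resource settings for a Spark job submitted to YARN
    spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --num-executors 4 \
        --executor-cores 2 \
        --executor-memory 2g \
        --driver-memory 1g \
        --class com.example.MyApp myapp.jar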