We’ve had MapReduce jobs exhaust the temporary (intermediate/spill) space on our data nodes, and we want to prevent a recurrence. The information available online is conflicting, so we have a few questions:
What is the best practice for where data node temp space should reside? For example, should it live on physical disks separate from the HDFS data disks, or can it safely share them?
How much temp space should be reserved for MapReduce, as a percentage of HDFS capacity? Is there a commonly recommended target?
What are the MapReduce best practices around temp space? We can instruct users to shrink their reduce steps, but intermediate data volume is often hard to predict for analytics jobs. Are there configuration settings that can help contain it?
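For context, here is a sketch of the kinds of settings we assume are relevant (property names are from stock Hadoop 2.x / YARN; the values are placeholders for illustration, not our production config):

```xml
<!-- yarn-site.xml: where the NodeManager writes intermediate/spill data.
     Listing dedicated disks here keeps shuffle I/O off the HDFS data disks. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/disk1/yarn/local,/disk2/yarn/local</value>
</property>

<!-- hdfs-site.xml: bytes per volume the DataNode will NOT use for blocks,
     leaving headroom for non-HDFS data such as MR spill files. -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value> <!-- 10 GB per volume, placeholder -->
</property>

<!-- mapred-site.xml: compress map output to shrink spill and shuffle data. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Is this the right set of knobs, or are we missing something?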
How do file sizes affect temp space usage, particularly for compressed files on HDFS?
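To make the compression question concrete, here is our back-of-the-envelope reasoning (the function name and formula are our own illustration, not a documented Hadoop formula): a compressed input decompresses as the mappers read it, so map-side spill can be several times the on-HDFS size.

```python
def estimate_map_spill_bytes(input_bytes, input_compression_ratio,
                             selectivity=1.0, map_output_compression_ratio=1.0):
    """Rough per-job estimate of map-side local temp (spill) volume.

    input_bytes: size of the (possibly compressed) input on HDFS.
    input_compression_ratio: uncompressed/compressed, e.g. 5.0 for ~5:1 gzip.
    selectivity: fraction of the uncompressed input the mappers emit.
    map_output_compression_ratio: uncompressed/compressed for map output
        (1.0 means map output compression is off).
    """
    uncompressed = input_bytes * input_compression_ratio
    map_output = uncompressed * selectivity
    return map_output / map_output_compression_ratio

GB = 1024 ** 3

# A 10 GB gzip input at ~5:1 that the mappers pass through unchanged
# needs roughly 50 GB of local temp, 5x the on-HDFS size:
print(estimate_map_spill_bytes(10 * GB, 5.0) / GB)  # 50.0
```

Is that the right mental model, or are there other map-side effects (merge passes, shuffle copies) that inflate it further?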