We’ve had some MR jobs consume all of the temp space on our data nodes, and we want to prevent that in the future. There is conflicting information on the internet. What is the best practice for where data node temp space should reside? (For example, should it be on physical disks separate from the HDFS disks?) How much temp space should be reserved for MR as a % of HDFS space? (Is there a best-practice target?) What are MR best practices around temp space? (For example, we can instruct users to shrink reduce steps, but this is sometimes hard to predict for analytics jobs. Are there any best-practice configurations that can help?) How do file sizes impact usage of temp space, especially with compressed files on HDFS?
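To make the percentage question concrete, here is a rough sketch of the arithmetic we have in mind for the case where MR temp space shares disks with HDFS. It assumes the reservation would be expressed through `dfs.datanode.du.reserved`, which takes a per-volume value in bytes. The 1000 GB volume size and the 10% figure are placeholders for illustration, not a recommendation, and the helper names are our own.

```python
def reserved_bytes_per_volume(volume_size_gb: float, reserve_pct: float) -> int:
    """Bytes to keep free for non-HDFS use (e.g. MR temp) on one volume."""
    if not 0 <= reserve_pct < 100:
        raise ValueError("reserve_pct must be in [0, 100)")
    volume_bytes = int(volume_size_gb * 1024 ** 3)
    return int(volume_bytes * reserve_pct / 100)


def du_reserved_property(volume_size_gb: float, reserve_pct: float) -> str:
    """Render an hdfs-site.xml property fragment for the computed value."""
    value = reserved_bytes_per_volume(volume_size_gb, reserve_pct)
    return (
        "<property>\n"
        "  <name>dfs.datanode.du.reserved</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Placeholder inputs: 1000 GB volumes, 10% held back for non-HDFS use.
    print(du_reserved_property(1000, 10))
```

Whether a fixed percentage like this is even the right way to size the reservation is part of what we are asking.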
We have a couple of questions around scheduler configuration options. We’ve had a couple of incidents where a single user’s job blocked other users’ jobs by consuming all of the mappers. The jobs finished; however, because those jobs were tied to a web application, the web interface timed out. We are also concerned that in the future large analytical MR jobs could consume CPU and memory and severely impact search jobs that are tied to a web interface, causing further timeouts. Currently we are using the FIFO scheduler. We want to give priority to the web application jobs and limit other analytic jobs. Answers to these specific questions would be helpful: To confirm, with the FIFO scheduler there aren’t any site/cluster configuration options to prevent job waiting / conflict issues? As a workaround, could we limit mappers at run time when submitting large analytic jobs? Does it make sense to transition to the Fair Scheduler or YARN for better multi-tenant Hadoop? If so, given that we are using CDH 4.2.1, what is the upgrade path to the Fair Scheduler or YARN?
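For context on what we mean by prioritizing the web application jobs, this is roughly the pool layout we imagine if we moved to the Fair Scheduler. It is a sketch based on our understanding of the MR1 Fair Scheduler allocation file format (`<allocations>` containing `<pool>` elements with settings such as `minMaps`, `maxMaps`, and `weight`). The pool names `webapp` and `analytics` and all of the numbers are hypothetical, and how jobs get assigned to pools is a separate setting not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical pools: guarantee slots to the web app, cap the analytics jobs.
POOLS = {
    "webapp": {"minMaps": 20, "minReduces": 5, "weight": 2.0},
    "analytics": {"maxMaps": 40, "maxReduces": 10},
}


def build_allocations(pools: dict) -> str:
    """Build allocation-file XML with one <pool> element per entry."""
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        for key, value in settings.items():
            ET.SubElement(pool, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_allocations(POOLS))
```

If a layout like this is the wrong way to protect the web application jobs, that would be useful to know as well.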