Created 08-18-2016 06:07 AM
Each map task has a circular buffer that it writes its output to. The buffer is 100 MB by default (the size can be tuned by changing the mapreduce.task.io.sort.mb property). When the contents of the buffer reach a threshold size, controlled by the mapreduce.map.sort.spill.percent property (default 0.80, i.e. 80%), a background thread starts to spill the contents to disk. Spills are written in round-robin fashion to the directories specified by the mapreduce.cluster.local.dir property, in a job-specific subdirectory.
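To make the numbers above concrete, here is a minimal Python sketch of the arithmetic and of round-robin directory selection. It is only an illustration, not Hadoop's actual code: the function names and the directory paths are hypothetical, and the real implementation may take other factors into account (such as free space) when choosing a directory.

```python
from itertools import cycle

MB = 1024 * 1024

def spill_threshold_bytes(sort_mb=100, spill_percent=0.80):
    """Buffer fill level (in bytes) at which a spill would begin.

    Defaults mirror mapreduce.task.io.sort.mb and
    mapreduce.map.sort.spill.percent as described in the post.
    """
    return round(sort_mb * MB * spill_percent)

def assign_spill_dirs(local_dirs, num_spills):
    """Pick a directory for each spill in plain round-robin order.

    local_dirs stands in for the entries of mapreduce.cluster.local.dir.
    """
    dirs = cycle(local_dirs)
    return [next(dirs) for _ in range(num_spills)]

if __name__ == "__main__":
    # With the defaults, spilling starts at 80 MB of a 100 MB buffer.
    print(spill_threshold_bytes())
    # Hypothetical local directories on the node running the map task.
    print(assign_spill_dirs(["/data1/mr-local", "/data2/mr-local"], 5))
```

With the default settings the threshold works out to 80 MB of the 100 MB buffer, and successive spills alternate between the configured directories.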
Where is the spill directory referenced by the mapreduce.cluster.local.dir property located?
1) Local to the ResourceManager machine
2) Local to the DataNode machine on which the particular map task runs
3) Local to any DataNode machine other than the one on which the particular map task runs
Created 08-18-2016 06:29 AM
2) Local to the DataNode machine on which the particular map task runs.
Please note that the DataNode is only the storage component; the task itself is run by the NodeManager on that slave machine.
A slave node generally has both a DataNode and a NodeManager deployed: the DataNode for storage, and the NodeManager for running Mapper, Reducer, Tez, and other containers.
Created 08-29-2016 01:31 AM
@Fasil Ahamed - Can you please accept the appropriate answer?