Support Questions

MapReduce Spill Directory

New Contributor

Each map task has a circular buffer to which it writes its output. The buffer is 100 MB by default (the size can be tuned by changing the mapreduce.task.io.sort.mb property). When the contents of the buffer reach a threshold, controlled by the mapreduce.map.sort.spill.percent property (default 0.80, i.e. 80%), a background thread starts spilling the contents to disk. Spills are written in round-robin fashion to the directories specified by the mapreduce.cluster.local.dir property, in a job-specific subdirectory.
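To make the numbers above concrete, here is a minimal Python sketch of the spill threshold arithmetic and of round-robin directory selection. It is a simplified model, not Hadoop's actual implementation: the function names and the directory paths are hypothetical and chosen only for illustration.

```python
from itertools import cycle

MB = 1024 * 1024


def spill_threshold_bytes(sort_mb=100, spill_percent=0.80):
    """Bytes of buffered map output at which a background spill starts.

    sort_mb models mapreduce.task.io.sort.mb and spill_percent models
    mapreduce.map.sort.spill.percent (defaults as described above).
    """
    return round(sort_mb * MB * spill_percent)


def assign_spill_dirs(local_dirs, num_spills):
    """Assign each successive spill to a local directory in round-robin order."""
    dirs = cycle(local_dirs)
    return [next(dirs) for _ in range(num_spills)]


# Hypothetical values for mapreduce.cluster.local.dir (assumption, not from the post).
local_dirs = ["/data1/mapred/local", "/data2/mapred/local"]

print(spill_threshold_bytes())           # 80 MB expressed in bytes
print(assign_spill_dirs(local_dirs, 3))  # first dir, second dir, first dir again
```

With the defaults, the spill starts once 80 MB of the 100 MB buffer is filled, and a third spill wraps back to the first configured directory.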

Where is the spill directory specified by the mapreduce.cluster.local.dir property located?

1) Local to the ResourceManager machine
2) Local to the DataNode machine on which the particular map program runs
3) Local to any DataNode machine other than the one on which the particular map program runs

1 ACCEPTED SOLUTION

Super Guru
@Fasil Ahamed

2) Local to the DataNode machine on which the particular map program runs.

Please note that the DataNode is only the storage component; the job's tasks are run by the NodeManager on that particular slave machine.

Generally, a slave node has both a DataNode and a NodeManager deployed: the DataNode for storage, and the NodeManager for running containers (Mapper, Reducer, Tez, etc.).

2 REPLIES

Super Guru

@Fasil Ahamed, can you please accept the appropriate answer?
