New Contributor
Posts: 9
Registered: ‎02-28-2017

How to see Mapreduce Spill Disk Activity

Hi 

 

I have just started learning Hadoop, and I have no idea how to check whether a MapReduce job is spilling to disk. If it is, correct me if I am wrong, but do we have to increase io.sort.mb? Please help me out with this.

 

1. Also, what other parameters in mapred-site.xml and hadoop-env.sh should be checked if there is too much spill?

 

 

 

Posts: 519
Topics: 14
Kudos: 92
Solutions: 45
Registered: ‎09-02-2016

Re: How to see Mapreduce Spill Disk Activity

@matt123

 

Go to http://ipaddress:8088 and check the Cluster Metrics for RAM, container, and vcore usage.

 

Also, click on "Active Nodes" to see the same information per node.

 

Cloudera Manager -> HDFS -> Web UI -> Namenode UI -> See the complete metrics

 

New Contributor
Posts: 9
Registered: ‎02-28-2017

Re: How to see Mapreduce Spill Disk Activity

Thanks for the information.

Are Hadoop metrics collected by default, or do we have to enable them? Could you please tell me?

 

Also, one more quick clarification: if there is too much spill in a MapReduce job, does that mean we have to increase io.sort.mb? If so, what would an ideal value be? Can I start with 1000?

 

 mapreduce.task.io.sort.mb
Posts: 642
Topics: 3
Kudos: 120
Solutions: 67
Registered: ‎08-16-2016

Re: How to see Mapreduce Spill Disk Activity

The best indicators are the job counters. Take a look at FILE: Number of bytes written and Spilled Records, especially in relation to Map output records. If the spilled records are a large fraction of the map output records, you are spilling a lot.
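As a sketch of that check (the counter values below are made up for illustration, not from a real job), the spill ratio can be computed like this:

```python
# Hypothetical counter values as you would read them off a finished job's
# counters page; the names mirror the MapReduce counters mentioned above.
counters = {
    "Map output records": 10_000_000,
    "Spilled Records": 25_000_000,  # > map output records => multiple spill passes
}

# Ratio of spilled records to map output records. A value well above 1.0
# means records were written to disk more than once during the map-side sort.
spill_ratio = counters["Spilled Records"] / counters["Map output records"]
print(f"spill ratio: {spill_ratio:.1f}")
```

A ratio near 1.0 means each record spilled roughly once (a single final spill); much higher, as here, suggests the sort buffer is too small for the map output.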

The first setting below determines how much memory to use for the map-side sort, and the spill percentage is the fraction of that buffer at which spilling to disk starts. You can tweak both to reduce the amount spilled. The io.sort.mb buffer is part of the map heap, so there isn't a clear-cut "it should be X". You can experiment with your job to see how much you can give it without slowing your mappers down. You could also increase the mapper memory as you increase io.sort.mb.

mapreduce.task.io.sort.mb
mapreduce.map.sort.spill.percent
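For reference, here is what setting both would look like in mapred-site.xml (the 256 MB and 0.80 values are purely illustrative, not recommendations):

```xml
<!-- mapred-site.xml: illustrative values only -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value> <!-- sort buffer size in MB; the default is 100 -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value> <!-- start spilling when the buffer is 80% full -->
</property>
```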
Champion
Posts: 768
Registered: ‎05-16-2016

Re: How to see Mapreduce Spill Disk Activity


In mapred-site.xml:

mapreduce.map.memory.mb =
mapreduce.task.io.sort.mb =
New Contributor
Posts: 9
Registered: ‎02-28-2017

Re: How to see Mapreduce Spill Disk Activity


@mbigelow - Could you please clarify this: "You could also increase the mapper memory as you increase the io.sort.mb."

1. Is it mandatory to increase the mapper memory as we increase io.sort.mb? Is there a dependency between them?

2. Say I increase the mapper memory; do I then also have to increase yarn.scheduler.maximum-allocation-mb, because of yarn.nodemanager.vmem-pmem-ratio = 2.1?
yarn.nodemanager.resource.memory-mb = 8192
mapreduce.map.java.opts = 2.5 GB
mapreduce.map.memory.mb = 3 GB
mapreduce.task.io.sort.mb = 4 GB - can I do this?

3. yarn.scheduler.maximum-allocation-mb = 8024 - will I be able to increase it to more than 8 GB if I have enough RAM in my system?

Thanks for the help

New Contributor
Posts: 9
Registered: ‎02-28-2017

Re: How to see Mapreduce Spill Disk Activity

Thanks

Posts: 642
Topics: 3
Kudos: 120
Solutions: 67
Registered: ‎08-16-2016

Re: How to see Mapreduce Spill Disk Activity

yarn.scheduler.maximum-allocation-mb - This is the max memory that a single container can get

yarn.nodemanager.resource.memory-mb  - This is how much memory per NM is allocated for containers

 

I always set yarn.scheduler.maximum-allocation-mb equal to yarn.nodemanager.resource.memory-mb, since the single largest container I could run on a host is the amount of memory on that host allocated to YARN.

You can set yarn.scheduler.maximum-allocation-mb to any value, but as mentioned, it should not exceed what you set for yarn.nodemanager.resource.memory-mb. If it does, it won't harm anything until someone tries to get a container larger than yarn.nodemanager.resource.memory-mb.
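That "set them equal" advice would look like this in yarn-site.xml (8192 is just an example amount of memory given to YARN per node, not a recommendation):

```xml
<!-- yarn-site.xml: example values, assuming 8 GB per node is given to YARN -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- memory on each NodeManager available for containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest single container; kept equal to the above -->
</property>
```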

 

You might be able to set the configuration like this:

 

mapreduce.task.io.sort.mb = 4gb - I can do this . 

 

The issue is that the sort buffer is part of the mapper's heap. For instance, a mapper of 3 GB with a heap of 2.5 GB would mean the sort buffer could quickly fill up the 2.5 GB of heap available. You may not always hit an OOM, but it is likely with such a poor configuration. In summary: yarn.nodemanager.resource.memory-mb > mapreduce.map.memory.mb > mapreduce.task.io.sort.mb. It is not mandatory to increase mapreduce.map.memory.mb if you increase the sort buffer.

 

Let's use another example: say you are using a 4 GB container with a 3.2 GB heap. You are spilling a lot of records because you are still using the default sort buffer size, so you increase it to 1 GB. You have just shrunk the memory available to the rest of your heap from roughly 3.1 GB (3.2 minus the default 100 MB) to 2.2 GB (3.2 - 1). To compensate, you could increase your heap and, along with it, your mapper memory. In this example that would then look like a 5 GB container, a 4.2 GB heap, and a 1 GB sort buffer.
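The arithmetic in that example can be sketched as follows (all sizes in MB; the numbers are the ones from the example above, nothing measured):

```python
# Sizes in MB, mirroring the example: 4 GB container, ~3.2 GB heap.
container_mb = 4096          # mapreduce.map.memory.mb
heap_mb = 3276               # mapreduce.map.java.opts (~3.2 GB, ~80% of container)
default_sort_mb = 100        # default mapreduce.task.io.sort.mb
bigger_sort_mb = 1024        # the increased 1 GB sort buffer

# Heap left for the mapper's own data after the sort buffer is carved out.
before = heap_mb - default_sort_mb   # roughly 3.1 GB
after = heap_mb - bigger_sort_mb     # roughly 2.2 GB

# To keep the same working room, grow the heap (and the container) by about
# the same amount the sort buffer grew: ~5 GB container, ~4.2 GB heap.
new_heap_mb = heap_mb + (bigger_sort_mb - default_sort_mb)
new_container_mb = container_mb + 1024
print(before, after, new_heap_mb, new_container_mb)
```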

New Contributor
Posts: 9
Registered: ‎02-28-2017

Re: How to see Mapreduce Spill Disk Activity


@mbigelow

Thanks for the explanation with the example; it is clear now.

One last clarification:

The default yarn.scheduler.maximum-allocation-mb = 8024 - will I be able to increase it to more than 8 GB if I have enough RAM in my system?

Posts: 642
Topics: 3
Kudos: 120
Solutions: 67
Registered: ‎08-16-2016

Re: How to see Mapreduce Spill Disk Activity

Yes. That setting only puts a cap on how large a single container can be; it does not mean your containers will be that size. yarn.scheduler.minimum-allocation-mb sets the container size if the user does not provide one.