
How to see MapReduce spill disk activity

Explorer

Hi,

I just started learning Hadoop and I have no idea how to check whether a MapReduce job is spilling to disk or not. If it is, correct me if I am wrong, but do we have to increase the io.sort size? Please help me out with this.

1. Also, what other parameters need to be checked in mapred-site.xml and hadoop-env.sh if there is too much spill?


Champion

@matt123

 

Go to http://ipaddress:8088 and check the Cluster Metrics for RAM, container, and vcore usage.

Also, click on "Active Nodes" to see the same information per node.

Cloudera Manager -> HDFS -> Web UI -> NameNode UI -> see the complete metrics.

 

Explorer

Thanks for the information.

Are Hadoop metrics collected by default, or do we have to enable them? Could you please tell me?

Also, one more quick clarification: if there is too much spill in a MapReduce job, does that mean we have to increase io.sort.mb? If so, what would an ideal number be? Can I start with 1000?

mapreduce.task.io.sort.mb

Champion
The best indicators are the job counters. Take a look at FILE: Number of bytes written and Spilled Records, especially in relation to Map output records. If the spilled records are a large portion of the map output records, you are spilling a lot.

The first setting below determines how much memory to use for the map-side sort, and the spill percentage is the fraction of that buffer at which it starts spilling to disk. You can tweak both to reduce the amount spilled. The io.sort.mb buffer is part of the map task's heap, so there isn't a clear-cut "it should be X". You can experiment with your job to see how much you can give it without slowing down your mappers, and you could also increase the mapper memory as you increase io.sort.mb.

mapreduce.task.io.sort.mb
mapreduce.map.sort.spill.percent
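
For reference, a minimal mapred-site.xml sketch of those two properties; the 256 MB buffer and 0.90 spill threshold are only illustrative assumptions (the defaults are 100 MB and 0.80), so tune them against your own job counters.

<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value> <!-- map-side sort buffer in MB; default is 100 -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.90</value> <!-- fraction of the buffer at which spilling to disk starts; default is 0.80 -->
</property>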

Champion

In mapred-site.xml:

mapreduce.map.memory.mb
mapreduce.task.io.sort.mb

Explorer

Thanks

Explorer

@mbigelow 

 

@mbigelow   - Could you please clarify this -  You could also increase the mapper memory as you increase the io.sort.mb. 
 
1 . is it mandatory to increase the mapper memory as we increase io.sort.mb - does it have a  dependencies . 
 
2. Say if I increase the mapper memory then follow up I have to increase the 
yarn.scheduler.maximum-allocation-mb  because of the  yarn.nodemanager.vmem-pmem-ratio = 2.1
yarn.nodemanager.resource.memory.mb = 8192
 
mapreduce.map.java.opts = 2.5GB
mapreduce.map.memory.mb = 3 gb
 
mapreduce.task.io.sort.mb = 4gb - I can do this . 
  
3. yarn.scheduler.maximum-allocation-mb   = 8024 - Will i  be able to increase the more than 8GB if I have enough Ram in my system.
 
Thanks for the help

Champion

yarn.scheduler.maximum-allocation-mb - This is the max memory that a single container can get

yarn.nodemanager.resource.memory-mb  - This is how much memory per NM is allocated for containers

 

I always set yarn.scheduler.maximum-allocation-mb equal to yarn.nodemanager.resource.memory-mb, since the single largest container I could run on a host would be the amount of memory on that host allocated for YARN.

You can set yarn.scheduler.maximum-allocation-mb to any value, but as mentioned it should not exceed what you set for yarn.nodemanager.resource.memory-mb. If it does, it won't harm anything until someone tries to get a container larger than yarn.nodemanager.resource.memory-mb.
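
As a sketch of that rule of thumb in yarn-site.xml, assuming the 8192 MB per NodeManager mentioned above (illustrative values only):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- memory on this node handed to YARN containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest single container; kept equal to the NodeManager allocation -->
</property>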

 

You might be able to set the configuration like this:

 

mapreduce.task.io.sort.mb = 4 GB

 

The issue is that the sort buffer is part of the heap of the mapper. For instance, a mapper of 3 GB with a heap of 2.5 GB would mean that the sort buffer could quickly fill up the 2.5 GB of heap available. You may not always hit an OOM, but it is likely with such a poor configuration. In summary: yarn.nodemanager.resource.memory-mb > mapreduce.map.memory.mb > mapreduce.task.io.sort.mb. It is not mandatory to increase mapreduce.map.memory.mb if you increase the sort buffer.

 

Let's use another example: say you are using a 4 GB container with a 3.2 GB heap. You are spilling a lot of records because you are still using the default sort buffer size, so you increase it to 1 GB. You have just shrunk the available heap memory from roughly 3.1 GB (3.2 - 100 MB) to 2.2 GB (3.2 - 1). To compensate you could increase your heap, and along with that your mapper memory. In this example it would then look like a 5 GB container, a 4.2 GB heap, and a 1 GB sort buffer.
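
In mapred-site.xml terms, that last example would look roughly like this (a sketch; the exact -Xmx value of about 4.2 GB is an assumption):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>5120</value> <!-- 5 GB container for each map task -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx4300m</value> <!-- map task heap, roughly 4.2 GB -->
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>1024</value> <!-- 1 GB sort buffer, carved out of that heap -->
</property>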

Explorer

@mbigelow

Thanks for the explanation with the example; it's clear now.

One last clarification:

The default yarn.scheduler.maximum-allocation-mb = 8024 - will I be able to increase it beyond 8 GB if I have enough RAM in my system?

Champion
Yes. That setting only puts a cap on how large a container can be; it does not mean that your containers will be that size. yarn.scheduler.minimum-allocation-mb will set the container size if one is not provided by the user.
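
For completeness, a yarn-site.xml sketch of both scheduler bounds; the values are illustrative assumptions, with the maximum raised past 8 GB as discussed:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- smallest container the scheduler will allocate -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>16384</value> <!-- cap on a single container; can exceed 8 GB if the nodes have the RAM for YARN -->
</property>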