New Contributor
Posts: 9
Registered: ‎02-28-2017

How to see Mapreduce Spill Disk Activity

Hi 

 

I have just started learning Hadoop, and I have no idea how to check whether a MapReduce job is spilling to disk. If it is, correct me if I am wrong, but do we have to increase io.sort.mb? Please help me out with this.

 

1. Also, what other parameters in mapred-site.xml and hadoop-env.sh should be checked if there is too much spill?

 

 

 

Posts: 519
Topics: 14
Kudos: 92
Solutions: 45
Registered: ‎09-02-2016

Re: How to see Mapreduce Spill Disk Activity

@matt123

 

Go to http://ipaddress:8088 and check the Cluster Metrics for RAM, container, and vcore usage.

 

Also, click on "Active Nodes" to see the same information per node.

 

Cloudera Manager -> HDFS -> Web UI -> Namenode UI -> See the complete metrics

 

New Contributor
Posts: 9
Registered: ‎02-28-2017

Re: How to see Mapreduce Spill Disk Activity

Thanks for the information.

Are Hadoop metrics collected by default, or do we have to enable them? Could you please tell me?

 

Also, one more quick clarification: if there is too much spill in a MapReduce job, does that mean we have to increase io.sort.mb? If so, what would an ideal value be? Can I start with 1000?

 

 mapreduce.task.io.sort.mb
Posts: 642
Topics: 3
Kudos: 120
Solutions: 67
Registered: ‎08-16-2016

Re: How to see Mapreduce Spill Disk Activity

The best indicators are the job counters. Take a look at FILE: Number of bytes written and Spilled Records, especially in relation to Map output records. If the spilled records are a large fraction of the map output records, you are spilling a lot.
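As a sketch of that check (the counter values below are made up for illustration, not from a real job), the spill ratio can be computed like this:

```python
# Hypothetical counter values as you would read them off a finished job's
# counters page; the names mirror the MapReduce counters mentioned above.
counters = {
    "Map output records": 10_000_000,
    "Spilled Records": 25_000_000,  # > map output records => multiple spill passes
}

# Ratio of spilled records to map output records. A value well above 1.0
# means records were written to disk more than once during the map-side sort.
spill_ratio = counters["Spilled Records"] / counters["Map output records"]
print(f"spill ratio: {spill_ratio:.1f}")
```

A ratio near 1.0 means each record spilled roughly once (a single final spill); much higher, as here, suggests the sort buffer is too small for the map output.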

The first setting below determines how much memory to use for the map-side sort, and the spill percentage is the fraction of that buffer at which spilling to disk starts. You can tweak both to reduce the amount spilled. The io.sort.mb buffer is part of the map heap, so there isn't a clear-cut "it should be X". You can experiment with your job to see how much you can give it without slowing your mappers down. You could also increase the mapper memory as you increase io.sort.mb.

mapreduce.task.io.sort.mb
mapreduce.map.sort.spill.percent
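For reference, here is what setting both would look like in mapred-site.xml (the 256 MB and 0.80 values are purely illustrative, not recommendations):

```xml
<!-- mapred-site.xml: illustrative values only -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value> <!-- sort buffer size in MB; the default is 100 -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value> <!-- start spilling when the buffer is 80% full -->
</property>
```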
Champion
Posts: 768
Registered: ‎05-16-2016

Re: How to see Mapreduce Spill Disk Activity


In mapred-site.xml:

mapreduce.map.memory.mb =
mapreduce.task.io.sort.mb =
New Contributor
Posts: 9
Registered: ‎02-28-2017

Re: How to see Mapreduce Spill Disk Activity


@mbigelow - Could you please clarify this: "You could also increase the mapper memory as you increase the io.sort.mb."

1. Is it mandatory to increase the mapper memory as we increase io.sort.mb? Is there a dependency between them?

2. Say I increase the mapper memory; do I then also have to increase yarn.scheduler.maximum-allocation-mb, because of yarn.nodemanager.vmem-pmem-ratio = 2.1?
yarn.nodemanager.resource.memory-mb = 8192
mapreduce.map.java.opts = 2.5 GB
mapreduce.map.memory.mb = 3 GB
mapreduce.task.io.sort.mb = 4 GB - can I do this?

3. yarn.scheduler.maximum-allocation-mb = 8024 - will I be able to increase it to more than 8 GB if I have enough RAM in my system?

Thanks for the help

New Contributor
Posts: 9
Registered: ‎02-28-2017

Re: How to see Mapreduce Spill Disk Activity

Thanks

Posts: 642
Topics: 3
Kudos: 120
Solutions: 67
Registered: ‎08-16-2016

Re: How to see Mapreduce Spill Disk Activity

yarn.scheduler.maximum-allocation-mb - This is the max memory that a single container can get

yarn.nodemanager.resource.memory-mb  - This is how much memory per NM is allocated for containers

 

I always set yarn.scheduler.maximum-allocation-mb equal to yarn.nodemanager.resource.memory-mb, since the single largest container I could run on a host is the amount of memory on that host allocated to YARN.

You can set yarn.scheduler.maximum-allocation-mb to any value, but as mentioned, it should not exceed what you set for yarn.nodemanager.resource.memory-mb. If it does, it won't harm anything until someone tries to get a container larger than yarn.nodemanager.resource.memory-mb.
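That "set them equal" advice would look like this in yarn-site.xml (8192 is just an example amount of memory given to YARN per node, not a recommendation):

```xml
<!-- yarn-site.xml: example values, assuming 8 GB per node is given to YARN -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- memory on each NodeManager available for containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest single container; kept equal to the above -->
</property>
```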

 

You might be able to set the configuration like this:

 

mapreduce.task.io.sort.mb = 4gb - I can do this . 

 

The issue is that the sort buffer is part of the mapper's heap. For instance, a mapper of 3 GB with a heap of 2.5 GB would mean the sort buffer could quickly fill up the 2.5 GB of heap available. You may not always hit an OOM, but it is likely with such a poor configuration. In summary: yarn.nodemanager.resource.memory-mb > mapreduce.map.memory.mb > mapreduce.task.io.sort.mb. It is not mandatory to increase mapreduce.map.memory.mb if you increase the sort buffer.

 

Let's use another example: say you are using a 4 GB container with a 3.2 GB heap. You are spilling a lot of records because you are still using the default sort buffer size, so you increase it to 1 GB. You have just shrunk the memory available to the rest of your heap from roughly 3.1 GB (3.2 minus the default 100 MB) to 2.2 GB (3.2 - 1). To compensate, you could increase your heap and, along with it, your mapper memory. In this example that would then look like a 5 GB container, a 4.2 GB heap, and a 1 GB sort buffer.
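The arithmetic in that example can be sketched as follows (all sizes in MB; the numbers are the ones from the example above, nothing measured):

```python
# Sizes in MB, mirroring the example: 4 GB container, ~3.2 GB heap.
container_mb = 4096          # mapreduce.map.memory.mb
heap_mb = 3276               # mapreduce.map.java.opts (~3.2 GB, ~80% of container)
default_sort_mb = 100        # default mapreduce.task.io.sort.mb
bigger_sort_mb = 1024        # the increased 1 GB sort buffer

# Heap left for the mapper's own data after the sort buffer is carved out.
before = heap_mb - default_sort_mb   # roughly 3.1 GB
after = heap_mb - bigger_sort_mb     # roughly 2.2 GB

# To keep the same working room, grow the heap (and the container) by about
# the same amount the sort buffer grew: ~5 GB container, ~4.2 GB heap.
new_heap_mb = heap_mb + (bigger_sort_mb - default_sort_mb)
new_container_mb = container_mb + 1024
print(before, after, new_heap_mb, new_container_mb)
```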

New Contributor
Posts: 9
Registered: ‎02-28-2017

Re: How to see Mapreduce Spill Disk Activity


@mbigelow

Thanks for the explanation with the example; it is clear now.

One last clarification:

The default yarn.scheduler.maximum-allocation-mb = 8024 - will I be able to increase it to more than 8 GB if I have enough RAM in my system?

Posts: 642
Topics: 3
Kudos: 120
Solutions: 67
Registered: ‎08-16-2016

Re: How to see Mapreduce Spill Disk Activity

Yes. That setting only puts a cap on how large a single container can be; it does not mean your containers will be that size. yarn.scheduler.minimum-allocation-mb sets the container size if the user does not provide one.