Created 02-28-2017 07:57 PM
Hi
Just started learning Hadoop. I have no idea how to check whether a MapReduce job is spilling or not. If it is, and correct me if I am wrong, do we have to increase io.sort.mb? Please help me out with this.
1. Also, what other parameters need to be checked in mapred-site.xml and hadoop-env.sh if there is too much spill?
Created 03-01-2017 12:33 PM
Created 02-28-2017 09:10 PM
Go to http://ipaddress:8088 and check the Cluster Metrics for RAM, container, and vcore usage.
Also click on "Active Nodes" to see the same information per node.
Cloudera Manager -> HDFS -> Web UI -> Namenode UI -> See the complete metrics
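To check spilling on a specific job (a hedged sketch, not from the UI steps above): compare the job's Spilled Records counter against its Map Output Records counter using the mapred CLI. The job ID below is a placeholder.

```shell
# Hedged sketch: inspect spill counters for a finished job.
# <job-id> is a placeholder for your real job ID (e.g. job_14883..._0042).
mapred job -counter <job-id> org.apache.hadoop.mapreduce.TaskCounter MAP_OUTPUT_RECORDS
mapred job -counter <job-id> org.apache.hadoop.mapreduce.TaskCounter SPILLED_RECORDS
# If SPILLED_RECORDS is much larger than MAP_OUTPUT_RECORDS, map output
# is being written to disk multiple times and the sort buffer may be too small.
```

The same counters are also visible per job in the JobHistory web UI.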
Created 02-28-2017 11:05 PM
Thanks for the information.
Are Hadoop metrics collected by default, or do we have to enable them? Could you please tell me.
Also, one more quick clarification: if there is too much spill in a MapReduce job, does it mean we have to increase io.sort.mb? If so, what would an ideal number be? Can I start with 1000?
mapreduce.task.io.sort.mb
Created on 03-01-2017 09:46 PM - edited 03-01-2017 09:54 PM
In mapred-site.xml:
mapreduce.map.memory.mb =
mapreduce.task.io.sort.mb =
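A hedged sketch of how these two properties might look in mapred-site.xml. The values below are illustrative assumptions, not recommendations from this thread; tune them for your workload.

```xml
<!-- Illustrative values only, not a recommendation. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- container size for each map task, in MB -->
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value> <!-- sort buffer, carved out of the mapper's heap -->
</property>
```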
Created 03-01-2017 09:57 PM
Thanks
Created on 03-01-2017 09:57 PM - edited 03-01-2017 09:58 PM
@mbigelow - Could you please clarify this: "You could also increase the mapper memory as you increase the io.sort.mb."
1. Is it mandatory to increase the mapper memory as we increase io.sort.mb? Is there a dependency?
2. Say I increase the mapper memory; do I then have to increase yarn.scheduler.maximum-allocation-mb because of yarn.nodemanager.vmem-pmem-ratio = 2.1?
yarn.nodemanager.resource.memory-mb = 8192
mapreduce.map.java.opts = 2.5 GB
mapreduce.map.memory.mb = 3 GB
mapreduce.task.io.sort.mb = 4 GB - Can I do this?
3. yarn.scheduler.maximum-allocation-mb = 8024 - Will I be able to increase it beyond 8 GB if I have enough RAM in my system?
Thanks for the help
Created 03-01-2017 10:09 PM
yarn.scheduler.maximum-allocation-mb - This is the max memory that a single container can get
yarn.nodemanager.resource.memory-mb - This is how much memory per NM is allocated for containers
I always set yarn.scheduler.maximum-allocation-mb equal to yarn.nodemanager.resource.memory-mb, since the single largest container I could run on a host would be the amount of memory on that host allocated for YARN.
You can set yarn.scheduler.maximum-allocation-mb to any value, though as mentioned it should not exceed what you set for yarn.nodemanager.resource.memory-mb. If it does, it won't harm anything until someone tries to get a container > yarn.nodemanager.resource.memory-mb.
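The convention described above might look like this in yarn-site.xml (a hedged sketch; 8192 is an assumed value mirroring the figures in this thread):

```xml
<!-- Illustrative: cap a single container at the full per-node YARN allotment. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- memory on this node available to containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest single container the scheduler will grant -->
</property>
```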
You might be able to set the configuration like this:
mapreduce.task.io.sort.mb = 4 GB - Can I do this?
The issue is that the sort buffer is part of the mapper's heap. For instance, with a 3 GB mapper and a 2.5 GB heap, a sort buffer that large could quickly fill the 2.5 GB of heap available. You may not always hit an OOM, but it is likely with such a poor configuration. In summary: yarn.nodemanager.resource.memory-mb > mapreduce.map.memory.mb > mapreduce.task.io.sort.mb. It is not mandatory to increase mapreduce.map.memory.mb when you increase the sort buffer.
Let's use another example: say you are using a 4 GB container with a 3.2 GB heap. You are spilling a lot of records because you are still using the default sort buffer size, so you increase it to 1 GB. You have just shrunk the memory available to the rest of your heap from 3.1 GB (3.2 - 100 MB, roughly) to 2.2 GB (3.2 - 1). To compensate you could increase your heap, and along with it your mapper memory. In this example it would then look like a 5 GB container, a 4.2 GB heap, and a 1 GB sort buffer.
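The compensated example above might translate into mapred-site.xml roughly like this (a hedged sketch; the -Xmx value is an assumed rendering of a 4.2 GB heap):

```xml
<!-- Illustrative translation of the 5 GB / 4.2 GB / 1 GB example. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>5120</value> <!-- 5 GB container -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx4300m</value> <!-- ~4.2 GB mapper heap -->
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>1024</value> <!-- 1 GB sort buffer inside that heap -->
</property>
```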
Created on 03-02-2017 01:00 AM - edited 03-02-2017 01:00 AM
Thanks for the explanation with the example; it's clear.
One last clarification
The default yarn.scheduler.maximum-allocation-mb = 8024 - will I be able to increase it beyond 8 GB if I have enough RAM in my system?
Created 03-02-2017 07:44 AM