Jobs fail in Yarn with out of Java heap memory error

Rising Star

We are running Yarn on CDH 5.1 with 14 nodes, each with 6 GB of memory available to Yarn. I understand this is not a lot of memory, but it is all we could put together. Most jobs complete without error, but a few of the larger MapReduce jobs fail with an out of Java heap memory error. The jobs fail on a reduce task that either sorts or groups data. We recently upgraded to CDH 5.1 from CDH 4.7, and ALL of these jobs succeeded on MapReduce v1. Looking in the logs, I see that the application has retried a few times before failing. Can you see anything wrong with the way the resources are configured?

 

Java Heap Size of NodeManager in Bytes: 1 GB
yarn.nodemanager.resource.memory-mb: 6 GB
yarn.scheduler.minimum-allocation-mb: 1 GB
yarn.scheduler.maximum-allocation-mb: 6 GB
yarn.app.mapreduce.am.resource.mb: 1.5 GB
yarn.nodemanager.container-manager.thread-count: 20
yarn.resourcemanager.resource-tracker.client.thread-count: 20
mapreduce.map.memory.mb: 1.5 GB
mapreduce.reduce.memory.mb: 3 GB
mapreduce.map.java.opts: "-Djava.net.preferIPv4Stack=true -Xmx1228m"
mapreduce.reduce.java.opts: "-Djava.net.preferIPv4Stack=true -Xmx2457m"
mapreduce.task.io.sort.factor: 5
mapreduce.task.io.sort.mb: 512 MB
mapreduce.job.reduces: 2
mapreduce.reduce.shuffle.parallelcopies: 4

One thing that might help, Yarn runs 4 containers per node, can this be reduced?
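For context on that last question: the number of concurrent containers on a node is simply how many container requests fit into yarn.nodemanager.resource.memory-mb, so 6 GB per node divided by 1.5 GB map containers gives roughly four. A minimal sketch of the two knobs involved, assuming yarn-site.xml is edited directly rather than through Cloudera Manager; the values below are only illustrative, not a recommendation:

<!-- yarn-site.xml (illustrative values only) -->
<property>
  <!-- Total memory the NodeManager offers to containers on each node -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6144</value>
</property>
<property>
  <!-- Raising the per-container minimum forces fewer, larger containers -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>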

1 ACCEPTED SOLUTION

Rising Star
<name>mapreduce.reduce.java.opts</name>
<value>-Djava.net.preferIPv4Stack=true -Xmx1280m -Xmx825955249</value>

An -Xmx with no unit suffix is read as bytes, and most JVMs resolve duplicate arguments by taking the last one, so this limits the reducer heap to roughly 825 MB (787 MiB), nowhere close to the 3 GB that you intended. You should find out where you set this in CM and change it.

Do that before you play with parallelcopies. But to answer your questions, yes, it'll increase CPU, memory & network usage. And it could lead to more disk spills and slow down your job.
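As a sketch of what the corrected property could look like once the stray second -Xmx is removed, assuming the reducer heap is kept at roughly 80% of the 3 GB container (the same 2457 MB the original configuration intended):

<name>mapreduce.reduce.java.opts</name>
<!-- single -Xmx, sized to fit inside mapreduce.reduce.memory.mb = 3 GB -->
<value>-Djava.net.preferIPv4Stack=true -Xmx2457m</value>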


14 REPLIES

Rising Star

What are your MR1 settings? Did the reducers get -Xmx2457m on MR1 as well?

 

Also, the AM memory at 1.5GB is a bit high. You could probably cut that to 1GB.
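A sketch of what trimming the AM down could look like in mapred-site.xml, with the AM heap kept well inside the container; the exact values are illustrative, not prescriptive:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<property>
  <!-- Keep the ApplicationMaster heap comfortably inside the 1 GB container -->
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx800m</value>
</property>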

Rising Star

Thanks bcwalrus, very good question:

 

In MRv1, we configured the Java Heap Size of TaskTracker in Bytes to 600 MB. Do you think I've set this too high in MRv2?

 

I'll cut the AM memory down to 1 GB; that is good advice and will save me some memory on each node.

 

Kevin

Rising Star
I'm not asking about the heap of the TT process. I'm asking about the -Xmx of the reducers of this particular job (which used to work in MR1 and is failing in MR2).

You said that the reducers are failing due to OOME. They're getting 2457MB in MR2. What did they get in MR1?

Rising Star
I don't think we ever changed the -Xmx on the reducers in MR1, so it would have remained at the default. Do you know what the default is for MR1?

Rising Star
The default in MR1 is unlimited, for both mapred.cluster.max.reduce.memory.mb and mapred.job.reduce.memory.mb. What did you set for mapred.child.java.opts (MR1)? Do you have the job counters from a big MR1 job? It'll tell you the average memory usage across the reducers, which will give you a good idea on what to set for MR2.
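For reference, a sketch of what to look for in the MR1 mapred-site.xml (or on the job's configuration page); the stock Apache default shown here is -Xmx200m, though CDH or an old cluster template may have overridden it:

<!-- mapred-site.xml (MR1); if this was never overridden, the stock
     Apache default of -Xmx200m applied to both map and reduce tasks -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>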

Rising Star
Thanks for your help with this problem, I didn't know the default was unlimited. The max number of reducers for each TT was set at 2. I don't have the job counters from a big MR1 job, but I might be able to look them up. Where would I find them?

Rising Star

From the Yarn logs I can see that Yarn reports a huge amount of virtual memory in use before the job is killed. Why is it using so much virtual memory, and where is that limit set?

 

2014-09-16 10:18:30,803 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 51870 for container-id container_1410882800578_0001_01_000001: 797.0 MB of 2.5 GB physical memory used; 1.8 GB of 5.3 GB virtual memory used
2014-09-16 10:18:33,829 INFO 
...
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1410882800578_0005_01_000048
2014-09-16 10:18:34,431 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=admin IP=192.168.210.251 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1410882800578_0005 CONTAINERID=container_1410882800578_0005_01_000048
2014-09-16 10:18:34,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410882800578_0005_01_000048 transitioned from RUNNING to KILLING
2014-09-16 10:18:34,433 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1410882800578_0005_01_000048
2014-09-16 10:18:34,462 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1410882800578_0005_01_000048 is : 143
2014-09-16 10:18:34,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410882800578_0005_01_000048 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2014-09-16 10:18:34,553 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /space1/yarn/nm/usercache/admin/appcache/application_1410882800578_0005/container_1410882800578_0005_01_000048
2014-09-16 10:18:34,556 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /space2/yarn/nm/usercache/admin/appcache/application_1410882800578_0005/container_1410882800578_0005_01_000048
2014-09-16 10:18:34,558 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=admin OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1410882800578_0005 CONTAINERID=container_1410882800578_0005_01_000048

Rising Star
Virtual memory checking is pointless. Please make sure that yarn.nodemanager.vmem-check-enabled is turned off. The CDH default is off already.

That shouldn't matter though. You said that the job died due to OOME. It didn't die because it got killed by NM.
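For completeness, a sketch of the yarn-site.xml entry if it does need to be set explicitly (CDH's default is already false, so this only matters if something re-enabled the check):

<property>
  <!-- Disable the NodeManager's virtual-memory limit check -->
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>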

Rising Star
Thanks bcwalrus. What if I increased mapreduce.task.io.sort.factor, which is currently set to 5?

Also, do you know if it would help to increase mapreduce.reduce.java.opts.max.heap from the current setting of 787.69 MiB, or would that make no difference?