Jobs fail in YARN with Java heap out-of-memory error
Created on 09-15-2014 08:39 AM - edited 09-16-2022 02:07 AM
We are running YARN on CDH 5.1 with 14 nodes, each with 6 GB of memory. I understand this is not a lot of memory, but it is all we could put together. Most jobs complete without error, but a few of the larger MapReduce jobs fail with a Java out-of-heap-memory error. The jobs fail on a reduce task that either sorts or groups data. We recently upgraded to CDH 5.1 from CDH 4.7, and ALL of these jobs succeeded on MapReduce v1. Looking in the logs, I see that the application retried a few times before failing. Can you see anything wrong with the way the resources are configured?
| Setting | Value |
| --- | --- |
| Java Heap Size of NodeManager in Bytes | 1 GB |
| yarn.nodemanager.resource.memory-mb | 6 GB |
| yarn.scheduler.minimum-allocation-mb | 1 GB |
| yarn.scheduler.maximum-allocation-mb | 6 GB |
| yarn.app.mapreduce.am.resource.mb | 1.5 GB |
| yarn.nodemanager.container-manager.thread-count | 20 |
| yarn.resourcemanager.resource-tracker.client.thread-count | 20 |
| mapreduce.map.memory.mb | 1.5 GB |
| mapreduce.reduce.memory.mb | 3 GB |
| mapreduce.map.java.opts | -Djava.net.preferIPv4Stack=true -Xmx1228m |
| mapreduce.reduce.java.opts | -Djava.net.preferIPv4Stack=true -Xmx2457m |
| mapreduce.task.io.sort.factor | 5 |
| mapreduce.task.io.sort.mb | 512 MB |
| mapreduce.job.reduces | 2 |
| mapreduce.reduce.shuffle.parallelcopies | 4 |
One thing that might help: YARN runs 4 containers per node. Can this be reduced?
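For context on how the numbers above relate: a common CDH rule of thumb is to set each task's -Xmx to about 80% of its container size, leaving headroom for JVM overhead, which is where 2457m comes from for a 3 GB reduce container. A minimal sketch of that pairing, using the standard MRv2 property names and the figures quoted above (illustrative, not a recommendation):

<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>3072</value> <!-- 3 GB reduce container, as configured above -->
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <!-- ~80% of 3072 MB; matches the -Xmx2457m in the table -->
  <value>-Djava.net.preferIPv4Stack=true -Xmx2457m</value>
</property>

On the container count: with yarn.nodemanager.resource.memory-mb at 6 GB and 1.5 GB containers, four concurrent containers per node is exactly what fits; reducing that count means raising the per-container memory sizes (or lowering the NodeManager total).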
Created 09-17-2014 10:12 AM
<value>-Djava.net.preferIPv4Stack=true -Xmx1280m -Xmx825955249</value>
That limits the heap to ~825 MB. Most JVMs resolve duplicate arguments by picking the last one, so this is nowhere close to the 3 GB you intended. You should find out where this is set in CM and change it.
Do that before you play with parallelcopies. But to answer your question: yes, raising it will increase CPU, memory, and network usage, and it could lead to more disk spills and slow down your job.
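A minimal sketch of what the corrected reduce options should resolve to once the stray second -Xmx is removed (the property name is the standard MRv2 one; in Cloudera Manager this value is typically assembled from the Java options and max-heap fields rather than typed in verbatim):

<property>
  <name>mapreduce.reduce.java.opts</name>
  <!-- exactly one -Xmx; last-one-wins is why -Xmx825955249 was
       overriding -Xmx1280m in the broken value above -->
  <value>-Djava.net.preferIPv4Stack=true -Xmx2457m</value>
</property>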
Created 09-15-2014 09:33 AM
What are your MR1 settings? Did reducers also get -Xmx2457m on MR1?
Also, the AM memory at 1.5 GB is a bit high. You could probably cut that to 1 GB.
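A sketch of that trim, assuming the standard MRv2 AM properties and the same ~80% heap-to-container ratio used for the tasks:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value> <!-- down from 1.5 GB -->
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx819m</value> <!-- ~80% of the 1 GB AM container -->
</property>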
Created 09-15-2014 10:30 AM
Thanks bcwalrus, very good question:
In MRv1, we configured the Java Heap Size of TaskTracker in Bytes to 600 MB. Do you think I've set this too high in MRv2?
I'll cut the AM memory down to 1 GB; that is good advice and will save me some memory on each node.
Kevin
Created 09-15-2014 05:25 PM
You said that the reducers are failing due to an OOME. They're getting 2457 MB in MR2. What did they get in MR1?
Created 09-16-2014 10:23 AM
From the YARN logs I can see that YARN reports a huge amount of virtual memory in use before the job is killed. Why is it using so much virtual memory, and where is that limit set?
2014-09-16 10:18:30,803 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 51870 for container-id container_1410882800578_0001_01_000001: 797.0 MB of 2.5 GB physical memory used; 1.8 GB of 5.3 GB virtual memory used
...
2014-09-16 10:18:33,829 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1410882800578_0005_01_000048
2014-09-16 10:18:34,431 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=admin IP=192.168.210.251 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1410882800578_0005 CONTAINERID=container_1410882800578_0005_01_000048
2014-09-16 10:18:34,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410882800578_0005_01_000048 transitioned from RUNNING to KILLING
2014-09-16 10:18:34,433 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1410882800578_0005_01_000048
2014-09-16 10:18:34,462 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1410882800578_0005_01_000048 is : 143
2014-09-16 10:18:34,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410882800578_0005_01_000048 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2014-09-16 10:18:34,553 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /space1/yarn/nm/usercache/admin/appcache/application_1410882800578_0005/container_1410882800578_0005_01_000048
2014-09-16 10:18:34,556 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /space2/yarn/nm/usercache/admin/appcache/application_1410882800578_0005/container_1410882800578_0005_01_000048
2014-09-16 10:18:34,558 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=admin OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1410882800578_0005 CONTAINERID=container_1410882800578_0005_01_000048
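For reference, the 5.3 GB in those monitor lines is not memory the job has allocated; it is the NodeManager's virtual-memory ceiling for the container, computed as the physical allocation times yarn.nodemanager.vmem-pmem-ratio, which defaults to 2.1 (2.5 GB × 2.1 ≈ 5.3 GB). The two properties that control this check, shown with their default values:

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value> <!-- virtual memory allowed per unit of physical memory -->
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value> <!-- set to false to disable virtual-memory kills -->
</property>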
Created 09-16-2014 10:31 AM
That shouldn't matter, though. You said the job died due to an OOME; it didn't die because the NM killed it. (The exit code 143 in the log above just means the container received a SIGTERM when it was stopped, not that the JVM ran out of heap.)
Created 09-16-2014 10:38 AM
Also, do you know whether it would help to increase mapreduce.reduce.java.opts.max.heap from its current setting of 787.69 MiB?
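One observation worth checking (an inference from the numbers, not something confirmed in the thread): 787.69 MiB converts to almost exactly the stray -Xmx825955249 flagged in the accepted answer, so this Cloudera Manager max-heap field looks like the source of that trailing -Xmx. The arithmetic, kept as an XML comment since this is a CM field rather than a plain Hadoop property:

<!--
  825955249 bytes / 1024 / 1024 ≈ 787.69 MiB   (matches the CM display)
  CM appends this max-heap value as a byte-valued -Xmx after
  mapreduce.reduce.java.opts, and the last -Xmx wins.
  Raising it to the intended ~2457 MiB (2457 * 1024 * 1024 =
  2576351232 bytes) would line it up with the 3 GB reduce container.
-->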
