New Contributor
Posts: 2
Registered: ‎07-24-2017

YARN RM crash caused by divide-by-zero error

We encountered a crash in the YARN ResourceManager several days ago:

2018-07-02 22:31:11,722 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.yarn.util.resource.DominantResourceCalculator.computeAvailableContainers(DominantResourceCalculator.java:115)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1546)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1402)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1281)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:815)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1027)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1069)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:691)
        at java.lang.Thread.run(Thread.java:745)
2018-07-02 22:31:11,722 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..

 

Looking at the code (we're using CDH 5.7.3), it looks like someone is requesting a resource with 0 vcores or 0 memory:

113  @Override
114  public int computeAvailableContainers(Resource available, Resource required) {
115    return Math.min(
116        available.getMemory() / required.getMemory(), 
117        available.getVirtualCores() / required.getVirtualCores());
118  }
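
To illustrate the failure mode, here is a minimal standalone sketch. It is not the YARN-3001 patch and not Hadoop code: the class `ContainerMath` and both method names are hypothetical, and plain `int` values stand in for Hadoop's `Resource` so the example runs without Hadoop on the classpath. The guarded variant shows one possible policy (ignore a dimension whose required value is 0); rejecting such requests before they reach the scheduler would be another.

```java
// Hypothetical stand-in for the quoted method; not part of Hadoop.
class ContainerMath {

    // Mirrors the quoted logic: integer division throws
    // ArithmeticException ("/ by zero") if either required value is 0.
    static int unguarded(int availMem, int availVcores, int reqMem, int reqVcores) {
        return Math.min(availMem / reqMem, availVcores / reqVcores);
    }

    // One possible guard (an assumption, not the upstream fix):
    // a dimension with a required value of 0 does not limit the result.
    static int guarded(int availMem, int availVcores, int reqMem, int reqVcores) {
        int byMemory = (reqMem == 0) ? Integer.MAX_VALUE : availMem / reqMem;
        int byVcores = (reqVcores == 0) ? Integer.MAX_VALUE : availVcores / reqVcores;
        return Math.min(byMemory, byVcores);
    }

    public static void main(String[] args) {
        // Normal request: limited by vcores (8 / 2 = 4).
        System.out.println(guarded(8192, 8, 1024, 2));

        // Request with 0 vcores: only memory limits the result (8192 / 1024 = 8).
        System.out.println(guarded(8192, 8, 1024, 0));

        // The unguarded version reproduces the exception from the stack trace.
        try {
            unguarded(8192, 8, 1024, 0);
        } catch (ArithmeticException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
    }
}
```

Note that with this policy a request asking for 0 memory and 0 vcores yields `Integer.MAX_VALUE`, which may or may not be what the scheduler should do, so validating requests earlier may be the safer option.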

However, searching the Apache JIRA turned up only one ticket:

https://issues.apache.org/jira/browse/YARN-3001

A patch was attached to it after the ticket was closed, and there are no further replies apart from the patch author's own comment. Does the patch really fix the problem?

This crash has happened only once since we upgraded to CDH 5.7.3 (about 2 years ago), so even if the patch doesn't work, it will still appear to have fixed the problem.

 

The patch adds a guard in AbstractYarnScheduler#getMaximumResourceCapability. However, the code in this method was refactored after YARN 2.9.0 by this commit (YARN-4719):

https://github.com/apache/hadoop/commit/b4c869309694969cd3f9fda59a6218b32e4d9ece

This refactoring has also been applied to later versions of CDH; for example, I can see the new code in CDH 5.15.

 

I believe the bug has been fixed upstream, since there are no later discussions about it, either by the refactoring commit above or by some later commit. Does anyone know more about this?

 

However, we can't upgrade YARN in the near future, so we still need a patch for CDH 5.7.3. Any advice on the patch?

 

I hope someone in the YARN community can help us. Thanks!

Cloudera Employee
Posts: 264
Registered: ‎01-16-2014

Re: YARN RM crash caused by divide-by-zero error

Hi stigahuang,

 

Cloudera deprecated the Capacity Scheduler in CDH 5.8, as noted in the documentation: deprecated items.

We recommend that you move to the FairScheduler, which we fully test and support. The difference between the upstream Capacity Scheduler (CS) and what is in CDH is large, so it is difficult to say whether a specific change would fix your issue or whether a combination of changes would be needed.

 

Wilfred
