YARN NodeManager fails to start and crashes with SIGBUS

Master Collaborator

Hi,
In CDH 5.12.0 and 5.14.2 (CentOS 6.9), the YARN NodeManager fails to start and crashes with SIGBUS.
Here is the error message:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f4d5b1aff4f, pid=20067, tid=0x00007f4d869dd700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libleveldbjni-64-1-5336493915245210176.8+0x4af4f]  snappy::RawUncompress(snappy::Source*, char*)+0x31f
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /var/run/cloudera-scm-agent/process/14104-yarn-NODEMANAGER/hs_err_pid20067.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Here is the hs_err_pid20067.log file: https://ufile.io/dl8lu

JIRA link: https://issues.apache.org/jira/browse/YARN-8190

1 ACCEPTED SOLUTION

Mentor
The pattern of your issue isn't clear - could you help answer a few more questions?

- Is this consistently occurring on all your NodeManagers?
- Did this start occurring after you upgraded? If yes, what was the earlier version and the upgraded version?
- Did this instead start occurring after an abrupt restart of the daemon or the host?
- Do you have NodeManager logs covering the earliest time period this issue was observed? Could you share those here?

Overall, this appears to be related to the NodeManager's container recovery feature: the data that this feature stores on the NodeManager's local filesystem looks corrupted. You should be able to bypass the issue by moving aside (or removing) the contents of the /var/lib/hadoop-yarn/yarn-nm-recovery/ directory on the affected NodeManagers. This effectively resets the recovery state the NodeManager maintains, which should be OK to do on a NodeManager that is down.
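
If you would rather keep the old state for later inspection than delete it, moving it aside could look roughly like the sketch below. This is only an illustration in Python, not a Cloudera-provided tool: the function name move_recovery_state_aside and the backup location are hypothetical, and it assumes the NodeManager role is already stopped and that the backup root sits outside the recovery directory.

```python
import shutil
import time
from pathlib import Path


def move_recovery_state_aside(recovery_dir, backup_root):
    """Move everything under recovery_dir into a new timestamped backup directory.

    Hypothetical helper for illustration only. Run it only while the
    NodeManager is stopped, and pick a backup_root outside recovery_dir.
    Returns the path of the backup directory that was created.
    """
    recovery = Path(recovery_dir)
    if not recovery.is_dir():
        raise FileNotFoundError(f"no such recovery directory: {recovery}")

    # A timestamped name keeps earlier backups from being overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = Path(backup_root) / f"yarn-nm-recovery.{stamp}"
    backup.mkdir(parents=True, exist_ok=False)

    # Snapshot the listing first, because entries are moved while looping.
    for entry in list(recovery.iterdir()):
        shutil.move(str(entry), str(backup / entry.name))

    return backup
```

For example, move_recovery_state_aside("/var/lib/hadoop-yarn/yarn-nm-recovery", "/var/tmp") would leave the recovery directory itself in place but empty (here /var/tmp is just an assumed location with enough free space). Once the NodeManager starts cleanly, the backup can be deleted.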

Full trace for posterity:

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Get(JLorg/fusesource/leveldbjni/internal/NativeReadOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;J)J+0
j org.fusesource.leveldbjni.internal.NativeDB.get(Lorg/fusesource/leveldbjni/internal/NativeReadOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;)[B+22
j org.fusesource.leveldbjni.internal.NativeDB.get(Lorg/fusesource/leveldbjni/internal/NativeReadOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)[B+10
j org.fusesource.leveldbjni.internal.NativeDB.get(Lorg/fusesource/leveldbjni/internal/NativeReadOptions;[B)[B+20
j org.fusesource.leveldbjni.internal.JniDB.get([BLorg/iq80/leveldb/ReadOptions;)[B+27
j org.fusesource.leveldbjni.internal.JniDB.get([B)[B+26
j org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadVersion()Lorg/apache/hadoop/yarn/server/records/Version;+9
j org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.checkVersion()V+1
j org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(Lorg/apache/hadoop/conf/Configuration;)V+10
j org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(Lorg/apache/hadoop/conf/Configuration;)V+2
j org.apache.hadoop.service.AbstractService.init(Lorg/apache/hadoop/conf/Configuration;)V+80
j org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(Lorg/apache/hadoop/conf/Configuration;)V+98
j org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(Lorg/apache/hadoop/conf/Configuration;)V+20
j org.apache.hadoop.service.AbstractService.init(Lorg/apache/hadoop/conf/Configuration;)V+80
j org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(Lorg/apache/hadoop/conf/Configuration;Z)V+50
j org.apache.hadoop.yarn.server.nodemanager.NodeManager.main([Ljava/lang/String;)V+39
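
To check whether other affected hosts show the same pattern, you could scan their hs_err files for Java frames from the NodeManager recovery package seen in the trace above. The following is a rough Python sketch under that assumption; the helper names java_frames and crashed_in_recovery_store are hypothetical, and the parsing only targets the "j"/"J" frame lines of a HotSpot error report.

```python
# Package prefix taken from the NMLeveldbStateStoreService frames in the trace.
RECOVERY_PACKAGE = "org.apache.hadoop.yarn.server.nodemanager.recovery."


def java_frames(hs_err_text):
    """Return (class_name, method_name) pairs for the Java frame lines.

    Hypothetical helper: it looks at lines that start with "j " or "J "
    and takes the first token that contains a "(" as the method signature.
    """
    frames = []
    for line in hs_err_text.splitlines():
        if not line.startswith(("j ", "J ")):
            continue
        token = next((t for t in line.split() if "(" in t), None)
        if token is None:
            continue
        qualified = token.split("(", 1)[0]  # drop the JVM type signature
        class_name, _, method_name = qualified.rpartition(".")
        frames.append((class_name, method_name))
    return frames


def crashed_in_recovery_store(hs_err_text):
    """True if any Java frame belongs to the NodeManager recovery package."""
    return any(cls.startswith(RECOVERY_PACKAGE) for cls, _ in java_frames(hs_err_text))
```

Reading a file with open(path).read() and passing the text to crashed_in_recovery_store gives a quick yes/no per host. A match only suggests the crash happened while the recovery store was being opened; it does not prove the store is corrupted.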

2 REPLIES

Master Collaborator

Hi @Harsh J

It happens on only one NodeManager. It started suddenly on CDH 5.12.0, without any upgrade, and the issue persisted even after I upgraded to 5.14.2.
Anyway, your solution resolved the issue.

Thank you.