YARN NodeManager fails to start, crashing with SIGBUS
Created on 04-20-2018 04:49 AM - edited 09-16-2022 06:07 AM
Hi,
In CDH 5.12.0 and 5.14.2 (CentOS 6.9), the YARN NodeManager fails to start, crashing with SIGBUS.
Here is the error message:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f4d5b1aff4f, pid=20067, tid=0x00007f4d869dd700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libleveldbjni-64-1-5336493915245210176.8+0x4af4f]  snappy::RawUncompress(snappy::Source*, char*)+0x31f
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /var/run/cloudera-scm-agent/process/14104-yarn-NODEMANAGER/hs_err_pid20067.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Here is the hs_err_pid20067.log file: https://ufile.io/dl8lu
JIRA link: https://issues.apache.org/jira/browse/YARN-8190
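As the crash banner above notes, core dumps were disabled, so only the hs_err file was captured. Here is a minimal sketch of enabling core dumps before the next restart to get more native-level detail, assuming a manually started daemon; the core_pattern value is illustrative, and a Cloudera Manager-supervised process would need the limit raised in the agent's environment instead:

# Check the current core-dump limit for this shell (0 means disabled).
ulimit -c

# Raise it for processes launched from this shell, as the JVM banner suggests.
ulimit -c unlimited

# Optionally direct core files somewhere writable; the pattern is illustrative.
sudo sysctl -w kernel.core_pattern=/tmp/core.%e.%p

# Then restart the NodeManager from this same shell so it inherits the limit.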
Created 05-17-2018 01:12 AM
- Is this consistently occurring on all your NodeManagers?
- Did this start occurring after you upgraded? If yes, what was the earlier version and the upgraded version?
- Did this instead start occurring after an abrupt restart of the daemon or the host?
- Do you have NodeManager logs covering the earliest time period this issue was observed? Could you share those here?
Overall this appears to be related to the NodeManager's container recovery feature: the data this feature stores in the NodeManager's local filesystem has become corrupted. You should be able to bypass the issue by moving (or removing) the contents of the /var/lib/hadoop-yarn/yarn-nm-recovery/ directory on the affected NodeManagers, as sketched below. This effectively resets the maintained state, which is safe to do on a NodeManager that is down.
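A minimal sketch of that reset, assuming the default recovery directory above and that the NodeManager role has already been stopped; the backup suffix and the yarn:yarn ownership are illustrative assumptions:

# Run on the affected NodeManager host, with the NodeManager stopped.
NM_RECOVERY_DIR=/var/lib/hadoop-yarn/yarn-nm-recovery

# Move the (presumably corrupted) state store aside rather than deleting it,
# so it can still be inspected or restored later. The backup suffix is illustrative.
sudo mv "${NM_RECOVERY_DIR}" "${NM_RECOVERY_DIR}.corrupt.$(date +%Y%m%d%H%M%S)"

# Recreate an empty directory; the yarn user/group ownership is an assumption
# and should match whatever owns the NodeManager's other local directories.
sudo install -d -o yarn -g yarn "${NM_RECOVERY_DIR}"

# Start the NodeManager again; it will initialize a fresh leveldb state store.

On the next startup the NodeManager rebuilds its leveldb state store from scratch, which is why this should only be done while the daemon is down.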
Full trace for posterity:
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Get(JLorg/fusesource/leveldbjni/internal/NativeReadOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;J)J+0
j org.fusesource.leveldbjni.internal.NativeDB.get(Lorg/fusesource/leveldbjni/internal/NativeReadOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;)[B+22
j org.fusesource.leveldbjni.internal.NativeDB.get(Lorg/fusesource/leveldbjni/internal/NativeReadOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)[B+10
j org.fusesource.leveldbjni.internal.NativeDB.get(Lorg/fusesource/leveldbjni/internal/NativeReadOptions;[B)[B+20
j org.fusesource.leveldbjni.internal.JniDB.get([BLorg/iq80/leveldb/ReadOptions;)[B+27
j org.fusesource.leveldbjni.internal.JniDB.get([B)[B+26
j org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadVersion()Lorg/apache/hadoop/yarn/server/records/Version;+9
j org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.checkVersion()V+1
j org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(Lorg/apache/hadoop/conf/Configuration;)V+10
j org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(Lorg/apache/hadoop/conf/Configuration;)V+2
j org.apache.hadoop.service.AbstractService.init(Lorg/apache/hadoop/conf/Configuration;)V+80
j org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(Lorg/apache/hadoop/conf/Configuration;)V+98
j org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(Lorg/apache/hadoop/conf/Configuration;)V+20
j org.apache.hadoop.service.AbstractService.init(Lorg/apache/hadoop/conf/Configuration;)V+80
j org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(Lorg/apache/hadoop/conf/Configuration;Z)V+50
j org.apache.hadoop.yarn.server.nodemanager.NodeManager.main([Ljava/lang/String;)V+39
Created 05-18-2018 07:18 AM
Hi @Harsh J
It happened on only one NodeManager, and it started suddenly, without any upgrade, on CDH 5.12.0; even after upgrading to 5.14.2 the issue persisted.
In any case, your solution resolved the issue.
Thank you.
