04-23-2019 10:57 AM
hello folks,
the NodeManager has suddenly stopped on one instance (while still running on the other nodes/instances).
So when I try to start/restart it via Cloudera Manager, an error is shown in the first step:
Failed to start role.
I'm using CentOS release 6.10 (Final).
Please, what do you suggest I look at or check in order to resolve this problem?
Here is my stdout log:
Tue Apr 23 10:18:56 PDT 2019
JAVA_HOME=/usr/java/jdk.1.8.0_144
using /usr/java/jdk.1.8.0_144 as JAVA_HOME
using 5 as CDH_VERSION
using /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/hadoop-yarn as CDH_YARN_HOME
using /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/hadoop-mapreduce as CDH_MR2_HOME
using /var/run/cloudera-scm-agent/process/23960-yarn-NODEMANAGER as CONF_DIR
CONF_DIR=/var/run/cloudera-scm-agent/process/23960-yarn-NODEMANAGER
CMF_CONF_DIR=/etc/cloudera-scm-agent
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x00007f8c1fde51a1, pid=3004, tid=0x00007f8c4f44c700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libleveldbjni-64-1-8170950501904951615.8+0x491a1]  leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x191
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
And this is my log.out error:
Node Manager health check script is not available or doesn't have execute permission, so not starting the node health script runner.
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServicesEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncherEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher
04-23-2019 11:44 AM
"SIGBUS (0x7)" can mean a few things, but one of the most common causes is that a directory Java needs to use is full (no free disk space left).
The fact that your NodeManager was running, then failed, and then failed to start again supports that type of cause.
Since the crash is in libleveldbjni, that adds evidence that a directory may be full, since it indicates Java was accessing local files (on disk).
I would suggest checking disk space on all volumes on that host. If there is a volume that is full, try freeing up some space and starting the NodeManager again.
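For example, a quick way to check all mounted filesystems and flag nearly full ones (the 95% threshold here is just an illustrative value, not anything YARN-specific):

```shell
# Show human-readable usage for every mounted filesystem
df -h

# Print any filesystem at or above 95% capacity
# (df -P gives stable POSIX columns; $5 is Use%, $6 is the mount point)
df -P | awk 'NR > 1 && int($5) >= 95 { print $6 " is " $5 " full" }'
```

Pay particular attention to the volumes holding the NodeManager local/log directories and /var/lib on that host.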
04-24-2019 02:27 AM
thank you for your feedback and your clear explanation,
in fact, the problem was resolved by removing the contents of the /var/lib/hadoop-yarn/yarn-nm-recovery/ directory, and then the NodeManager role started successfully.
the solution I found was from :
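For anyone hitting the same crash, the cleanup above can be sketched roughly as follows. This is a sketch, not an official procedure: stop the NodeManager role in Cloudera Manager first, and it is safer to move the recovery state store aside (so it can be restored) than to delete it outright. The path is the one from this thread; adjust it if your cluster uses a different yarn.nodemanager.recovery.dir.

```shell
# Recovery state-store path used in this thread; may differ on your cluster
NM_RECOVERY_DIR=/var/lib/hadoop-yarn/yarn-nm-recovery

if [ -d "$NM_RECOVERY_DIR" ]; then
  # Move the (possibly corrupt) LevelDB state aside instead of deleting it
  mv "$NM_RECOVERY_DIR" "${NM_RECOVERY_DIR}.bak.$(date +%Y%m%d)"
  # Recreate an empty recovery directory for the NodeManager to repopulate
  mkdir -p "$NM_RECOVERY_DIR"
fi
# Then start the NodeManager role again from Cloudera Manager
```

Note that clearing this directory discards NodeManager recovery state, so running containers on that node (if any) would not be recovered after restart.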