
Failed to start role -YARN- NodeManager (node)

Rising Star

hello folks,

the NodeManager has suddenly stopped on one instance (while still running on the other nodes/instances).
When I try to start/restart it via Cloudera Manager, an error is shown at the first step:

Failed to start role.

I'm using CentOS release 6.10 (Final).
What do you suggest I look at or check in order to resolve this problem?


here's my stdout log:

Tue Apr 23 10:18:56 PDT 2019
JAVA_HOME=/usr/java/jdk.1.8.0_144
using /usr/java/jdk.1.8.0_144 as JAVA_HOME
using 5 as CDH_VERSION
using /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/hadoop-yarn as CDH_YARN_HOME
using /opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/hadoop-mapreduce as CDH_MR2_HOME
using /var/run/cloudera-scm-agent/process/23960-yarn-NODEMANAGER as CONF_DIR
CONF_DIR=/var/run/cloudera-scm-agent/process/23960-yarn-NODEMANAGER
CMF_CONF_DIR=/etc/cloudera-scm-agent
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f8c1fde51a1, pid=3004, tid=0x00007f8c4f44c700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libleveldbjni-64-1-8170950501904951615.8+0x491a1]  leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x191
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
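As the JVM message itself notes, core dumps were disabled on this host. A minimal sketch of enabling them before reproducing the crash (this assumes the JVM is launched from your shell as the same user; a NodeManager supervised by the Cloudera Manager agent may not inherit shell limits):

ulimit -c unlimited    # allow core files of unlimited size in this shell and its children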

and this is from my log.out:

NodeManager
Node Manager health check script is not available or doesn't have execute permission, so not starting the node health script runner.

AsyncDispatcher
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher

AsyncDispatcher
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher

AsyncDispatcher
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService

AsyncDispatcher
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServicesEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices

AsyncDispatcher
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl

AsyncDispatcher
Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncherEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher

1 ACCEPTED SOLUTION

Rising Star

@bgooley,

thank you for your feedback and your clear explanation.

In fact, the problem was resolved by removing the contents of the /var/lib/hadoop-yarn/yarn-nm-recovery/ directory, after which the NodeManager role started successfully.

The solution I found came from:
https://community.cloudera.com/t5/Batch-Processing-and-Workflow/Yarn-NodeManager-fails-to-start-and-...
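For reference, a minimal sketch of that cleanup (the recovery-store path is the one from this thread; stopping the role first and keeping a backup are my own precautions, not part of the original steps):

# stop the NodeManager role in Cloudera Manager first, then on the affected host:
sudo mkdir -p /var/tmp/yarn-nm-recovery.bak
sudo mv /var/lib/hadoop-yarn/yarn-nm-recovery/* /var/tmp/yarn-nm-recovery.bak/
# then start the NodeManager role again from Cloudera Manager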


2 REPLIES

Master Guru

@Bildervic,

"SIGBUS (0x7)" can mean a few things, but one of the most common causes is that a directory Java needs to use is full (no free disk space left).

The fact that your NodeManager was running, then failed, and then failed to start again is consistent with that kind of cause.

Since the crash is in libleveldbjni, that is further evidence that a directory may be full, as it indicates Java was accessing local files on disk.

I would suggest checking disk space on all volumes on that host. If a volume is full, try freeing up some space and then start the NodeManager again.
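A quick way to run that check (standard commands; which mounts matter, e.g. the YARN local and log directories, depends on your layout):

df -h    # free space per mounted filesystem
df -i    # free inodes; an exhausted inode table can produce the same symptom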