NodeManager is failing soon after restart

Rising Star

I have three NodeManagers in my cluster, and they all fail soon after a restart.

I tried deleting the /var/log/hadoop-yarn/nodemanager/recovery-state directory, but that did not help.

Please help. Where can I find the log that shows why they are failing?

Re: NodeManager is failing soon after restart

Super Mentor

@Prakash Punj

You mentioned that the NodeManagers fail soon after a restart. Do you see any errors in the NodeManager logs? Can you please share them?

Re: NodeManager is failing soon after restart

Rising Star

@Jay Kumar SenSharma - here is the log. When I restart the NodeManager, it stays up for some time before going down.

[root@D02 yarn]# tail -10 yarn-yarn-nodemanager-D02.asotc.com.log


2018-06-26 16:15:09,830 INFO  shuffle.ExternalShuffleBlockResolver (ExternalShuffleBlockResolver.java:application) - Application application_1527390036186_1818 removed, cleanupLocalDirs = false
2018-06-26 16:15:09,854 INFO  application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application applic0036186_1818 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2018-06-26 16:15:09,854 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregationlication just finished : application_1527390036186_1818
2018-06-26 16:15:10,267 INFO  zlib.ZlibFactory (ZlibFactory.java:<clinit>(49)) - Successfully loaded & initializeb library
2018-06-26 16:15:10,356 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1527390000001 (auth:SIMPLE)
2018-06-26 16:15:10,377 INFO  compress.CodecPool (CodecPool.java:getCompressor(153)) - Got brand-new compressor [
2018-06-26 16:15:10,391 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggreg- Uploading logs for container container_e297_1527390036186_1818_03_000001. Current good log dirs are /data/hadoo
2018-06-26 16:15:10,454 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInte Start request for container_e297_1527390036186_1825_01_000001 by user dr.who
2018-06-26 16:15:10,455 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInte Creating a new application reference for app application_1527390036186_1825
2018-06-26 16:15:10,473 INFO  application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application applic0036186_1825 transitioned from NEW to INITING
2018-06-26 16:15:10,485 WARN  logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRem5)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found]. The cluster may have problems with multiple users.
2018-06-26 16:15:10,486 WARN  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(190)) - rollierval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this applinished.
2018-06-26 16:15:10,557 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=dr.who       I0       OPERATION=Start Container Request       TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application86_1825 CONTAINERID=container_e297_1527390036186_1825_01_000001
2018-06-26 16:15:10,574 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(46ng path : /data/hadoop/yarn/log/application_1527390036186_1818/container_e297_1527390036186_1818_03_000001/launchh
2018-06-26 16:15:10,575 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(46ng path : /data/hadoop/yarn/log/application_1527390036186_1818/container_e297_1527390036186_1818_03_000001/direct
2018-06-26 16:15:10,679 INFO  application.ApplicationImpl (ApplicationImpl.java:transition(304)) - Adding contain390036186_1825_01_000001 to application application_1527390036186_1825
2018-06-26 16:15:10,680 INFO  application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application applic0036186_1825 transitioned from INITING to RUNNING
2018-06-26 16:15:10,680 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e29186_1825_01_000001 transitioned from NEW to LOCALIZED
2018-06-26 16:15:10,680 INFO  containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_Id application_1527390036186_1825
2018-06-26 16:15:10,680 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(183)) - Initiainer container_e297_1527390036186_1825_01_000001
2018-06-26 16:15:10,902 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e29186_1825_01_000001 transitioned from LOCALIZED to RUNNING
2018-06-26 16:15:10,910 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExe- launchContainer: [bash, /data/hadoop/yarn/local/usercache/dr.who/appcache/application_1527390036186_1825/contai7390036186_1825_01_000001/default_container_executor.sh]
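
The tail above contains only INFO and WARN entries, so the reason for the shutdown is probably earlier in the file. A minimal sketch for pulling out just the ERROR and FATAL lines, assuming the log keeps the `date time LEVEL` layout shown above (`nm_errors` is a hypothetical helper name, not a Hadoop tool):

```shell
#!/usr/bin/env bash
# Print the last N ERROR/FATAL lines from a NodeManager log.
# Usage: nm_errors <logfile> [count]   (count defaults to 20)
nm_errors() {
  local logfile="$1" count="${2:-20}"
  # Lines look like "2018-06-26 16:15:10,485 WARN  logger ...",
  # so match the level right after the timestamp.
  grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:,]+ (ERROR|FATAL) ' "$logfile" \
    | tail -n "$count"
}
```

For example, `nm_errors yarn-yarn-nodemanager-D02.asotc.com.log` run from the same directory as the `tail` command. If it prints nothing, the matching `.out` file next to the log may hold the JVM-level error instead.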



Re: NodeManager is failing soon after restart

New Contributor

@Prakash Punj Clean up any stale pid files for the NodeManager. If the same server also hosts an HBase RegionServer (RS), try stopping the RS, then start the NM first and the RS after it.

Also look for any zombie processes.
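
The pid-file and zombie checks suggested above can be sketched as follows. This is only a rough outline: `check_pidfile` and `list_zombies` are hypothetical helper names, and the pid file location is an assumption (it depends on `YARN_PID_DIR` in your yarn-env.sh).

```shell
#!/usr/bin/env bash
# Report whether a pid file is stale (its process no longer exists).
# Run as root or as the yarn user: kill -0 also fails on processes
# you are not allowed to signal, which would look like "stale".
# Usage: check_pidfile <pidfile>
check_pidfile() {
  local pidfile="$1" pid
  if [ ! -f "$pidfile" ]; then
    echo "missing: $pidfile"
    return 0
  fi
  pid="$(tr -d '[:space:]' < "$pidfile")"
  # kill -0 sends no signal; it only tests that the pid exists.
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "live: $pidfile (pid $pid)"
  else
    echo "stale: $pidfile (pid ${pid:-none})"
  fi
}

# List zombie (defunct) processes: STAT column starts with Z.
list_zombies() {
  ps -eo pid=,ppid=,stat=,comm= | awk '$3 ~ /^Z/'
}
```

With the NodeManager stopped, run `check_pidfile` on its pid file (for example under /var/run/hadoop-yarn/yarn/, if that is where your install writes it). A "stale" result means the file can be removed before the next start. `list_zombies` prints the pid, parent pid, state, and command of each defunct process; the parent pid shows which process has not reaped it.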