NodeManager is failing soon after restart

I have 3 NodeManagers in my cluster, and they all fail soon after every restart.

I tried deleting the /var/log/hadoop-yarn/nodemanager/recovery-state directory, but that did not help.

Please help. Where can I find the log that shows why they are failing?


@Prakash Punj

You mentioned that the NodeManagers fail soon after restart. Do you see any errors in the NodeManager logs? Can you please share the logs?
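If the log is large, a small filter can help surface the entries worth sharing. The following is a minimal sketch, assuming the log lines start with a timestamp followed by the level (the layout of the lines pasted in this thread). The function name `filter_by_level` and the sample lines are illustrative only and are not part of Hadoop.

```python
import re

# Matches lines shaped like "2018-06-26 16:15:10,485 WARN  some.Class ...".
# This layout is an assumption based on the log pasted in this thread.
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+(?P<level>[A-Z]+)\s"
)


def filter_by_level(lines, levels=("ERROR", "FATAL")):
    """Return the log lines whose level is one of `levels`."""
    wanted = set(levels)
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") in wanted:
            hits.append(line.rstrip("\n"))
    return hits


# Illustrative input only; these are not real NodeManager messages.
sample_lines = [
    "2018-06-26 16:15:10,473 INFO  example.ClassA ( - something routine",
    "2018-06-26 16:15:10,485 WARN  example.ClassB ( - something suspicious",
    "2018-06-26 16:15:11,000 ERROR example.ClassC ( - something broken",
]

for hit in filter_by_level(sample_lines):
    print(hit)
```

To run it against a real file, pass an open file object instead of the sample list, for example `filter_by_level(open(path, errors="replace"))`, where `path` is wherever your NodeManager log lives. Passing `levels=("WARN", "ERROR", "FATAL")` widens the filter to include warnings.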

@Jay Kumar SenSharma - here is the log. After a restart, the NodeManager stays up for some time before going down.

[root@D02 yarn]# tail -10

2018-06-26 16:15:09,830 INFO  shuffle.ExternalShuffleBlockResolver ( - Application application_1527390036186_1818 removed, cleanupLocalDirs = false
2018-06-26 16:15:09,854 INFO  application.ApplicationImpl ( - Application applic0036186_1818 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2018-06-26 16:15:09,854 INFO  logaggregation.AppLogAggregatorImpl ( just finished : application_1527390036186_1818
2018-06-26 16:15:10,267 INFO  zlib.ZlibFactory (<clinit>(49)) - Successfully loaded & initializeb library
2018-06-26 16:15:10,356 INFO  ipc.Server ( - Auth successful for appattempt_1527390000001 (auth:SIMPLE)
2018-06-26 16:15:10,377 INFO  compress.CodecPool ( - Got brand-new compressor [
2018-06-26 16:15:10,391 INFO  logaggregation.AppLogAggregatorImpl ( Uploading logs for container container_e297_1527390036186_1818_03_000001. Current good log dirs are /data/hadoo
2018-06-26 16:15:10,454 INFO  containermanager.ContainerManagerImpl ( Start request for container_e297_1527390036186_1825_01_000001 by user dr.who
2018-06-26 16:15:10,455 INFO  containermanager.ContainerManagerImpl ( Creating a new application reference for app application_1527390036186_1825
2018-06-26 16:15:10,473 INFO  application.ApplicationImpl ( - Application applic0036186_1825 transitioned from NEW to INITING
2018-06-26 16:15:10,485 WARN  logaggregation.LogAggregationService ( - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found]. The cluster may have problems with multiple users.
2018-06-26 16:15:10,486 WARN  logaggregation.AppLogAggregatorImpl (<init>(190)) - rollierval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this applinished.
2018-06-26 16:15:10,557 INFO  nodemanager.NMAuditLogger ( - USER=dr.who       I0       OPERATION=Start Container Request       TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application86_1825 CONTAINERID=container_e297_1527390036186_1825_01_000001
2018-06-26 16:15:10,574 INFO  nodemanager.DefaultContainerExecutor ( path : /data/hadoop/yarn/log/application_1527390036186_1818/container_e297_1527390036186_1818_03_000001/launchh
2018-06-26 16:15:10,575 INFO  nodemanager.DefaultContainerExecutor ( path : /data/hadoop/yarn/log/application_1527390036186_1818/container_e297_1527390036186_1818_03_000001/direct
2018-06-26 16:15:10,679 INFO  application.ApplicationImpl ( - Adding contain390036186_1825_01_000001 to application application_1527390036186_1825
2018-06-26 16:15:10,680 INFO  application.ApplicationImpl ( - Application applic0036186_1825 transitioned from INITING to RUNNING
2018-06-26 16:15:10,680 INFO  container.ContainerImpl ( - Container container_e29186_1825_01_000001 transitioned from NEW to LOCALIZED
2018-06-26 16:15:10,680 INFO  containermanager.AuxServices ( - Got event CONTAINER_Id application_1527390036186_1825
2018-06-26 16:15:10,680 INFO  yarn.YarnShuffleService ( - Initiainer container_e297_1527390036186_1825_01_000001
2018-06-26 16:15:10,902 INFO  container.ContainerImpl ( - Container container_e29186_1825_01_000001 transitioned from LOCALIZED to RUNNING
2018-06-26 16:15:10,910 INFO  nodemanager.DefaultContainerExecutor ( launchContainer: [bash, /data/hadoop/yarn/local/usercache/dr.who/appcache/application_1527390036186_1825/contai7390036186_1825_01_000001/]


@Prakash Punj Clean up any stale PID files for the NodeManager. If the same server also hosts a RegionServer (RS), try stopping the RS, then start the NodeManager (NM) first and the RS after it.

Also look for any zombie processes.
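Both checks can be scripted. The sketch below is one possible way to do it on Linux, not an official Hadoop tool: the helper names are made up for illustration, and the PID file location is left as a parameter because it depends on how the cluster was installed.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale_pid_file(path):
    """Return True if the PID file does not point at a live process."""
    with open(path) as fh:
        text = fh.read().strip()
    if not text.isdigit() or int(text) <= 0:
        return True  # empty or corrupt file, treated as stale here
    return not pid_is_running(int(text))


def find_zombies(proc_root="/proc"):
    """Return PIDs whose state is Z, read from <proc_root>/<pid>/stat."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The state letter is the first field after the ")" that closes
        # the command name.
        fields = stat.rsplit(")", 1)[-1].split()
        if fields and fields[0] == "Z":
            zombies.append(int(entry))
    return sorted(zombies)
```

A typical use would be to call `is_stale_pid_file()` on the NodeManager PID file while the service is stopped, remove the file if it is stale, and then check that `find_zombies()` returns an empty list before starting the NodeManager again.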

