Created 06-15-2018 02:03 AM
I have 3 NodeManagers in my cluster, and they fail soon after every restart.
I tried deleting the /var/log/hadoop-yarn/nodemanager/recovery-state directory, but that did not help.
Please help. Where can I find the log that shows why they are failing?
Created 06-15-2018 02:31 AM
You mentioned that the NodeManagers fail soon after restart. Do you see any errors in the NodeManager logs? Can you please share the logs?
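If the log is large, it can help to look only at entries above INFO level first. The following is a minimal sketch, not an official tool: it assumes the usual NodeManager log layout where each entry starts with a date, a time, and then the level, and the function name `nm_log_problems` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print only WARN/ERROR/FATAL entries from a NodeManager log.
# Assumption: each entry starts with "YYYY-MM-DD HH:MM:SS,mmm LEVEL ...",
# so the level is the third whitespace-separated field.
# Stack-trace continuation lines do not match this layout and are skipped.
nm_log_problems() {
    local logfile="$1"
    awk '$3 == "WARN" || $3 == "ERROR" || $3 == "FATAL"' "$logfile"
}
```

A possible invocation would be `nm_log_problems <path-to-nodemanager-log> | tail -50`, run from the directory that holds the yarn-yarn-nodemanager-*.log file. The exact log directory depends on how the cluster was installed, so check your own configuration for the path.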
Created 06-26-2018 08:34 PM
@Jay Kumar SenSharma - here is the log. When I restart the NodeManager, it stays up for some time before going down.
[root@D02 yarn]# tail -10 yarn-yarn-nodemanager-D02.asotc.com.log
2018-06-26 16:15:09,830 INFO shuffle.ExternalShuffleBlockResolver (ExternalShuffleBlockResolver.java:application) - Application application_1527390036186_1818 removed, cleanupLocalDirs = false
2018-06-26 16:15:09,854 INFO application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application applic0036186_1818 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2018-06-26 16:15:09,854 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregationlication just finished : application_1527390036186_1818
2018-06-26 16:15:10,267 INFO zlib.ZlibFactory (ZlibFactory.java:<clinit>(49)) - Successfully loaded & initializeb library
2018-06-26 16:15:10,356 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1527390000001 (auth:SIMPLE)
2018-06-26 16:15:10,377 INFO compress.CodecPool (CodecPool.java:getCompressor(153)) - Got brand-new compressor [
2018-06-26 16:15:10,391 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggreg- Uploading logs for container container_e297_1527390036186_1818_03_000001. Current good log dirs are /data/hadoo
2018-06-26 16:15:10,454 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInte Start request for container_e297_1527390036186_1825_01_000001 by user dr.who
2018-06-26 16:15:10,455 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInte Creating a new application reference for app application_1527390036186_1825
2018-06-26 16:15:10,473 INFO application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application applic0036186_1825 transitioned from NEW to INITING
2018-06-26 16:15:10,485 WARN logaggregation.LogAggregationService (LogAggregationService.java:verifyAndCreateRem5)) - Remote Root Log Dir [/app-logs] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found]. The cluster may have problems with multiple users.
2018-06-26 16:15:10,486 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(190)) - rollierval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this applinished.
2018-06-26 16:15:10,557 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=dr.who I0 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application86_1825 CONTAINERID=container_e297_1527390036186_1825_01_000001
2018-06-26 16:15:10,574 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(46ng path : /data/hadoop/yarn/log/application_1527390036186_1818/container_e297_1527390036186_1818_03_000001/launchh
2018-06-26 16:15:10,575 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(46ng path : /data/hadoop/yarn/log/application_1527390036186_1818/container_e297_1527390036186_1818_03_000001/direct
2018-06-26 16:15:10,679 INFO application.ApplicationImpl (ApplicationImpl.java:transition(304)) - Adding contain390036186_1825_01_000001 to application application_1527390036186_1825
2018-06-26 16:15:10,680 INFO application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application applic0036186_1825 transitioned from INITING to RUNNING
2018-06-26 16:15:10,680 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e29186_1825_01_000001 transitioned from NEW to LOCALIZED
2018-06-26 16:15:10,680 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_Id application_1527390036186_1825
2018-06-26 16:15:10,680 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(183)) - Initiainer container_e297_1527390036186_1825_01_000001
2018-06-26 16:15:10,902 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e29186_1825_01_000001 transitioned from LOCALIZED to RUNNING
2018-06-26 16:15:10,910 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExe- launchContainer: [bash, /data/hadoop/yarn/local/usercache/dr.who/appcache/application_1527390036186_1825/contai7390036186_1825_01_000001/default_container_executor.sh]
Created 06-18-2018 03:53 AM
@Prakash Punj Clean up any stale PID files for the NodeManager. If the same server also hosts a RegionServer, try stopping the RegionServer, then start the NodeManager first and the RegionServer after it.
Also look for any zombie processes.
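Both checks can be scripted. This is a rough sketch under stated assumptions rather than a definitive procedure: the helper names `pid_file_is_stale` and `list_zombies` are hypothetical, the `/proc` fallback assumes Linux, and the location of the NodeManager PID file varies by installation, so it is passed in as an argument instead of being hard-coded.

```shell
#!/usr/bin/env bash
# Return 0 if the PID file exists but no live process has that PID.
pid_file_is_stale() {
    local pid_file="$1" pid
    [ -f "$pid_file" ] || return 1            # no PID file, so nothing is stale
    pid=$(tr -d '[:space:]' < "$pid_file")
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;              # empty or non-numeric: treat as stale
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # The /proc check (Linux only) covers processes owned by another user.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        return 1                              # process is alive
    fi
    return 0
}

# List zombie (defunct) processes: those whose state starts with "Z".
list_zombies() {
    ps -A -o pid= -o ppid= -o stat= -o comm= | awk '$3 ~ /^Z/'
}
```

For example, `pid_file_is_stale <path-to-nodemanager-pid-file> && echo "stale"` reports a leftover PID file that can be removed before restarting, and `list_zombies` prints the PID, parent PID, state, and command of each defunct process. A zombie can only be cleared by its parent, so the parent PID column is the one to follow up on.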