NodeManager is going down a few seconds after starting


I have a cluster of 6 data nodes, and on starting the NodeManager it goes down on every data node. I checked the logs at /var/log/hadoop-yarn/yarn and there is no error message.

These are the warning messages I got:

2018-07-06 01:53:37,256 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-06 01:53:39,261 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-06 01:53:44,913 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1530855405184_0121_02_000001 is : 143
2018-07-06 01:53:44,933 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.who  OPERATION=Container Finished - Failed  TARGET=ContainerImpl  RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  APPID=application_1530855405184_0121  CONTAINERID=container_1530855405184_0121_02_000001
2018-07-06 01:56:11,578 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
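
The repeated rollingMonitorInterval warning is informational only: with the interval left at -1, rolling log aggregation is disabled and the logs are aggregated once the application finishes. If periodic rolling were wanted, it would be enabled in yarn-site.xml roughly like this (a sketch; the property name is assumed from the YARN 2.x defaults, it is not quoted anywhere in this thread):

<!-- yarn-site.xml (sketch): enable rolling log aggregation instead of waiting for app finish -->
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>  <!-- roll up aggregated logs every hour; -1 (the default) disables rolling -->
</property>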

3 REPLIES

@Punit kumar

The log trace says that the container exited with code 143; there is no trace of the NodeManager itself going down. Please check the logs again, or share them here.


@Sandeep Nemuri

I checked the logs again; this time I got the errors below, along with the code 143 error.

2018-07-09 04:20:14,067 WARN  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1083)) - Event EventType: FINISH_APPLICATION sent to absent application application_1531115940804_0668
2018-07-09 04:20:15,985 WARN  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 04:20:16,705 WARN  localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1023)) - { hdfs://ip-172-31-17-251.ec2.internal:8020/tmp/hive/root/_tez_session_dir/41b28f9d-f7ae-4652-b9a7-ffed72220a41/.tez/application_1531115940804_0607/tez.session.local-resources.pb, 1531123411553, FILE, null } failed: File does not exist: hdfs://ip-172-31-17-251.ec2.internal:8020/tmp/hive/root/_tez_session_dir/41b28f9d-f7ae-4652-b9a7-ffed72220a41/.tez/application_1531115940804_0607/tez.session.local-resources.pb
2018-07-09 04:20:16,719 WARN  nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=root  OPERATION=Container Finished - Failed  TARGET=ContainerImpl  RESULT=FAILURE  DESCRIPTION=Container failed with state: LOCALIZATION_FAILED  APPID=application_1531115940804_0607  CONTAINERID=container_1531115940804_0607_02_000001
2018-07-09 04:20:16,719 WARN  ipc.Client (Client.java:call(1446)) - interrupted waiting to send rpc request to server
2018-07-09 04:20:52,186 WARN  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 04:25:34,693 WARN  containermanager.AuxServices (AuxServices.java:serviceInit(130)) - The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
2018-07-09 04:25:34,761 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:serviceInit(154)) - NodeManager configured with 61.4 G physical memory allocated to containers, which is more than 80% of the total physical memory available (62.9 G). Thrashing might happen.
2018-07-09 04:25:35,592 WARN  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1067)) - Event EventType: KILL_CONTAINER sent to absent container container_1531115940804_0760_01_000001
2018-07-09 04:25:35,592 WARN  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1083)) - Event EventType: FINISH_APPLICATION sent to absent application application_1531115940804_0760 

I have 6 data nodes (62 GB RAM, 16 vcores each).

mapred-site.xml

mapreduce.map.java.opts  -Xmx6656m
mapreduce.map.memory.mb    10000
mapreduce.reduce.java.opts -Xmx12800m
mapreduce.reduce.memory.mb  16000 
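
For reference, the same settings would appear in mapred-site.xml roughly as below (a sketch built only from the values listed above; note that the -Xmx heap sizes are kept below the corresponding container sizes):

<!-- mapred-site.xml (sketch of the values listed above) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>10000</value>  <!-- container size for map tasks, in MB -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx6656m</value>  <!-- map JVM heap, below the 10000 MB container -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>16000</value>  <!-- container size for reduce tasks, in MB -->
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx12800m</value>  <!-- reduce JVM heap, below the 16000 MB container -->
</property>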

yarn-site.xml

yarn.nodemanager.resource.memory-mb      62900
yarn.scheduler.minimum-allocation-mb     6656
yarn.scheduler.maximum-allocation-mb     62900
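
Note that 62900 MB is about 61.4 G, which is exactly the figure in the ContainersMonitorImpl warning above: more than 80% of the node's 62.9 G physical memory is handed to containers, so thrashing is possible. In yarn-site.xml these settings would look roughly like this (a sketch with the same values; leaving more headroom for the OS and the DataNode/NodeManager daemons is a common guideline, not something confirmed in this thread):

<!-- yarn-site.xml (sketch of the values listed above) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>62900</value>  <!-- ~61.4 G, nearly all of the node's 62.9 G RAM -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>6656</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>62900</value>
</property>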

New Contributor

yarn-hdfs-nodemanagerlog.tar.gz

I have the same problem as @Punit kumar. I have 5 Amazon VMs (8 GB RAM, 100 GB HDD for data, 4 vCPUs) and installed HDP 2.6 and HDP 2.5; everything is fine except that one NodeManager stops on its own a few seconds after starting.

At first I thought the problem was a wrong Ambari configuration, but after trying manual installations of Hadoop 2.7.1, 2.7.3, and 2.8.4 the problem is the same. Hadoop installs fine and the NameNode, DataNodes, and ResourceManager are all running, but when I test with a MapReduce job (https://github.com/asmith26/python-mapreduce-examples) the NodeManager stops on its own. Please see the attached haddop.tar.gz config file.

I have tried both jdk1.7.0_67 and jdk1.8.0_112; same problem 😞