Created 07-09-2018 10:48 AM
I have a 6-DataNode cluster, and every second all the NodeManagers are going down. I posted the log in
https://community.hortonworks.com/questions/202914/node-manager-is-getting-down-after-few-seconds.ht... and now the reducer jobs are also failing.
2018-07-09 06:25:26,262 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 06:25:26,616 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1531130193317_0195_02_000001 is : 143
2018-07-09 06:25:26,624 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1531130193317_0197_01_000001 is : 143
2018-07-09 06:25:26,712 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.whoOPERATION=Container Finished - FailedTARGET=ContainerImplRESULT=FAILUREDESCRIPTION=Container failed with state: EXITED_WITH_FAILUREAPPID=application_1531130193317_0195CONTAINERID=container_1531130193317_0195_02_000001
2018-07-09 06:25:26,819 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.whoOPERATION=Container Finished - FailedTARGET=ContainerImplRESULT=FAILUREDESCRIPTION=Container failed with state: EXITED_WITH_FAILUREAPPID=application_1531130193317_0197CONTAINERID=container_1531130193317_0197_01_000001
2018-07-09 06:25:30,271 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 06:25:30,534 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1531130193317_0198_02_000001 is : 143
2018-07-09 06:25:30,600 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.whoOPERATION=Container Finished - FailedTARGET=ContainerImplRESULT=FAILUREDESCRIPTION=Container failed with state: EXITED_WITH_FAILUREAPPID=application_1531130193317_0198CONTAINERID=container_1531130193317_0198_02_000001
2018-07-09 06:25:31,258 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:<init>(182)) - rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2018-07-09 06:25:31,422 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1531130193317_0200_02_000001 is : 143
2018-07-09 06:25:31,486 WARN nodemanager.NMAuditLogger (NMAuditLogger.java:logFailure(150)) - USER=dr.whoOPERATION=Container Finished - FailedTARGET=ContainerImplRESULT=FAILUREDESCRIPTION=Container failed with state: EXITED_WITH_FAILUREAPPID=application_1531130193317_0200CONTAINERID=container_1531130193317_0200_02_000001
Created 07-11-2018 11:04 PM
Hi @Punit kumar!
AFAIK exit code 143 means the container was killed with SIGTERM (128 + 15), and that is usually related to memory/GC issues.
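In case the 143 looks arbitrary, here is a small sketch of how I read it. It assumes the common shell convention that an exit status above 128 means "killed by signal (status - 128)"; the helper name describe_exit_code is just something I made up for illustration, not part of YARN.

```python
import signal


def describe_exit_code(code):
    """Interpret an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            name = "an unknown signal"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {code}"


print(describe_exit_code(143))  # killed by SIGTERM (signal 15)
```

This only tells you that something sent SIGTERM to the container. It does not tell you who sent it or why, so the memory settings still need to be checked.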
Could you enable DEBUG mode for the YARN logs?
Also, please share what kind of job you are running and your app (AM), map, and reduce memory properties (the opts as well), plus the NodeManager resources.
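If it helps to collect those values, below is a rough sketch that pulls the usual memory-related properties out of a Hadoop configuration XML. The property names are the standard MapReduce/YARN ones, but the values in SAMPLE are made up for illustration and are not a recommendation, and the helper names read_properties and memory_settings are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard MapReduce/YARN memory-related property names.
MEMORY_PROPERTIES = [
    "yarn.app.mapreduce.am.resource.mb",
    "yarn.app.mapreduce.am.command-opts",
    "mapreduce.map.memory.mb",
    "mapreduce.map.java.opts",
    "mapreduce.reduce.memory.mb",
    "mapreduce.reduce.java.opts",
    "yarn.nodemanager.resource.memory-mb",
    "yarn.scheduler.maximum-allocation-mb",
]


def read_properties(xml_text):
    """Return every <property> name/value pair from a Hadoop config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name:
            props[name.strip()] = (value or "").strip()
    return props


def memory_settings(xml_text):
    """Keep only the memory-related properties that are actually set."""
    props = read_properties(xml_text)
    return {key: props[key] for key in MEMORY_PROPERTIES if key in props}


# Hypothetical sample values, only to show the expected file shape.
SAMPLE = """
<configuration>
  <property><name>mapreduce.map.memory.mb</name><value>2048</value></property>
  <property><name>mapreduce.map.java.opts</name><value>-Xmx1638m</value></property>
  <property><name>mapreduce.job.reduces</name><value>4</value></property>
</configuration>
"""

for key, value in memory_settings(SAMPLE).items():
    print(f"{key} = {value}")
```

To use it on a real cluster, pass in the contents of your mapred-site.xml and yarn-site.xml instead of SAMPLE. I am assuming they are readable on the node where you run this; the location depends on your install.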
Thanks.