Hi folks,
Hope you are all doing well!
We are running HDP 2.6.5 on a 20-node cluster. Every day we see NodeManager health issues and connection-refused errors, and sometimes the NodeManager restarts itself. I collected the following from the NodeManager log file:
SHUTDOWN LOGS:
2019-04-02 22:28:20,947 INFO monitor.ContainersMonitorImpl - Memory usage of ProcessTree 6948 for container-id container_e17_1553506205851_46103_01_000054: 520.5 MB of 2 GB physical memory used; 3.6 GB of 4.2 GB virtual memory used
2019-04-02 22:28:20,948 WARN monitor.ContainersMonitorImpl - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2019-04-02 22:28:21,162 INFO launcher.ContainerLaunch - Container container_e17_1553506205851_46103_01_000055 succeeded
2019-04-02 22:28:21,660 INFO ipc.Server - Stopping server on 8040
2019-04-02 22:28:21,661 INFO ipc.Server - Stopping IPC Server Responder
2019-04-02 22:28:21,662 INFO localizer.ResourceLocalizationService - Public cache exiting
2019-04-02 22:28:21,663 INFO ipc.Server - Stopping IPC Server listener on 8040
2019-04-02 22:28:21,683 INFO nodemanager.NodeManager - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at wn
NodeManager connection refused and bad health:
2019-04-05 06:00:02,972 INFO nodemanager.NodeStatusUpdaterImpl - Sending out 61 NM container statuses: [[container_e23_1554290874215_9521_01_000002, CreateTime: 1554442308478, State: RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , ExitStatus: -1000, Priority: 0], [container_e23_1554290874215_9607_01_000075, CreateTime: 1554443718725, State: COMPLETE, Capability: <memory:2048, vCores:1>, Diagnostics: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
, ExitStatus: 143, Priority: 20], [container_e23_1554290874215_9607_01_000076, CreateTime: 1554443718726, State: COMPLETE, Capability: <memory:2048, vCores:1>, Diagnostics: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
The ApplicationMaster is killing the containers without reporting any error.
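In case it helps with triage, here is a minimal sketch of how the container statuses in a NodeStatusUpdaterImpl entry like the one above could be tallied by state and exit status. The function names (`tally_statuses`, `describe_exit`) are made up for illustration and are not part of any Hadoop tooling, and the regular expression assumes the exact field layout shown in my log excerpt. The only outside fact it relies on is the usual shell convention that an exit code above 128 means the process was ended by signal number (code - 128), so 143 corresponds to signal 15 (SIGTERM).

```python
import re
from collections import Counter

# Assumption: each status entry contains "State: <NAME>, ... ExitStatus: <N>"
# in that order, as in the log excerpt. DOTALL lets the lazy match span the
# multi-line Diagnostics text.
STATUS_RE = re.compile(r"State: (\w+),.*?ExitStatus: (-?\d+)", re.DOTALL)


def tally_statuses(log_text):
    """Count (state, exit_status) pairs found in one status-report log entry."""
    return Counter(
        (state, int(code)) for state, code in STATUS_RE.findall(log_text)
    )


def describe_exit(code):
    """Exit codes above 128 conventionally mean 'ended by signal (code - 128)'."""
    if code > 128:
        return f"killed by signal {code - 128}"
    return "no signal"


# Sample text copied from the log excerpt (two of the 61 statuses).
sample = (
    "[[container_e23_1554290874215_9521_01_000002, CreateTime: 1554442308478, "
    "State: RUNNING, Capability: <memory:4096, vCores:1>, Diagnostics: , "
    "ExitStatus: -1000, Priority: 0], [container_e23_1554290874215_9607_01_000075, "
    "CreateTime: 1554443718725, State: COMPLETE, Capability: <memory:2048, vCores:1>, "
    "Diagnostics: Container killed by the ApplicationMaster.\n"
    "Container killed on request. Exit code is 143\n"
    "Container exited with a non-zero exit code 143.\n"
    ", ExitStatus: 143, Priority: 20]"
)

counts = tally_statuses(sample)
for (state, code), n in sorted(counts.items()):
    print(state, code, describe_exit(code), n)
```

Run against the full log file, a tally like this should show whether the exit-143 containers are a small fraction of the total or the bulk of what the NodeManager is reporting.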
Could someone help me sort out this issue?
Regards,
Vinay K