We have a huge production Hadoop cluster running HDP version 2.6.5 and Ambari version 18.104.22.168, and all machines run RHEL 7.6.
The cluster size is as follows:
Total worker machines: 425 (each worker runs the DataNode and NodeManager services)
From time to time we get an indication that one or two **node-manager** instances are lost; Ambari reports this as 424/425 when the total number of node managers is 425.
To fix it we just restart the **node-manager**; this fixes the problem and we get 425/425 again.
After some googling, we found the following parameters that may need better tuning:
yarn.client.nodemanager-connect.max-wait-ms (it is configured to 60000 ms and we are thinking of increasing it)
yarn.client.nodemanager-connect.retry-interval-ms (it is configured to 10 sec, i.e. 10000 ms, and we are thinking of increasing it)
yarn.nm.liveness-monitor.expiry-interval-ms (this parameter is not configured yet and we are thinking of adding it with a value of 1500000 ms)
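For reference, this is roughly how the three settings would look as a yarn-site.xml fragment. It is only a sketch: the first two values are the ones currently configured, the third is the value we are considering (not a tested recommendation), and on an Ambari-managed cluster they would normally be set through the YARN configuration screens rather than by editing the file by hand.

```xml
<!-- yarn-site.xml fragment (sketch only; values are the ones mentioned in this post) -->
<property>
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <!-- currently configured value: 60 seconds -->
  <value>60000</value>
</property>
<property>
  <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
  <!-- currently configured value: 10 seconds -->
  <value>10000</value>
</property>
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <!-- not configured yet; 1500000 ms (25 minutes) is the value under consideration -->
  <value>1500000</value>
</property>
```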
Based on the above details, I would appreciate any comments or other ideas.
A NodeManager in the LOST state means that the ResourceManager has not received heartbeats from it for yarn.nm.liveness-monitor.expiry-interval-ms milliseconds (the default is 600000 ms, i.e. 10 minutes).
Hadoop uses the dfs.hosts.exclude property in hdfs-site.xml as a pointer to a file that lists the nodes to be excluded.
Since there is no default value for this property, the Hadoop cluster will not exclude any nodes when dfs.hosts.exclude is not set to a file location.
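As an illustration, the property could be set in hdfs-site.xml along the following lines. This is a minimal sketch: the file path is an assumption chosen for the example, so use whatever location your cluster already defines, and the file itself contains one hostname per line.

```xml
<!-- hdfs-site.xml fragment (sketch only) -->
<property>
  <name>dfs.hosts.exclude</name>
  <!-- hypothetical path to the exclude file; adjust to your environment -->
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```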
If dfs.hosts.exclude is not set in your cluster, take the actions listed below.
When dfs.hosts.exclude is already configured, add the hostname of the node you intend to remove to the file that dfs.hosts.exclude points to.
After adding the hostname to the exclusion file, run the following command to exclude the node from functioning as a DataNode.
The command below will exclude the node from functioning as a NodeManager.
After the above actions, you should see one DataNode marked as decommissioned in Ambari. No data blocks will be sent to this DataNode, as YARN has already marked it as unusable.
Hope that answers your question
Dear @Shelton, it has been a long time since we last met; glad to see you again.
Back to my question: we are talking about the node manager, and my goal is to avoid cases where the node-manager service dies or falls out of sync with the resource manager. Please forgive me, but I do not understand why you are talking about the data node and excluding a data node from the cluster, because the question is on a different subject. As I mentioned, we want to understand the root cause of the lost node manager and what proactive steps we can take to avoid this kind of problem.
Additionally, as I understand it, most of these problems are the result of a bad network that breaks the connectivity between the node manager and the resource manager. Since this behavior does happen from time to time, we are trying to set a configuration that keeps the cluster stable in spite of networking or infrastructure problems.
Let me know if my question is clear so we can continue our discussion, and sorry again if my first post was not clear.
I also want to say that restarting the node-manager, or fully restarting the YARN service, fixed the problem, but as you know this is not the right solution to apply every time one of the node managers dies.