
Yarn + how to avoid node manager being marked as lost


We have a huge production Hadoop cluster running HDP 2.6.5 and Ambari 2.6.2.2, and all machines run RHEL 7.6.

 

The cluster size is as follows:

Total worker machines - 425 (each worker runs a DataNode and a NodeManager service)

 

From time to time we get an indication that one or two **node-manager** services are lost, and Ambari shows this as 424/425 (when the total number of NodeManagers is 425).

 

To fix it we simply restart the **node-manager**, and this action resolves the problem, so we get back to 425/425.

 

After some googling, we found the following parameters that should perhaps be tuned better:

 

yarn.client.nodemanager-connect.max-wait-ms (currently configured to 60000 ms; we are thinking of increasing it)

yarn.client.nodemanager-connect.retry-interval-ms (currently configured to 10 seconds, i.e. 10000 ms; we are thinking of increasing it)

yarn.nm.liveness-monitor.expiry-interval-ms (not configured yet; we are thinking of adding it with a value of 1500000 ms)
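
For clarity, this is roughly how those entries look (or would look) in yarn-site.xml; the first two show our current values, and the third shows the value we are considering adding:

<property>
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>60000</value>   <!-- current value; we are considering increasing it -->
</property>
<property>
  <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
  <value>10000</value>   <!-- current value; we are considering increasing it -->
</property>
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>1500000</value> <!-- not set yet; the default is 600000 (10 minutes) -->
</property>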

 

Based on the above details, I would appreciate comments or other ideas.

 

Background:
A NodeManager is marked LOST when the ResourceManager has not received heartbeats from it for yarn.nm.liveness-monitor.expiry-interval-ms milliseconds (the default is 10 minutes, i.e. 600000 ms).
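
For reference, the LOST nodes can also be listed from the ResourceManager side with the yarn CLI, for example:

$ yarn node -list -all | grep LOST

or, if your Hadoop version supports the -states option:

$ yarn node -list -states LOST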

Michael-Bronson

Master Mentor

@mike_bronson7 

Hadoop uses the dfs.hosts.exclude property in hdfs-site.xml as a pointer to a file that lists the nodes to be excluded.
Since this property has no default value, the cluster will not exclude any nodes if dfs.hosts.exclude is not set or if the file it points to does not exist.

If dfs.hosts.exclude is not set in your cluster, take the steps listed below.

  • Shut down the NameNode.
  • Edit hdfs-site.xml and add a dfs.hosts.exclude entry pointing to the location of the exclude file.
  • The exclude file is a plain text file; add the hostname(s) you intend to remove to it.
  • Start the NameNode.

If dfs.hosts.exclude is already configured, simply add the hostname you intend to remove to the file it points to.
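
For example, using placeholder values (both the path and the hostname below are only illustrations; use your own):

<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>

The exclude file itself is plain text with one hostname per line, e.g.:

worker-node-101.example.com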

After adding the hostname to the exclusion file, run the command below so the node stops functioning as a DataNode.

 

$ hdfs dfsadmin -refreshNodes

 

The command below will exclude the node from functioning as a NodeManager:

$ yarn rmadmin -refreshNodes
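
Note that yarn rmadmin -refreshNodes reads the exclude file configured by yarn.resourcemanager.nodes.exclude-path in yarn-site.xml, so that property must also point to a file (the path below is again only a placeholder):

<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/conf/yarn.exclude</value>
</property>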

After the above actions, you should see the node marked as decommissioned in Ambari. No new data blocks will be sent to this DataNode, and YARN will no longer schedule containers on it since it has been marked as unusable.
Hope that answers your question.


Dear @Shelton, it has been a long time since we last met, glad to see you again.

Back to my question: since we are talking about the NodeManager, my goal is to avoid cases where the NodeManager service dies or falls out of sync with the ResourceManager. Please forgive me, but I don't understand why you are talking about DataNodes and excluding a DataNode from the cluster, because the question is on a different subject. As I mentioned, we want to understand the root cause of the lost NodeManagers and take proactive steps to avoid such problems.

Additionally, as I understand it, most of these problems are the result of bad networking that breaks the connectivity between the NodeManager and the ResourceManager. So even though this behavior sometimes happens, we are trying to find a configuration that keeps the cluster stable in spite of networking or infrastructure problems.

 

Let me know if my question is clear so we can continue the discussion, and sorry again if my first post was not clear.

Michael-Bronson


I also want to say that a NodeManager restart, or a full restart of the YARN service, fixes the problem, but as you know this isn't the right solution to apply every time one of the NodeManagers dies.

Michael-Bronson