Support Questions
Find answers, ask questions, and share your expertise

NodeManager Health Summary


Hi all,

In YARN Alerts we saw the following critical alert:

1 NodeManager is unhealthy.

We have 36 datanode machines, each running DataNode, Metrics Monitor, and NodeManager.

97501-capture.png

Since one of these machines is the problem, we need to find the problematic one. Can we get advice on how to find the node behind this alert?

Michael-Bronson
5 REPLIES

Re: NodeManager Health Summary

Mentor

@Michael Bronson

The NodeManager is a slave process of YARN, so you should drill down into YARN. In my case, I intentionally brought down a NodeManager so the problematic one would show.

96586-bronson.jpg


Go to the ResourceManager UI and check the Nodes link on the left side of the screen. All your NodeManagers should be listed there, and the reason a node is flagged as unhealthy may be shown. It is most likely due to the YARN local dirs or log dirs: you may be hitting the disk threshold.

Finally, check the logs in /var/log/hadoop-yarn/yarn, NOT in /var/log/hadoop/yarn.
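You can also check node health from the command line with the YARN CLI (run as a user with YARN client configuration; the node id below is a placeholder to fill in from the listing):

```shell
# List every NodeManager with its state, including UNHEALTHY ones
yarn node -list -all

# Print details, including the node health report string, for one node
# (replace <nm-host>:<port> with a node id taken from the listing above)
yarn node -status <nm-host>:<port>
```

The health report in `yarn node -status` output usually names the offending directory, which saves logging into each of the 36 machines.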

Re: NodeManager Health Summary

You said "All your NodeManagers should be listed there and the reason for it being listed as unhealthy may be shown here",

but I do not see anything about NodeManager health.

Please see the following:


97503-capture.png

Michael-Bronson

Re: NodeManager Health Summary

@Geoffrey Shelton Okot, regarding my last comment, do you have any suggestion for how to find the problematic NodeManager?

Michael-Bronson

Re: NodeManager Health Summary

@Geoffrey Shelton Okot, any suggestion?

Michael-Bronson

Re: NodeManager Health Summary

New Contributor

Go to the ResourceManager UI from Ambari. Click the Nodes link on the left side of the window. It should show all NodeManagers and the reason each is listed as unhealthy.

The most commonly found reasons involve a disk space threshold being reached. In that case, consider the following parameters:

yarn.nodemanager.disk-health-checker.min-healthy-disks (default: 0.25)
The minimum fraction of disks that must be healthy for the NodeManager to launch new containers. This applies to both yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs: if fewer healthy local-dirs (or log-dirs) are available, new containers will not be launched on this node.

yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (default: 90.0)
The maximum percentage of disk space utilization allowed, after which a disk is marked as bad. Values can range from 0.0 to 100.0. If the value is greater than or equal to 100, the NodeManager checks only for a completely full disk. This applies to yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.

yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb (default: 0)
The minimum space that must be available on a disk for it to be used. This applies to yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.
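As a rough sketch, you can compare each YARN directory's usage against the 90% default threshold from the shell. The paths below are assumed defaults; take the real ones from yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs in your yarn-site.xml:

```shell
# Report mount point and usage% for each YARN dir (paths are assumed
# defaults; substitute the values from your yarn-site.xml)
for d in /hadoop/yarn/local /hadoop/yarn/log; do
  df -hP "$d" 2>/dev/null | awk 'NR==2 {print $6, $5}'
done
```

Any directory whose usage prints above 90% would be marked bad by the disk health checker at the default setting.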

Finally, if the above steps do not reveal the actual problem, check the logs at /var/log/hadoop-yarn/yarn.
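On the suspect machine, the NodeManager log usually states the disk checker's verdict directly. A hedged example (the log file name pattern varies by distribution and the user the service runs as, so adjust the glob for your install):

```shell
# Search the NodeManager log for the disk health checker's reason string
# (the file name pattern is an assumption; adjust it for your install)
grep -i "dirs are bad\|unhealthy" /var/log/hadoop-yarn/yarn/*nodemanager*.log | tail -n 20
```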