NodeManager Health Summary


Hi all,

In the YARN alerts we saw the following critical alarm:

1 NodeManager is unhealthy.

We have 36 data node machines, each running DataNode, Metrics Monitor, and NodeManager.

97501-capture.png

Since one of the data nodes is the problem, we need to find the problematic machine.

Can we get advice on how to find the data node that triggered this alert?

Michael-Bronson

Master Mentor

@Michael Bronson

NodeManager is a slave process of YARN, so you should drill down into YARN. In my case I intentionally brought down one of my NodeManagers so that the problematic NodeManager would show up.

96586-bronson.jpg


Go to the ResourceManager UI and click the Nodes link on the left side of the screen. All your NodeManagers should be listed there, and the reason for a node being listed as unhealthy may be shown there. It is most likely due to the YARN local dirs or log dirs; you may be hitting the disk utilization threshold on them.

Finally, check the logs: look in /var/log/hadoop-yarn/yarn and NOT in /var/log/hadoop/yarn.
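
The node list shown in the ResourceManager UI is also exposed through the ResourceManager REST API at /ws/v1/cluster/nodes, which can be easier to filter when there are many nodes. Below is a minimal sketch in Python, not a definitive tool. The host name resourcemanager.example.com is a placeholder, port 8088 is only the usual default, and the sketch assumes an unsecured (non-Kerberos) web endpoint. The states query parameter may not be honored on older Hadoop releases, so the code also filters on the state field itself.

```python
import json
from urllib.request import urlopen

# Assumption: placeholder ResourceManager address; replace with your own.
RM_URL = "http://resourcemanager.example.com:8088"


def unhealthy_nodes(payload):
    """Return (host, health report) pairs for UNHEALTHY nodes.

    `payload` is the already-parsed JSON body of /ws/v1/cluster/nodes.
    """
    # "nodes" can be null/absent when no node matches the filter.
    nodes = (payload.get("nodes") or {}).get("node") or []
    return [
        (n.get("nodeHostName", n.get("id")), n.get("healthReport", ""))
        for n in nodes
        if n.get("state") == "UNHEALTHY"
    ]


def fetch_unhealthy_nodes(rm_url=RM_URL):
    """Query the ResourceManager and return its unhealthy nodes."""
    url = f"{rm_url}/ws/v1/cluster/nodes?states=UNHEALTHY"
    with urlopen(url, timeout=10) as resp:
        return unhealthy_nodes(json.load(resp))
```

Calling fetch_unhealthy_nodes() and printing each pair should give the host name of the problematic NodeManager together with its health report, if the ResourceManager provides one. From a shell on a cluster node, `yarn node -list -states UNHEALTHY` is another way to get the same list.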


You said "All your NodeManagers should be listed there and the reason for it being listed as unhealthy may be shown here",

but I do not see anything about NodeManager health.

Please see the following:


97503-capture.png

Michael-Bronson


@Geoffrey Shelton Okot, regarding my last comment, do you have any suggestion on how to find the problematic NodeManager?

Michael-Bronson


@Geoffrey Shelton Okot any suggestion?

Michael-Bronson

Explorer

Open the ResourceManager UI from Ambari and click the Nodes link on the left side of the window. It should show all NodeManagers and, for any node listed as unhealthy, the reason.

The most common reason is that a disk space threshold has been reached. In that case, consider the following parameters:

Parameter: yarn.nodemanager.disk-health-checker.min-healthy-disks
Default value: 0.25
Description: The minimum fraction of disks that must be healthy for the NodeManager to launch new containers. This applies to both yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs; i.e. if fewer healthy local-dirs (or log-dirs) than this fraction are available, new containers will not be launched on this node.

Parameter: yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
Default value: 90.0
Description: The maximum percentage of disk space utilization allowed, after which a disk is marked as bad. Values can range from 0.0 to 100.0. If the value is greater than or equal to 100, the NodeManager will check for a full disk. This applies to yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.

Parameter: yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
Default value: 0
Description: The minimum space that must be available on a disk for it to be used. This applies to yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.
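
To see how these three thresholds interact, here is a rough Python sketch that can be run on a suspect node against its local-dirs and log-dirs. It only approximates the checks described above using the default values listed; it is not the NodeManager's actual implementation, and the function names are made up for illustration.

```python
import shutil

# Defaults taken from the parameter list above.
MAX_UTILIZATION_PCT = 90.0    # max-disk-utilization-per-disk-percentage
MIN_FREE_SPACE_MB = 0         # min-free-space-per-disk-mb
MIN_HEALTHY_FRACTION = 0.25   # min-healthy-disks


def disk_is_healthy(total_bytes, used_bytes, free_bytes,
                    max_pct=MAX_UTILIZATION_PCT,
                    min_free_mb=MIN_FREE_SPACE_MB):
    """Approximate per-disk check: utilization and free-space thresholds."""
    if total_bytes <= 0:
        return False
    used_pct = 100.0 * used_bytes / total_bytes
    free_mb = free_bytes / (1024 * 1024)
    return used_pct <= max_pct and free_mb >= min_free_mb


def check_dirs(dirs):
    """Map each directory to True/False using the disk it lives on."""
    results = {}
    for d in dirs:
        usage = shutil.disk_usage(d)
        results[d] = disk_is_healthy(usage.total, usage.used, usage.free)
    return results


def enough_healthy(results, min_fraction=MIN_HEALTHY_FRACTION):
    """True if the healthy share of directories meets min-healthy-disks."""
    if not results:
        return False
    return sum(results.values()) / len(results) >= min_fraction
```

Pass check_dirs() the directories configured in yarn.nodemanager.local-dirs (and, separately, yarn.nodemanager.log-dirs) on the node in question; any directory mapped to False is on a disk that is probably over the utilization limit or under the free-space limit.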

As a final step, if the steps above do not reveal the actual problem, check the logs under /var/log/hadoop-yarn/yarn.