Node in maintenance mode throws stale alert from management nodes

mtdeguzis — Tue, 07 Feb 2017 21:42:09 GMT

Why is it that in Ambari (2.4.1.0) an alert is thrown for a node as being stale (I just decommissioned it, and it is shut down now) when the server is in maintenance mode? While the node is decommissioning, there is no alert. Is there a way to temporarily take the node out of the equation while it is in maintenance mode, so the management nodes do not complain about the stale node?

Re: Node in maintenance mode throws stale alert from management nodes

jonathanhurley — Tue, 07 Feb 2017 21:47:30 GMT

Can you specify which alert is being triggered? Most likely, it's an alert based on a master service's metrics. For example, if you decommission a DataNode and you place that DataNode into Maintenance Mode, then Ambari won't file alerts for it. However, if the NameNode broadcasts a metric that indicates there's a problem with the liveness of the DataNodes, then Ambari will display that alert.

This is because the master service is running on a separate machine and doesn't care about the maintenance mode of the affected slave. Each service is different: some services understand that a decommission means that the node shouldn't be reported as stale, and some still create the metric to indicate staleness for a short period of time.

Re: Node in maintenance mode throws stale alert from management nodes

mtdeguzis — Tue, 07 Feb 2017 21:51:45 GMT

The alert is "DataNode Health Summary, DataNode Health: [Live=29, Stale=0, Dead=1]". I guess you're right then, there's no way to avoid such a situation. Thank you for the explanation.

Re: Node in maintenance mode throws stale alert from management nodes

jonathanhurley — Tue, 07 Feb 2017 22:03:46 GMT

I think this goes back to the whole "dead is bad" theory. If I recall correctly, there was a metric Ambari was monitoring once on HBase - it was for "Dead RegionServers". We incorrectly assumed that "dead" was "bad". Because of this, while decommissioning a RegionServer, alerts would trigger (and not go away for a long time).

In the end, it was determined that this metric wasn't really something which needed alerting on.

HDFS is a little different. I believe that a DataNode is marked as stale if it hasn't reported in within 30 seconds, and marked as dead only after a longer heartbeat timeout (10 minutes 30 seconds with the default settings). The problem here is that the NameNode takes action in this case: it will begin re-replicating blocks when it believes a DataNode is dead. So, we alert on it, since it's something that is actively causing changes in the cluster data.
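
For reference, both thresholds come from HDFS configuration rather than being fixed values. Here is a minimal sketch of how the NameNode's dead timeout is derived and how a node would be classified, assuming the stock default settings. The function names are made up for illustration, and the exact boundary handling inside the NameNode may differ slightly.

```python
# Stock HDFS defaults (an assumption; check hdfs-site.xml on your cluster).
HEARTBEAT_INTERVAL_S = 3        # dfs.heartbeat.interval (seconds)
RECHECK_INTERVAL_MS = 300_000   # dfs.namenode.heartbeat.recheck-interval (ms)
STALE_INTERVAL_MS = 30_000      # dfs.namenode.stale.datanode.interval (ms)


def dead_timeout_seconds(heartbeat_interval_s=HEARTBEAT_INTERVAL_S,
                         recheck_interval_ms=RECHECK_INTERVAL_MS):
    """Dead timeout = 2 * recheck interval + 10 * heartbeat interval."""
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


def classify(seconds_since_heartbeat,
             stale_interval_ms=STALE_INTERVAL_MS):
    """Hypothetical helper: label a DataNode by its last heartbeat age."""
    if seconds_since_heartbeat > dead_timeout_seconds():
        return "dead"
    if seconds_since_heartbeat >= stale_interval_ms / 1000:
        return "stale"
    return "live"


if __name__ == "__main__":
    print(dead_timeout_seconds())  # seconds until a silent node counts as dead
    for age in (10, 45, 700):
        print(age, classify(age))
```

A node that was shut down right after decommissioning will therefore show up as stale first and as dead only once the longer timeout has passed.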

The NameNode actually has metrics for differentiating "dead" vs "decommissioning dead":

"NumLiveDataNodes": 3,
"NumDeadDataNodes": 1,
"NumDecomLiveDataNodes": 0,
"NumDecomDeadDataNodes": 1,

In the above example, Ambari won't worry about dead nodes which are marked as known decommissioning, but we will worry about those which are unexpected.
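
To make that distinction concrete, here is a rough sketch of the arithmetic in Python. It is not Ambari's actual alert code. The function name is hypothetical, and it assumes that NumDeadDataNodes includes the decommissioned dead nodes, which is what the sample values above suggest.

```python
def unexpected_dead_datanodes(metrics):
    """Count dead DataNodes that are not explained by a decommission.

    `metrics` is a dict of NameNode metric names to values, as in the
    sample above. Missing keys are treated as zero.
    """
    dead = int(metrics.get("NumDeadDataNodes", 0))
    decom_dead = int(metrics.get("NumDecomDeadDataNodes", 0))
    # Assumption: the decommissioned dead nodes are a subset of all dead
    # nodes, so the difference is the number of unexpected ones.
    return max(dead - decom_dead, 0)


if __name__ == "__main__":
    sample = {
        "NumLiveDataNodes": 3,
        "NumDeadDataNodes": 1,
        "NumDecomLiveDataNodes": 0,
        "NumDecomDeadDataNodes": 1,
    }
    # The only dead node is a decommissioned one, so nothing is unexpected.
    print(unexpected_dead_datanodes(sample))
```

With the sample values the result is zero, so a check built this way would stay quiet. If a second DataNode died without being decommissioned, the result would be one.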