Created 02-07-2017 01:42 PM
Why is it that in Ambari (2.4.1.0), an alert is thrown for a node as being stale (I just decommissioned it, it is shut down now), when the server is in maintenance mode? While the node is decommissioning, there is no alert. Is there a way to temporarily take the node out of the equation if it is in maintenance mode, so the management nodes do not complain about the stale node?
Created 02-07-2017 01:47 PM
Can you specify which alert is being triggered? Most likely, it's an alert based on a master service's metrics. For example, if you decommission a DataNode and you place that DataNode into Maintenance Mode, then Ambari won't file alerts for it. However, if the NameNode broadcasts a metric that indicates there's a problem with the liveliness of the DataNodes, then Ambari will display that alert.
This is because the master service is running on a separate machine and doesn't care about the maintenance mode of the affected slave. Each service is different - some services understand that a decommission means that the node shouldn't be stale and some still create the metric to indicate staleness for a short period of time.
Created 02-07-2017 01:51 PM
The alert is "DataNode Health Summary, DataNode Health: [Live=29, Stale=0, Dead=1]". I guess then you're right, no way to avoid such a situation. Thank you for the explanation.
Created 02-07-2017 02:03 PM
I think this goes back to the whole "dead is bad" theory. If I recall correctly, there was a metric Ambari was monitoring once on HBase - it was for "Dead RegionServers". We incorrectly assumed that "dead" was "bad". Because of this, while decommissioning a RegionServer, alerts would trigger (and not go away for a long time).
In the end, it was determined that this metric wasn't really something which needed alerting on.
HDFS is a little different - I believe that a DataNode is marked as stale if it hasn't reported in within 30 seconds, and marked as dead once it has missed heartbeats for considerably longer (around 10 minutes with the default HDFS settings). The problem here is that action is taken by the NameNode in this case - it will begin re-replicating blocks when it believes a DataNode is dead. So, we alert on it since it's something that is actively causing changes in the cluster data.
The NameNode actually has metrics for differentiating "dead" vs "decommissioning dead":
"NumLiveDataNodes": 3, "NumDeadDataNodes": 1, "NumDecomLiveDataNodes": 0, "NumDecomDeadDataNodes": 1,
In the above example, Ambari won't worry about dead nodes which are marked as known decommissioning, but we will worry about those which are unexpected.
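To make that distinction concrete, here is a rough sketch of how the two counters could be combined. This is not Ambari's actual alert code; the function name `unexpected_dead_datanodes` is made up for illustration, and it simply assumes you already have the NameNode metrics as a dict shaped like the snippet above.

```python
import json


def unexpected_dead_datanodes(metrics):
    """Return how many dead DataNodes are NOT accounted for by decommissioning.

    `metrics` is assumed to be a dict holding the NameNode counters shown
    above. Missing keys are treated as 0 (an assumption of this sketch).
    """
    dead = int(metrics.get("NumDeadDataNodes", 0))
    decom_dead = int(metrics.get("NumDecomDeadDataNodes", 0))
    # Dead nodes that were decommissioned are expected, so subtract them.
    return max(dead - decom_dead, 0)


# The metric values quoted in the post above.
sample = json.loads(
    '{"NumLiveDataNodes": 3, "NumDeadDataNodes": 1, '
    '"NumDecomLiveDataNodes": 0, "NumDecomDeadDataNodes": 1}'
)

print(unexpected_dead_datanodes(sample))  # 0: the only dead node was decommissioned
```

With the values from the post, the single dead node is also counted as "decommissioned dead", so nothing unexpected remains. If `NumDeadDataNodes` were 2 with the same `NumDecomDeadDataNodes`, the result would be 1, which is the kind of case worth alerting on.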