Created 10-30-2015 12:38 AM
The Datanode Health Summary in Ambari Alerts reported 1 stale node. How can I identify which datanode is in the stale state?
Created 10-30-2015 04:13 AM
A datanode is considered stale when:
dfs.namenode.stale.datanode.interval < last contact < (2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval)
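In the NameNode UI Datanodes tab, a stale datanode will stand out because its Last contact value is larger than that of the other live datanodes (the same value is available in the JMX output). When a datanode is stale, it is given lowest priority for reads and writes, provided dfs.namenode.avoid.read.stale.datanode and dfs.namenode.avoid.write.stale.datanode are enabled.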
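The JMX check can be scripted. Below is a minimal Python sketch, not a definitive tool. It assumes the NameNode exposes the NameNodeInfo bean at a URL such as http://<namenode-host>:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo (the port and host are assumptions for your cluster), that the bean's LiveNodes attribute is a JSON-encoded string keyed by datanode, and that each entry carries lastContact in seconds. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_namenode_info(jmx_url):
    """Fetch the first bean returned by the JMX query (assumed to be NameNodeInfo)."""
    with urlopen(jmx_url) as resp:
        beans = json.load(resp)["beans"]
    return beans[0]


def find_stale_datanodes(namenode_info, stale_interval_s=30):
    """Return (datanode, lastContact) pairs whose last contact exceeds the stale interval.

    stale_interval_s defaults to 30 s, the default of
    dfs.namenode.stale.datanode.interval (30000 ms).
    """
    # Assumption: LiveNodes is itself a JSON string nested inside the bean.
    live_nodes = json.loads(namenode_info["LiveNodes"])
    stale = [
        (node, info["lastContact"])
        for node, info in live_nodes.items()
        if info["lastContact"] > stale_interval_s
    ]
    # Longest-silent datanode first.
    return sorted(stale, key=lambda pair: pair[1], reverse=True)
```

To use it, pass the result of fetch_namenode_info(...) to find_stale_datanodes(...); any datanode it returns is still listed as live but has been silent for longer than the stale interval.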
Using default values, the namenode will consider a datanode stale when its heartbeat is absent for 30 seconds. After another 10 minutes without a heartbeat (10.5 minutes total), a datanode is considered dead.
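The arithmetic behind those defaults can be sketched in Python. This is an illustration under assumptions, not the NameNode's actual code: it assumes the dead threshold is 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, with defaults of 300000 ms and 3 s respectively, which is what gives 10.5 minutes. The function names are hypothetical.

```python
def datanode_thresholds_ms(stale_interval_ms=30_000,
                           recheck_interval_ms=300_000,
                           heartbeat_interval_s=3):
    """Return (stale_ms, dead_ms) thresholds on time since the last heartbeat."""
    # Assumed expiry rule: two recheck intervals plus ten heartbeat intervals.
    dead_ms = 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s
    return stale_interval_ms, dead_ms


def classify(last_contact_ms, stale_ms, dead_ms):
    """Classify a datanode by how long ago its last heartbeat arrived."""
    if last_contact_ms > dead_ms:
        return "dead"
    if last_contact_ms > stale_ms:
        return "stale"
    return "live"


stale_ms, dead_ms = datanode_thresholds_ms()
print(stale_ms / 1000, "s until stale;", dead_ms / 60_000, "min until dead")
```

With the defaults, a datanode that has been silent for 45 seconds classifies as stale, and one silent for more than 630 seconds classifies as dead.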
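Relevant properties include dfs.namenode.stale.datanode.interval, dfs.namenode.heartbeat.recheck-interval, dfs.namenode.avoid.read.stale.datanode, and dfs.namenode.avoid.write.stale.datanode.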
This feature was introduced by HDFS-3703.
Created 10-30-2015 12:38 AM
The namenode logs should probably say that.
Created 10-30-2015 01:00 AM
@ayusuf@hortonworks.com
This is a good explanation. The namenode will know about the stale DN:
dfs.namenode.stale.datanode.interval: Default time interval for marking a datanode as "stale", i.e., if the namenode has not received a heartbeat message from a datanode for more than this time interval, the datanode will be marked and treated as "stale" by default. The stale interval cannot be too small, since that would cause the stale state to change too frequently. A minimum stale interval is therefore enforced (the default is 3 times the heartbeat interval), and the stale interval is guaranteed not to be less than that minimum. A stale datanode is avoided during lease/block recovery. It can be conditionally avoided for reads (see dfs.namenode.avoid.read.stale.datanode) and for writes (see dfs.namenode.avoid.write.stale.datanode).
Created 10-30-2015 09:43 AM
Nicely explained! Thanks @Alex Miller
Created 10-30-2015 05:31 PM
Thanks Alex. Very good explanation. 😞 I learned this the hard way last night, by bringing the network down (ifdown eth1) on one of the VM nodes while its datanode was up, and then refreshing the Namenode UI -> Datanode tab. @Alex Miller
Created 10-30-2015 06:22 PM
Thanks, hopefully it will save someone the hassle in the future.
In the future, please leave this as a comment rather than a separate answer.
Created 10-30-2015 06:28 PM
I agree. Sorry, I'm using AH for the first time and accidentally clicked reply instead of comment 😞
Created 11-01-2015 12:54 AM
No worries, we're all learning as we go.