Support Questions

ayusuf · ‎10-30-2015

The Datanode Health Summary alert in Ambari reported 1 stale node. How can I identify which datanode is in the stale state?

amiller · ‎10-30-2015

A datanode is considered stale when:

dfs.namenode.stale.datanode.interval < last contact < (2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval)

In the Datanodes tab of the NameNode UI, a stale datanode stands out because its Last contact value is larger than that of the other live datanodes (the same value is available in the JMX output). A stale datanode is given the lowest priority for reads and writes when the corresponding avoid.read/avoid.write properties are enabled.
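
One way to script this check is to read the NameNode's JMX output and compare each live datanode's last contact with the stale interval. The sketch below is only an illustration: it assumes the NameNodeInfo bean exposes a LiveNodes attribute as a JSON-encoded string with a lastContact field in seconds, and that the NameNode web UI listens on port 50070. The host names, the function names, and the sample payload are made up.

```python
import json
from urllib.request import urlopen

# Assumed JMX query for the NameNodeInfo bean; adjust for your cluster.
JMX_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"


def fetch_namenode_info(base_url):
    """Fetch the bean, e.g. base_url='http://namenode.example.com:50070' (hypothetical host)."""
    with urlopen(base_url + JMX_QUERY, timeout=10) as resp:
        return json.load(resp)


def stale_datanodes(jmx, stale_interval_s=30):
    """Return (host, lastContact) pairs whose last contact exceeds the stale interval."""
    bean = jmx["beans"][0]
    live = json.loads(bean["LiveNodes"])  # LiveNodes is itself a JSON-encoded string
    stale = [(host, info["lastContact"]) for host, info in live.items()
             if info["lastContact"] > stale_interval_s]
    # Largest last-contact value first, mirroring how it stands out in the UI.
    return sorted(stale, key=lambda pair: pair[1], reverse=True)


# Made-up payload in the assumed shape, so the parsing can be tried offline.
sample = {"beans": [{"LiveNodes": json.dumps({
    "dn1.example.com:50010": {"lastContact": 2},
    "dn2.example.com:50010": {"lastContact": 47},
})}]}
print(stale_datanodes(sample))
```

Against a real cluster, stale_datanodes(fetch_namenode_info(base_url)) would do the same with live data, provided the JMX layout matches the assumption above.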

With default values, the NameNode considers a datanode stale once no heartbeat has arrived for 30 seconds. After another 10 minutes without a heartbeat (10.5 minutes in total, i.e. 2 * 5 minutes + 10 * 3 seconds), the datanode is considered dead.

Relevant properties include:

dfs.heartbeat.interval - default: 3 seconds
dfs.namenode.stale.datanode.interval - default: 30 seconds
dfs.namenode.heartbeat.recheck-interval - default: 5 minutes
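dfs.namenode.avoid.read.stale.datanode - default: false in Apache Hadoop (Ambari-managed HDP clusters typically set it to true)
dfs.namenode.avoid.write.stale.datanode - default: false in Apache Hadoop (Ambari-managed HDP clusters typically set it to true)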
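
To make the thresholds concrete, here is a minimal sketch that classifies a datanode from the number of seconds since its last heartbeat. The function name and parameters are hypothetical; the dead threshold is taken as 2 * recheck interval + 10 * heartbeat interval, which gives the 10.5 minutes quoted above with default values.

```python
def classify_datanode(last_contact_s,
                      heartbeat_interval_s=3,
                      stale_interval_s=30,
                      recheck_interval_s=300):
    """Classify a datanode as 'live', 'stale' or 'dead' from seconds since last heartbeat."""
    # Dead threshold: 2 * recheck interval + 10 * heartbeat interval (630 s with defaults).
    dead_after_s = 2 * recheck_interval_s + 10 * heartbeat_interval_s
    if last_contact_s > dead_after_s:
        return "dead"
    if last_contact_s > stale_interval_s:
        return "stale"
    return "live"


for seconds in (5, 45, 700):
    print(seconds, classify_datanode(seconds))
```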

This feature was introduced by HDFS-3703.

ssingla · ‎10-30-2015

The NameNode logs should probably show that.

nsabharwal · ‎10-30-2015

@ayusuf@hortonworks.com

This is a good explanation. The NameNode will know about the stale DN.

dfs.namenode.stale.datanode.interval

Default time interval for marking a datanode as "stale", i.e., if the namenode has not received heartbeat msg from a datanode for more than this time interval, the datanode will be marked and treated as "stale" by default. The stale interval cannot be too small since otherwise this may cause too frequent change of stale states. We thus set a minimum stale interval value (the default value is 3 times of heartbeat interval) and guarantee that the stale interval cannot be less than the minimum value. A stale data node is avoided during lease/block recovery. It can be conditionally avoided for reads (see dfs.namenode.avoid.read.stale.datanode) and for writes (see dfs.namenode.avoid.write.stale.datanode).
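
The minimum described above can be sketched as a simple clamp. This is an illustration under assumptions, not the NameNode's actual code: it assumes the minimum is a plain multiple of dfs.heartbeat.interval (3 by default, as the description says), and the function and parameter names are made up.

```python
def effective_stale_interval_s(configured_stale_s=30,
                               heartbeat_interval_s=3,
                               minimum_multiple=3):
    """Clamp the configured stale interval to at least minimum_multiple heartbeats."""
    minimum_s = minimum_multiple * heartbeat_interval_s
    return max(configured_stale_s, minimum_s)


print(effective_stale_interval_s())                      # default settings
print(effective_stale_interval_s(configured_stale_s=5))  # too small, raised to the minimum
```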

nsabharwal · ‎10-30-2015

Nicely explained! Thanks @Alex Miller

ayusuf · ‎10-30-2015

Thanks Alex. Very good explanation. 😞 I learned this the hard way last night, by bringing the network down (ifdown eth1) on one of the VM nodes while its datanode was up and refreshing the NameNode UI -> Datanodes tab. @Alex Miller

amiller · ‎10-30-2015

Thanks, hopefully it will save someone the hassle in the future.

In the future, please leave this as a comment rather than a separate answer.

ayusuf · ‎10-30-2015

I agree. Sorry, I'm using AH for the first time and accidentally clicked reply instead of comment 😞

amiller · ‎11-01-2015

No worries, we're all learning it as we go.
