How to identify stale datanode?

avatar
New Contributor

The Datanode Health Summary in Ambari Alerts reported 1 stale node. How can I identify which datanode is in the stale state?

1 ACCEPTED SOLUTION

avatar

A datanode is considered stale when:

dfs.namenode.stale.datanode.interval < last contact < (2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval)

In the NameNode UI Datanodes tab, a stale datanode will stand out due to having a larger value for Last contact among live datanodes (also available in JMX output). When a datanode is stale, it will be given lowest priority for reads and writes.
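To answer the original question from the JMX output rather than the UI, here is a minimal sketch. It assumes the NameNodeInfo bean exposes a `LiveNodes` attribute as a JSON-encoded string with a `lastContact` value in seconds per datanode. The host name, the port (50070 is the Hadoop 2 default), the sample payload, and the function names are placeholders for illustration, not values from this thread.

```python
import json
from urllib.request import urlopen

# Placeholder URL: substitute your own NameNode host and HTTP port.
JMX_URL = ("http://namenode.example.com:50070/jmx"
           "?qry=Hadoop:service=NameNode,name=NameNodeInfo")


def fetch_jmx(url=JMX_URL):
    """Download the raw JMX JSON text from the NameNode."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def stale_datanodes(jmx_text, stale_after_s=30):
    """Return (host, lastContact) pairs above the stale interval, worst first."""
    beans = json.loads(jmx_text)["beans"]
    # LiveNodes is itself a JSON document stored as a string attribute.
    live = json.loads(beans[0]["LiveNodes"])
    stale = [(host, info["lastContact"])
             for host, info in live.items()
             if info["lastContact"] > stale_after_s]
    return sorted(stale, key=lambda pair: pair[1], reverse=True)


# Hypothetical payload with the same shape, so the sketch runs offline.
SAMPLE = json.dumps({"beans": [{"LiveNodes": json.dumps({
    "dn1.example.com:50010": {"lastContact": 1},
    "dn2.example.com:50010": {"lastContact": 95},
})}]})

print(stale_datanodes(SAMPLE))
```

Against a real cluster you would call `stale_datanodes(fetch_jmx())`, passing your configured stale interval in seconds as `stale_after_s`.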

Using default values, the namenode will consider a datanode stale when its heartbeat is absent for 30 seconds. After another 10 minutes without a heartbeat (10.5 minutes total), a datanode is considered dead.
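The arithmetic behind those numbers can be sketched as follows. The NameNode computes the dead threshold as 2 * recheck interval + 10 * heartbeat interval, which is where the extra 30 seconds in the 10.5 minutes comes from. The function names are made up for illustration, and the units assume the raw property values (heartbeat in seconds, the other two in milliseconds).

```python
def datanode_thresholds(heartbeat_s=3, stale_ms=30_000, recheck_ms=300_000):
    """Return (stale_after_s, dead_after_s) for the given property values."""
    stale_after_s = stale_ms / 1000
    # Dead threshold: 2 * recheck interval + 10 * heartbeat interval.
    dead_after_s = (2 * recheck_ms + 10 * 1000 * heartbeat_s) / 1000
    return stale_after_s, dead_after_s


def classify(last_contact_s, stale_after_s, dead_after_s):
    """Label a datanode by the number of seconds since its last heartbeat."""
    if last_contact_s > dead_after_s:
        return "dead"
    if last_contact_s > stale_after_s:
        return "stale"
    return "live"


stale_s, dead_s = datanode_thresholds()
print(stale_s, dead_s)                  # 30.0 630.0
print(classify(95, stale_s, dead_s))    # stale
```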

Relevant properties include:

  • dfs.heartbeat.interval - default: 3 seconds
  • dfs.namenode.stale.datanode.interval - default: 30 seconds
  • dfs.namenode.heartbeat.recheck-interval - default: 5 minutes
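  • dfs.namenode.avoid.read.stale.datanode - default: false in Apache Hadoop's hdfs-default.xml (HDP/Ambari deployments commonly set it to true)
  • dfs.namenode.avoid.write.stale.datanode - default: false in Apache Hadoop's hdfs-default.xml (HDP/Ambari deployments commonly set it to true)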
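To check which of these properties a cluster actually overrides, a rough sketch like the one below can read them out of hdfs-site.xml. The sample XML and the function name are hypothetical. Note that in the raw config the stale and recheck intervals are given in milliseconds, while the heartbeat interval is in seconds.

```python
import xml.etree.ElementTree as ET

PROPERTIES = {
    "dfs.heartbeat.interval",
    "dfs.namenode.stale.datanode.interval",
    "dfs.namenode.heartbeat.recheck-interval",
    "dfs.namenode.avoid.read.stale.datanode",
    "dfs.namenode.avoid.write.stale.datanode",
}


def configured_values(hdfs_site_xml):
    """Return the stale-related properties explicitly set in hdfs-site.xml."""
    root = ET.fromstring(hdfs_site_xml)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in PROPERTIES:
            found[name] = (prop.findtext("value") or "").strip()
    return found


# Hypothetical hdfs-site.xml fragment, for illustration only.
SAMPLE = """
<configuration>
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>30000</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""

print(configured_values(SAMPLE))
```

Properties missing from the result are not set in the file, so the cluster falls back to its built-in defaults for them.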

This feature was introduced by HDFS-3703.


8 REPLIES

avatar
Rising Star

The namenode logs should probably say which one it is.

avatar
Master Mentor

@ayusuf@hortonworks.com

This is a good explanation. The namenode will know about the stale DN.

dfs.namenode.stale.datanode.interval

Default time interval for marking a datanode as "stale", i.e., if the namenode has not received heartbeat msg from a datanode for more than this time interval, the datanode will be marked and treated as "stale" by default. The stale interval cannot be too small since otherwise this may cause too frequent change of stale states. We thus set a minimum stale interval value (the default value is 3 times of heartbeat interval) and guarantee that the stale interval cannot be less than the minimum value. A stale data node is avoided during lease/block recovery. It can be conditionally avoided for reads (see dfs.namenode.avoid.read.stale.datanode) and for writes (see dfs.namenode.avoid.write.stale.datanode).
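The minimum-value guarantee described above can be sketched like this. It assumes the minimum is expressed as a multiple of the heartbeat interval (default 3, which I believe is the `dfs.namenode.stale.datanode.minimum.interval` property). The function name is made up for illustration.

```python
def effective_stale_interval_ms(configured_ms=30_000, heartbeat_s=3,
                                min_multiplier=3):
    """Raise the configured stale interval to the enforced minimum if needed."""
    minimum_ms = min_multiplier * heartbeat_s * 1000
    return max(configured_ms, minimum_ms)


print(effective_stale_interval_ms())        # 30000: default is above the minimum
print(effective_stale_interval_ms(5_000))   # 9000: raised to 3 * 3 s
```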

avatar
Master Mentor

Nicely explained! Thanks @Alex Miller

avatar
New Contributor

Thanks Alex. Very good explanation. 😞 I learned this the hard way last night, by bringing the network down (ifdown eth1) on one of the VM nodes while its datanode was up and refreshing the Namenode UI -> Datanodes tab. @Alex Miller

avatar

Thanks, hopefully it will save someone the hassle in the future.

In the future, please leave this as a comment rather than a separate answer.

avatar
New Contributor

I agree. Sorry, I'm using AH for the first time and accidentally clicked reply instead of comment 😞

avatar

No worries, we're all learning it as we go