Created 07-21-2017 08:56 AM
Hello,
I poll JMX metrics at regular intervals to monitor cluster health.
When I checked my monitoring platform, I saw that it was slow to update. The case is a dead DataNode: I stopped one of the DataNode services in Ambari and expected the value below to change from 0 to 1:
http://namenodeaddress:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState
{ "name" : "Hadoop:service=NameNode,name=FSNamesystemState", "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem", ... "NumDeadDataNodes" : 0, ... ... }
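For anyone reproducing this check, here is a minimal Python sketch of how the value could be read. It assumes the JMX servlet wraps its results in a top-level "beans" list; the function names `fetch_jmx` and `num_dead_datanodes` are made up for illustration, and the sample payload is hand-written, not real cluster output.

```python
import json
from urllib.request import urlopen

# URL from the post above; "namenodeaddress" is a placeholder host.
JMX_URL = ("http://namenodeaddress:50070/jmx"
           "?qry=Hadoop:service=NameNode,name=FSNamesystemState")

def fetch_jmx(url=JMX_URL, timeout=10):
    """Download and decode the JMX JSON document (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def num_dead_datanodes(payload):
    """Return NumDeadDataNodes from the first bean that carries it, else None."""
    for bean in payload.get("beans", []):
        if "NumDeadDataNodes" in bean:
            return bean["NumDeadDataNodes"]
    return None

# Hand-written sample payload (an assumption about the response shape).
sample = json.loads(
    '{"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystemState",'
    ' "NumDeadDataNodes": 1}]}'
)
print(num_dead_datanodes(sample))
```

Against a live cluster, the call would be `num_dead_datanodes(fetch_jmx())`.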
It was updated 6 minutes later, which is a very long time to wait before taking action. However, when I started the service again, the value changed from 1 to 0 as soon as the service was up.
Can someone confirm whether this is the normal update time?
PS: I know Ambari detects this faster; it probably uses another method to detect dead nodes. I need to verify this before I continue parsing other metrics.
Thanks in advance.
Created 07-25-2017 05:39 PM
- The NameNode determines whether a DataNode is dead or alive by using heartbeats.
- Each DataNode sends a heartbeat message to the NameNode every 3 seconds (default value).
- This heartbeat interval is controlled by the "dfs.heartbeat.interval" property defined in the hdfs-site.xml file.
- If a DataNode dies, the NameNode waits about 10.5 minutes (with default settings) before removing it from the live nodes.
- The time period for determining whether a DataNode is dead is calculated as
2 * dfs.namenode.heartbeat.recheck-interval + 10 * 1000 * dfs.heartbeat.interval
The default value of "dfs.namenode.heartbeat.recheck-interval" is 300000 milliseconds (5 minutes) and the default value of "dfs.heartbeat.interval" is 3 seconds, so the timeout is 2 * 300000 + 10 * 1000 * 3 = 630000 milliseconds.
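As a quick way to check the arithmetic for your own settings, here is a small Python sketch. It assumes the expiry calculation used in the Hadoop NameNode source, which doubles the recheck interval (that doubling is what produces the roughly 10-minute figure); the function name `dead_node_timeout_ms` is made up for illustration.

```python
def dead_node_timeout_ms(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Milliseconds without heartbeats before a DataNode is marked dead.

    recheck_interval_ms  -> dfs.namenode.heartbeat.recheck-interval (ms)
    heartbeat_interval_s -> dfs.heartbeat.interval (seconds)
    """
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s

timeout_ms = dead_node_timeout_ms()
print(timeout_ms)           # 630000
print(timeout_ms / 60_000)  # 10.5 (minutes)
```

Lowering the recheck interval shortens detection: for example, `dead_node_timeout_ms(recheck_interval_ms=15_000)` gives 60000 ms, i.e. one minute.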
Reference:
- http://pe-kay.blogspot.com/2016/02/dead-datanode-detection.html
Created 07-25-2017 01:21 PM
Follow-up comment... Any comments?
Created 07-26-2017 08:47 AM
You are awesome, thank you so much! 🙂
I only expected to learn whether the behaviour I saw was normal, but your explanation was like teaching me to fish instead of handing me one. I have learned how the procedure works instead.
Thanks again! 😄