Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

JMX metric is slow to update. Is this normal?

Expert Contributor

Hello,

I am polling JMX metrics periodically to monitor cluster health.

On my monitoring platform, I saw that one metric is slow to update. The test case is a dead DataNode: I stopped one of the DataNode services in Ambari and expected the value below to change from 0 to 1:

http://namenodeaddress:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState

{
    "name" : "Hadoop:service=NameNode,name=FSNamesystemState",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
    ...
    "NumDeadDataNodes" : 0,
    ...
}
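For readers who want to reproduce this check, here is a minimal Python sketch of polling that endpoint. It is only an illustration: the function names are hypothetical, `namenodeaddress` is the placeholder host from the URL above, and it assumes the JMX servlet wraps the matching bean in a top-level "beans" list (the excerpt above shows only the bean itself).

```python
import json
from urllib.request import urlopen

# Query taken from the question; host and port are placeholders.
JMX_URL = (
    "http://namenodeaddress:50070/jmx"
    "?qry=Hadoop:service=NameNode,name=FSNamesystemState"
)


def parse_dead_datanodes(jmx_json: str) -> int:
    """Return NumDeadDataNodes from a JMX servlet response body."""
    doc = json.loads(jmx_json)
    # Assumption: matching MBeans are returned inside a top-level "beans" list.
    for bean in doc.get("beans", []):
        if "NumDeadDataNodes" in bean:
            return int(bean["NumDeadDataNodes"])
    raise KeyError("NumDeadDataNodes not found in JMX response")


def fetch_dead_datanodes(url: str = JMX_URL, timeout_s: float = 5.0) -> int:
    """Fetch the FSNamesystemState bean over HTTP and extract the counter."""
    with urlopen(url, timeout=timeout_s) as resp:
        return parse_dead_datanodes(resp.read().decode("utf-8"))
```

Calling `fetch_dead_datanodes()` on a schedule and alerting when the result is greater than 0 would mirror the monitoring described here; the parsing is kept in its own function so it can be checked without a running cluster.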

It was updated 6 minutes later, which is a very long time to wait before taking action. However, when I started the service again, the value changed from 1 to 0 as soon as the service was up.

Can someone confirm whether this is the normal update time?

PS: I know Ambari detects this faster; it probably uses another method to detect dead nodes. I need to confirm this before I continue parsing other metrics.

Thanks in advance.

1 ACCEPTED SOLUTION

Master Mentor

@Sedat Kestepe

- The NameNode determines whether a DataNode is dead or alive by using heartbeats.

- Each DataNode sends a Heartbeat message to the NameNode every 3 seconds (default value).

- This heartbeat interval is controlled by the "dfs.heartbeat.interval" property defined in the hdfs-site.xml file.

- If a DataNode dies, the NameNode waits about 10 minutes 30 seconds (with default settings) before removing it from the live nodes.

- The time period for determining whether a DataNode is dead is calculated as

2 * dfs.namenode.heartbeat.recheck-interval + 10 * 1000 * dfs.heartbeat.interval

The default value of "dfs.namenode.heartbeat.recheck-interval" is 300000 milliseconds (5 minutes) and the default value of "dfs.heartbeat.interval" is 3 seconds, which gives 2 * 300000 + 10 * 1000 * 3 = 630000 milliseconds (10 minutes 30 seconds).
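To make the arithmetic concrete, here is a small Python sketch of the calculation. It assumes the expiry rule used by the NameNode in Hadoop 2.x (twice the recheck interval plus ten heartbeat intervals); the function name is hypothetical, and the "tuned" values at the end are purely illustrative, not a recommendation.

```python
def dead_node_timeout_ms(recheck_interval_ms: int = 300_000,
                         heartbeat_interval_s: int = 3) -> int:
    """Time without heartbeats after which a DataNode is marked dead.

    recheck_interval_ms  -> dfs.namenode.heartbeat.recheck-interval (milliseconds)
    heartbeat_interval_s -> dfs.heartbeat.interval (seconds)
    """
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


# Defaults: 2 * 300000 + 10 * 1000 * 3
default_ms = dead_node_timeout_ms()
print(default_ms, default_ms / 60_000)  # 630000 10.5

# Illustrative (hypothetical) tuning: a 45-second recheck interval.
tuned_ms = dead_node_timeout_ms(recheck_interval_ms=45_000)
print(tuned_ms / 1000)  # 120.0
```

Lowering the recheck interval shortens detection time, at the cost of declaring DataNodes dead more readily during brief outages.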


Reference:

- https://github.com/apache/hadoop/blob/release-2.7.3-RC1/hadoop-hdfs-project/hadoop-hdfs/src/main/jav...

- http://pe-kay.blogspot.com/2016/02/dead-datanode-detection.html



3 REPLIES

Expert Contributor

Follow-up comment... Any comments?


Expert Contributor

You are awesome, thank you so much! 🙂

I was only expecting to hear whether the behaviour I saw is normal, but your explanation was like teaching me to fish instead of handing me one. I have now learned how the mechanism works.

Thanks again! 😄