question ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT in Support Questions

ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT

sipocootap2 — Mon, 14 Jun 2021 05:35:06 GMT

2021-06-09 17:00:54,088 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout 2021-06-09 17:00:54,090 INFO ha.HealthMonitor (HealthMonitor.java:enterState(248)) - Entering state SERVICE_NOT_RESPONDING 2021-06-09 17:00:54,090 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(893)) - Local service NameNode at NAMENODE/NAMENODE:PORT entered state: SERVICE_NOT_RESPONDING 2021-06-09 17:00:54,191 WARN tools.DFSZKFailoverController (DFSZKFailoverController.java:getLocalNNThreadDump(249)) - Can't get local NN thread dump due to Server returned HTTP response code: 401 for URL: https://NAMENODE:PORT3/stacks 2021-06-09 17:00:54,191 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(809)) - Quitting master election for NameNode at NAMENODE/NAMENODE:PORT and marking that fencing is necessary 2021-06-09 17:00:54,191 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(412)) - Yielding from election 2021-06-09 17:00:54,192 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x178072757b716f6 closed 2021-06-09 17:00:54,192 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1124)) - Ignoring stale result from old client with sessionId 0x1234567 2021-06-09 17:00:54,192 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down

I have two namenodes with HA.

Suddenly a failover occured, and the above log was found in the previous active namenode.

I have no idea why SocketTimeoutException was raised while doing doHealthChecks.

Also, regarding "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout" log, when I look for PORT2 in the namenode, that port doesn't seem to be used.

Any comments appreciated.

Re: ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT

ChethanYM — Thu, 17 Jun 2021 02:45:48 GMT

Hi,

Suspecting the foc isn't picking up the configured timeout for ha.health-monitor.rpc-timeout.ms and this is causing the failover to fail.

To speed up this quota calculation put the following in the NameNode safety valve for hdfs-site.xml:

dfs.namenode.quota.init-threads = 16

ha.failover-controller.new-active.rpc-timeout.ms to 90s

Try this out .....

Re: ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT

PabitraDas — Thu, 17 Jun 2021 18:45:41 GMT

Hello @sipocootap2

The failover controller log snippet you shared here indicating the HealthMonitor thread on Active NameNode couldn't fetch the state of the local NameNode (via health check RPC) within "ha.health-monitor.rpc-timeout.ms" timeout period of 45sec (45000ms). Since there is no response within the timeout period from the local NN, the NN service entered into the "SERVICE_NOT_RESPONDING" state.

NOTE: "The HealthMonitor is a thread which is responsible for monitoring the local NameNode. It operates in a simple loop, calling the monitorHealth RPC. The HealthMonitor maintains a view of the current state of the NameNode based on the responses to these RPCs. When it transitions between states, it sends a message via a callback interface to the ZKFC."

The condition you cited here suggests the local NN (Active NameNode here) went unresponsive/hung or busy. Hence the local FailoverController (activeNN_zkfc) triggered a NN failover after monitorHealth RPC timed out and suggest the Standby NameNode host failover controller (SbNN_zkfc) to promote/transition local standby NN to Active State.

Answers to your query

Q) I have no idea why SocketTimeoutException was raised while doing doHealthChecks.

Ans) Looks like Active NN was unresponsive or busy, hence the RPC call was timed out (marked with socket timeout exception)

Q) "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout" log, when I look for PORT2 in the namenode, that port doesn't seem to be used.

Ans) The PORT2 (local=/NAMENODE:PORT2) you see is an ephemeral port (any random port) used by HealthMonitor RPC to communicate with local NN service port 8022 ( remote=NAMENODE/NAMENODE:PORT). Since health monitor thread is local to NN means running on same node as NN, you see NN hostname appearing as both local and remote endpoint.

Ref: https://community.cloudera.com/t5/Support-Questions/Namenode-failover-frequently/td-p/41122

Re: ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT

cjervis — Wed, 23 Jun 2021 12:38:46 GMT

@sipocootap2 , have you resolved your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.

Re: ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT

singhvNt — Wed, 01 Sep 2021 06:48:04 GMT

Will formatting zkfc and restarting namenode work as this issue is basically due to communication failure between HealthcheckRPC of zkfc and local namenode.