Support Questions


ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT

Contributor
2021-06-09 17:00:54,088 WARN  ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT
java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

2021-06-09 17:00:54,090 INFO  ha.HealthMonitor (HealthMonitor.java:enterState(248)) - Entering state SERVICE_NOT_RESPONDING
2021-06-09 17:00:54,090 INFO  ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(893)) - Local service NameNode at NAMENODE/NAMENODE:PORT entered state: SERVICE_NOT_RESPONDING
2021-06-09 17:00:54,191 WARN  tools.DFSZKFailoverController (DFSZKFailoverController.java:getLocalNNThreadDump(249)) - Can't get local NN thread dump due to Server returned HTTP response code: 401 for URL: https://NAMENODE:PORT3/stacks
2021-06-09 17:00:54,191 INFO  ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(809)) - Quitting master election for NameNode at NAMENODE/NAMENODE:PORT and marking that fencing is necessary
2021-06-09 17:00:54,191 INFO  ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(412)) - Yielding from election
2021-06-09 17:00:54,192 INFO  zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x178072757b716f6 closed
2021-06-09 17:00:54,192 WARN  ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1124)) - Ignoring stale result from old client with sessionId 0x1234567
2021-06-09 17:00:54,192 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down

 

I have two namenodes with HA.

 

Suddenly a failover occurred, and the above log was found in the previously active NameNode.

 

I have no idea why a SocketTimeoutException was raised during doHealthChecks.

 

Also, regarding the log line "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout": when I look for PORT2 on the NameNode host, that port doesn't seem to be in use.

 

Any comments appreciated.

4 REPLIES

Master Collaborator

Hi,

I suspect the failover controller (ZKFC) isn't picking up the configured timeout for ha.health-monitor.rpc-timeout.ms, and this is causing the failover to fail.

To speed up the NameNode's quota initialization (which can delay the NameNode from becoming responsive), put the following in the NameNode safety valve for hdfs-site.xml:

 

dfs.namenode.quota.init-threads = 16

ha.failover-controller.new-active.rpc-timeout.ms = 90000 (i.e. 90s)
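
For reference, in raw hdfs-site.xml form the two entries above would look roughly like the snippet below (the values are just the ones suggested here, and the rpc-timeout property is expressed in milliseconds, so 90s = 90000; where exactly the safety valve lives depends on your distribution):

<property>
  <name>dfs.namenode.quota.init-threads</name>
  <value>16</value>
</property>
<property>
  <name>ha.failover-controller.new-active.rpc-timeout.ms</name>
  <value>90000</value>
</property>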

 

Try this out .....

Expert Contributor

Hello @sipocootap2 

The failover controller log snippet you shared indicates that the HealthMonitor thread on the Active NameNode couldn't fetch the state of the local NameNode (via the health-check RPC) within the "ha.health-monitor.rpc-timeout.ms" timeout period of 45 sec (45000 ms). Since there was no response from the local NN within that period, the NN service entered the "SERVICE_NOT_RESPONDING" state.
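
If the NameNode was merely slow to answer (for example long GC pauses or heavy RPC load) rather than truly hung, one option is to raise that timeout. A minimal sketch, assuming the override goes into core-site.xml (or the equivalent safety valve) and using 90000 ms purely as an example value:

<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>90000</value>
</property>

Note that a larger timeout only buys time; it does not address whatever made the NN unresponsive in the first place.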

 

NOTE: "The HealthMonitor is a thread which is responsible for monitoring the local NameNode. It operates in a simple loop, calling the monitorHealth RPC. The HealthMonitor maintains a view of the current state of the NameNode based on the responses to these RPCs. When it transitions between states, it sends a message via a callback interface to the ZKFC."

 

The condition you cited suggests the local NN (the Active NameNode here) went unresponsive, hung, or busy. Hence, after the monitorHealth RPC timed out, the local failover controller (activeNN_zkfc) triggered a NN failover and asked the Standby NameNode host's failover controller (SbNN_zkfc) to promote/transition the local standby NN to the Active state.

 

Answers to your query

Q) I have no idea why SocketTimeoutException was raised while doing doHealthChecks.

 

Ans) It looks like the Active NN was unresponsive or busy, hence the RPC call timed out (surfacing as a SocketTimeoutException).

 

Q)  "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout" log, when I look for PORT2 in the namenode, that port doesn't seem to be used.

 

Ans) The PORT2 (local=/NAMENODE:PORT2) you see is an ephemeral port (a random, short-lived port assigned by the OS) used by the HealthMonitor RPC client to communicate with the local NN service port 8022 (remote=NAMENODE/NAMENODE:PORT). Because the health monitor thread is local to the NN, i.e. it runs on the same node as the NN, you see the NN hostname appearing as both the local and the remote endpoint. Since the port is ephemeral, it is released once the connection closes, so it won't necessarily show up as in use when you check the node afterwards.
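
If you want to confirm that PORT2 falls in the OS's ephemeral (client-side) port range, on Linux you can check the range itself (the exact bounds vary per system):

cat /proc/sys/net/ipv4/ip_local_port_range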

 

Ref: https://community.cloudera.com/t5/Support-Questions/Namenode-failover-frequently/td-p/41122

Community Manager

@sipocootap2 , have you resolved your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. 



Cy Jervis, Manager, Community Program

Explorer

Will formatting ZKFC and restarting the NameNode work, since this issue is basically due to a communication failure between the ZKFC health-check RPC and the local NameNode?