Created 06-13-2021 10:35 PM
2021-06-09 17:00:54,088 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT
java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2021-06-09 17:00:54,090 INFO ha.HealthMonitor (HealthMonitor.java:enterState(248)) - Entering state SERVICE_NOT_RESPONDING
2021-06-09 17:00:54,090 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(893)) - Local service NameNode at NAMENODE/NAMENODE:PORT entered state: SERVICE_NOT_RESPONDING
2021-06-09 17:00:54,191 WARN tools.DFSZKFailoverController (DFSZKFailoverController.java:getLocalNNThreadDump(249)) - Can't get local NN thread dump due to Server returned HTTP response code: 401 for URL: https://NAMENODE:PORT3/stacks
2021-06-09 17:00:54,191 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(809)) - Quitting master election for NameNode at NAMENODE/NAMENODE:PORT and marking that fencing is necessary
2021-06-09 17:00:54,191 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(412)) - Yielding from election
2021-06-09 17:00:54,192 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x178072757b716f6 closed
2021-06-09 17:00:54,192 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1124)) - Ignoring stale result from old client with sessionId 0x1234567
2021-06-09 17:00:54,192 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
I have two namenodes with HA.
Suddenly a failover occurred, and the above log was found on the previously active NameNode.
I have no idea why SocketTimeoutException was raised while doing doHealthChecks.
Also, regarding the log line "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout": when I look for PORT2 on the NameNode, that port doesn't seem to be in use.
Any comments appreciated.
Created 06-16-2021 07:45 PM
Hi,
I suspect the failover controller (ZKFC) isn't picking up the configured timeout for ha.health-monitor.rpc-timeout.ms, and this is causing the failover to fail.
To speed up the NameNode quota initialization (which can keep the NameNode busy and unresponsive), put the following in the NameNode safety valve for hdfs-site.xml:
dfs.namenode.quota.init-threads = 16
Also raise ha.failover-controller.new-active.rpc-timeout.ms to 90s (90000 ms).
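For reference, a minimal sketch of how these overrides might look as an XML snippet. The property names are standard Hadoop settings, but the ha.* properties normally live in core-site.xml, so the exact safety valve to use may differ in your distribution, and the health-monitor value below is only an assumed example:

<!-- Sketch of the suggested overrides; adjust values to your environment. -->
<property>
  <name>dfs.namenode.quota.init-threads</name>
  <!-- more threads to parallelize quota initialization at NameNode startup -->
  <value>16</value>
</property>
<property>
  <name>ha.failover-controller.new-active.rpc-timeout.ms</name>
  <!-- 90s, up from the 60000 ms default -->
  <value>90000</value>
</property>
<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <!-- assumed example: give the health-check RPC more headroom than the 45000 ms default -->
  <value>90000</value>
</property>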
Try this out .....
Created 06-17-2021 11:45 AM
Hello @sipocootap2
The failover controller log snippet you shared indicates that the HealthMonitor thread on the Active NameNode couldn't fetch the state of the local NameNode (via the health-check RPC) within the "ha.health-monitor.rpc-timeout.ms" timeout period of 45 sec (45000 ms). Since there was no response from the local NN within the timeout period, the NN service entered the "SERVICE_NOT_RESPONDING" state.
NOTE: "The HealthMonitor is a thread which is responsible for monitoring the local NameNode. It operates in a simple loop, calling the monitorHealth RPC. The HealthMonitor maintains a view of the current state of the NameNode based on the responses to these RPCs. When it transitions between states, it sends a message via a callback interface to the ZKFC."
The condition you cited suggests the local NN (the Active NameNode here) went unresponsive/hung or was busy. Hence, after the monitorHealth RPC timed out, the local failover controller (the ZKFC on the Active NN host) triggered a NameNode failover, letting the failover controller on the Standby NameNode host (the Standby NN's ZKFC) promote/transition the local Standby NN to the Active state.
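For context, the 45000 ms in the exception matches the default value of this timeout (defined in core-default.xml); shown below only to connect the log message to the setting:

<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <!-- default health-check RPC timeout used by the ZKFC's HealthMonitor -->
  <value>45000</value>
</property>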
Answers to your queries
Q) I have no idea why SocketTimeoutException was raised while doing doHealthChecks.
Ans) It looks like the Active NN was unresponsive or busy, so the RPC call timed out (surfacing as a SocketTimeoutException).
Q) "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout" log, when I look for PORT2 in the namenode, that port doesn't seem to be used.
Ans) The PORT2 (local=/NAMENODE:PORT2) you see is an ephemeral port (a random client-side port) used by the HealthMonitor RPC to communicate with the local NN service port 8022 (remote=NAMENODE/NAMENODE:PORT). Since the health monitor thread is local to the NN (i.e., it runs on the same node as the NN), you see the NN hostname appearing as both the local and remote endpoint. Because ephemeral ports are assigned per connection and released afterwards, PORT2 will not show up as a listening port on the NameNode.
Ref: https://community.cloudera.com/t5/Support-Questions/Namenode-failover-frequently/td-p/41122
Created 06-23-2021 05:38 AM
@sipocootap2 , have you resolved your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
Created 08-31-2021 11:48 PM
Will formatting the ZKFC (the HA state in ZooKeeper) and restarting the NameNode work, since this issue is basically due to a communication failure between the ZKFC's health-check RPC and the local NameNode?