<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/318375#M227451</link>
    <description>&lt;LI-CODE lang="markup"&gt;2021-06-09 17:00:54,088 WARN  ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT
java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

2021-06-09 17:00:54,090 INFO  ha.HealthMonitor (HealthMonitor.java:enterState(248)) - Entering state SERVICE_NOT_RESPONDING
2021-06-09 17:00:54,090 INFO  ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(893)) - Local service NameNode at NAMENODE/NAMENODE:PORT entered state: SERVICE_NOT_RESPONDING
2021-06-09 17:00:54,191 WARN  tools.DFSZKFailoverController (DFSZKFailoverController.java:getLocalNNThreadDump(249)) - Can't get local NN thread dump due to Server returned HTTP response code: 401 for URL: https://NAMENODE:PORT3/stacks
2021-06-09 17:00:54,191 INFO  ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(809)) - Quitting master election for NameNode at NAMENODE/NAMENODE:PORT and marking that fencing is necessary
2021-06-09 17:00:54,191 INFO  ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(412)) - Yielding from election
2021-06-09 17:00:54,192 INFO  zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x178072757b716f6 closed
2021-06-09 17:00:54,192 WARN  ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1124)) - Ignoring stale result from old client with sessionId 0x1234567
2021-06-09 17:00:54,192 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have two NameNodes with HA.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;A failover suddenly occurred, and the log above was found on the previously active NameNode.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have no idea why a SocketTimeoutException was raised during doHealthChecks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, regarding the "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: &lt;A href="http://wiki.apache.org/hadoop/SocketTimeout" target="_blank"&gt;http://wiki.apache.org/hadoop/SocketTimeout&lt;/A&gt;"&amp;nbsp;log: when I look for PORT2 on the NameNode, that port doesn't seem to be in use.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any comments appreciated.&lt;/P&gt;</description>
    <pubDate>Mon, 14 Jun 2021 05:35:06 GMT</pubDate>
    <dc:creator>sipocootap2</dc:creator>
    <dc:date>2021-06-14T05:35:06Z</dc:date>
    <item>
      <title>ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/318375#M227451</link>
      <description>&lt;LI-CODE lang="markup"&gt;2021-06-09 17:00:54,088 WARN  ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT
java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

2021-06-09 17:00:54,090 INFO  ha.HealthMonitor (HealthMonitor.java:enterState(248)) - Entering state SERVICE_NOT_RESPONDING
2021-06-09 17:00:54,090 INFO  ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(893)) - Local service NameNode at NAMENODE/NAMENODE:PORT entered state: SERVICE_NOT_RESPONDING
2021-06-09 17:00:54,191 WARN  tools.DFSZKFailoverController (DFSZKFailoverController.java:getLocalNNThreadDump(249)) - Can't get local NN thread dump due to Server returned HTTP response code: 401 for URL: https://NAMENODE:PORT3/stacks
2021-06-09 17:00:54,191 INFO  ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(809)) - Quitting master election for NameNode at NAMENODE/NAMENODE:PORT and marking that fencing is necessary
2021-06-09 17:00:54,191 INFO  ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(412)) - Yielding from election
2021-06-09 17:00:54,192 INFO  zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x178072757b716f6 closed
2021-06-09 17:00:54,192 WARN  ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1124)) - Ignoring stale result from old client with sessionId 0x1234567
2021-06-09 17:00:54,192 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have two NameNodes with HA.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;A failover suddenly occurred, and the log above was found on the previously active NameNode.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have no idea why a SocketTimeoutException was raised during doHealthChecks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, regarding the "java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see: &lt;A href="http://wiki.apache.org/hadoop/SocketTimeout" target="_blank"&gt;http://wiki.apache.org/hadoop/SocketTimeout&lt;/A&gt;"&amp;nbsp;log: when I look for PORT2 on the NameNode, that port doesn't seem to be in use.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any comments appreciated.&lt;/P&gt;</description>
      <pubDate>Mon, 14 Jun 2021 05:35:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/318375#M227451</guid>
      <dc:creator>sipocootap2</dc:creator>
      <dc:date>2021-06-14T05:35:06Z</dc:date>
    </item>
    <item>
      <title>Re: ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/318847#M227558</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P class="p1"&gt;I suspect the failover controller isn't picking up the configured timeout for ha.health-monitor.rpc-timeout.ms, and that this is causing the failover to fail.&lt;/P&gt;&lt;P class="p2"&gt;To speed up the quota calculation, put the following in the NameNode safety valve for hdfs-site.xml:&lt;/P&gt;&lt;P class="p1"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p2"&gt;dfs.namenode.quota.init-threads = 16&lt;/P&gt;&lt;P class="p1"&gt;ha.failover-controller.new-active.rpc-timeout.ms = 90000 (90 seconds)&lt;/P&gt;&lt;P class="p1"&gt;&amp;nbsp;&lt;/P&gt;&lt;P class="p1"&gt;Please try these settings.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jun 2021 02:45:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/318847#M227558</guid>
      <dc:creator>ChethanYM</dc:creator>
      <dc:date>2021-06-17T02:45:48Z</dc:date>
    </item>
    <item>
      <title>Re: ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/318887#M227582</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/83034"&gt;@sipocootap2&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The failover controller log snippet you shared indicates that the&amp;nbsp;&lt;I&gt;&lt;STRONG&gt;HealthMonitor &lt;/STRONG&gt;&lt;/I&gt;thread on the Active NameNode couldn't fetch the state of the local NameNode (via the&amp;nbsp;&lt;SPAN&gt;health check RPC) within the &lt;STRONG&gt;"ha.health-monitor.rpc-timeout.ms"&lt;/STRONG&gt; timeout period of &lt;EM&gt;&lt;STRONG&gt;45 sec (45000 ms)&lt;/STRONG&gt;&lt;/EM&gt;. Since there was no response from the local NN within the timeout period, the NN service entered the "SERVICE_NOT_RESPONDING" state.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;I&gt;&lt;STRONG&gt;NOTE:&lt;/STRONG&gt; "The HealthMonitor is a thread which is responsible for monitoring the local NameNode. It operates in a simple loop, calling the monitorHealth RPC. The HealthMonitor maintains a view of the current state of the NameNode based on the responses to these RPCs. When it transitions between states, it sends a message via a callback interface to the ZKFC."&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The condition you cited suggests the local NN (the Active NameNode here) became unresponsive, hung, or busy. 
Hence the local FailoverController (activeNN_zkfc) triggered an NN failover after the&amp;nbsp;&lt;STRONG&gt;&lt;I&gt;monitorHealth RPC timed out&amp;nbsp;&lt;/I&gt;&lt;/STRONG&gt;and prompted the failover controller on the Standby NameNode host (SbNN_zkfc) to transition its local standby NN to the Active state.&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Answers to your questions:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;Q&lt;/STRONG&gt;)&amp;nbsp;I have no idea why SocketTimeoutException was raised while doing doHealthChecks.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;Ans&lt;/STRONG&gt;) It looks like the Active NN was unresponsive or busy, so the RPC call timed out (reported as a socket timeout exception).&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;Q&lt;/STRONG&gt;)&amp;nbsp;&amp;nbsp;"java.net.SocketTimeoutException: Call From NAMENODE/NAMENODE to NAMENODE:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/NAMENODE:PORT2 remote=NAMENODE/NAMENODE:PORT]; For more details see:&amp;nbsp;&lt;A href="http://wiki.apache.org/hadoop/SocketTimeout" target="_blank" rel="nofollow noopener noreferrer"&gt;http://wiki.apache.org/hadoop/SocketTimeout&lt;/A&gt;"&amp;nbsp;log, when I look for PORT2 in the namenode, that port doesn't seem to be used.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;Ans&lt;/STRONG&gt;) The PORT2 (local=/NAMENODE:&lt;STRONG&gt;PORT2&lt;/STRONG&gt;) you see is an &lt;STRONG&gt;ephemeral port&lt;/STRONG&gt; (any random port) used by the HealthMonitor RPC to communicate with the local NN service port &lt;STRONG&gt;8022&lt;/STRONG&gt; (remote=NAMENODE/NAMENODE:&lt;STRONG&gt;PORT&lt;/STRONG&gt;). 
Since the health monitor thread is local to the NN, that is, it runs on the same node as the NN, you see the NN hostname appearing as both the local and the remote endpoint. An ephemeral port exists only for the lifetime of its connection, which is why you no longer find it in use on the NameNode.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Ref:&amp;nbsp;&lt;A href="https://community.cloudera.com/t5/Support-Questions/Namenode-failover-frequently/td-p/41122" target="_blank"&gt;https://community.cloudera.com/t5/Support-Questions/Namenode-failover-frequently/td-p/41122&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jun 2021 18:45:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/318887#M227582</guid>
      <dc:creator>PabitraDas</dc:creator>
      <dc:date>2021-06-17T18:45:41Z</dc:date>
    </item>
    <item>
      <title>Re: ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/319183#M227710</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/83034"&gt;@sipocootap2&lt;/a&gt;&amp;nbsp;,&amp;nbsp;have you resolved your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="cjervis_0-1624451908398.png" style="width: 400px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/31641i3E8A1F0DB9387C79/image-size/medium?v=v2&amp;amp;px=400" role="button" title="cjervis_0-1624451908398.png" alt="cjervis_0-1624451908398.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 23 Jun 2021 12:38:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/319183#M227710</guid>
      <dc:creator>cjervis</dc:creator>
      <dc:date>2021-06-23T12:38:46Z</dc:date>
    </item>
    <item>
      <title>Re: ha.HealthMonitor (HealthMonitor.java:doHealthChecks(210)) - Transport-level exception trying to monitor health of NameNode at NAMENODE/NAMENODE:PORT</title>
      <link>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/323462#M229122</link>
      <description>&lt;P&gt;Would formatting ZKFC and restarting the NameNode work, since this issue is basically due to a communication failure between the ZKFC's health check RPC and the local NameNode?&lt;/P&gt;</description>
      <pubDate>Wed, 01 Sep 2021 06:48:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/ha-HealthMonitor-HealthMonitor-java-doHealthChecks-210/m-p/323462#M229122</guid>
      <dc:creator>singhvNt</dc:creator>
      <dc:date>2021-09-01T06:48:04Z</dc:date>
    </item>
  </channel>
</rss>

