<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: NameNode and Standbynamenode auto stopped in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194956#M157015</link>
    <description>&lt;P&gt;All 3 ZooKeeper servers and all 3 JournalNodes are running.&lt;/P&gt;</description>
    <pubDate>Fri, 16 Mar 2018 18:25:18 GMT</pubDate>
    <dc:creator>vinayk</dc:creator>
    <dc:date>2018-03-16T18:25:18Z</dc:date>
    <item>
      <title>NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194954#M157013</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have 1 active NameNode, 1 standby NameNode, 4 DataNodes, and 3 JournalNodes.&lt;/P&gt;&lt;P&gt;Before we enabled NameNode HA, the cluster was working fine. Since we enabled HA, the NameNode stops on its own. These are the logs found on the NameNode:&lt;/P&gt;&lt;P&gt;2018-03-16 05:49:34,799 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1648)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-03-16 05:49:35,684 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 16048 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.10.20.5:8485]
2018-03-16 05:49:36,686 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 17050 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.10.20.5:8485]
2018-03-16 05:49:37,688 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 18052 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.10.20.5:8485]
2018-03-16 05:49:37,850 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1648)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-03-16 05:49:38,690 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 19054 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.10.20.5:8485]&lt;/P&gt;&lt;P&gt;2018-03-16 05:49:39,637 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.10.20.5:8485, 10.10.20.15:8485, 10.10.20.13:8485], stream=QuorumOutputStream starting at txid 5667564))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:707)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3722)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:912)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:548)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
2018-03-16 05:49:39,638 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 5667564
2018-03-16 05:49:39,647 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-03-16 05:49:39,663 INFO  namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:&lt;/P&gt;&lt;P&gt;Please assist.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Mar 2018 17:02:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194954#M157013</guid>
      <dc:creator>vinayk</dc:creator>
      <dc:date>2018-03-16T17:02:26Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194955#M157014</link>
      <description>&lt;P&gt;@Vinay K&lt;/P&gt;&lt;P&gt;How many ZooKeeper servers do you have up and running?&lt;/P&gt;&lt;P&gt;Make sure all 3 JournalNodes are running too.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Mar 2018 18:01:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194955#M157014</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2018-03-16T18:01:33Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194956#M157015</link>
      <description>&lt;P&gt;All 3 ZooKeeper servers and all 3 JournalNodes are running.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Mar 2018 18:25:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194956#M157015</guid>
      <dc:creator>vinayk</dc:creator>
      <dc:date>2018-03-16T18:25:18Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194957#M157016</link>
      <description>&lt;P&gt;&lt;EM&gt;@&lt;A href="https://community.hortonworks.com/users/69412/testtest12p.html"&gt;Vinay K&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;It seems the disk the journals are on is just too busy.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;If it's not a large production cluster, can you reboot it? Can you also adjust the following in these two files and retry?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;hdfs-site.xml&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;dfs.qjournal.start-segment.timeout.ms = 90000
dfs.qjournal.select-input-streams.timeout.ms = 90000
dfs.qjournal.write-txns.timeout.ms = 90000&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;core-site.xml&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;ipc.client.connect.timeout = 90000&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;Please report back.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Mar 2018 20:54:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194957#M157016</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2018-03-16T20:54:00Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194958#M157017</link>
      <description>&lt;P&gt;I have applied these settings. Today the active NameNode shut down automatically while the standby NameNode kept running.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Mar 2018 13:57:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194958#M157017</guid>
      <dc:creator>vinayk</dc:creator>
      <dc:date>2018-03-19T13:57:38Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194959#M157018</link>
      <description>&lt;P&gt;&lt;EM&gt;@&lt;A href="https://community.hortonworks.com/users/69412/testtest12p.html"&gt;Vinay K&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Can you confirm that the active and standby NameNodes were running normally and then the active one suddenly went down? Did you do anything in particular before the shutdown?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Is the time on the QJM servers in sync? Is this a test cluster? If so, and you don't risk losing data, you could proceed as follows.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Stop the HDFS service, then start only the JournalNodes (as they will need to be made aware of the formatting).&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;On the first NameNode (as user hdfs), format the NameNode:&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;$ hadoop namenode -format&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;Initialise the shared edits (for the JournalNodes):&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;$ hdfs namenode -initializeSharedEdits -force&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;Force ZooKeeper to reinitialise:&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;$ hdfs zkfc -formatZK -force&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;Restart that first NameNode.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;On the second NameNode, force a sync with the first NameNode:&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;$ hdfs namenode -bootstrapStandby -force&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;On every DataNode, clear the data directory, then restart the HDFS service. This was a very simple step-by-step guide to formatting.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Mar 2018 15:17:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194959#M157018</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2018-03-19T15:17:26Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194960#M157019</link>
      <description>&lt;P&gt;@Vinay K&lt;/P&gt;&lt;P&gt;Problem:&lt;/P&gt;&lt;P&gt;There are two problems in the logs you shared:&lt;/P&gt;&lt;P&gt;1) The active NameNode (ANN) writes the edit logs to its local disk and sends them to the JournalNodes over RPC. The ANN flushes these writes to the JournalNodes, much like a native flush call in Unix/Linux. So the fatal error shown in the log file could be due to one of the following:&lt;/P&gt;&lt;P&gt;a) Per the log, only one JournalNode (10.10.20.5) acknowledged the write; the other two (10.10.20.15 and 10.10.20.13) did not respond within the timeout. The JournalNode process might not be running on those hosts. You can check using:&lt;/P&gt;&lt;PRE&gt;ps -eaf| grep journal&lt;/PRE&gt;&lt;P&gt;b) The RPC port is not actively listening on the JournalNode. Check using the command below:&lt;/P&gt;&lt;PRE&gt;netstat -ntlp |grep  8485&lt;/PRE&gt;&lt;P&gt;c) Stop the firewalld/iptables services on the ANN and on the JournalNodes, to make sure they are not blocking the RPC calls:&lt;/P&gt;&lt;PRE&gt;systemctl stop firewalld&lt;/PRE&gt;&lt;P&gt;d) Another probable cause is that the disk on the affected JournalNode is so busy that writes time out. Check that using the iostat command on Linux/Unix:&lt;/P&gt;&lt;PRE&gt;iostat&lt;/PRE&gt;&lt;P&gt;Look at the I/O for the disk where your edit logs are being saved.&lt;/P&gt;&lt;P&gt;2) The second error follows from the same JournalNode problem. Once you rectify it, I think you will be sorted.&lt;/P&gt;&lt;P&gt;One more thing: please check that the time on all JournalNodes is the same and in sync. If you have the NTP service running on your servers, please check that it is picking up the right time.&lt;/P&gt;&lt;P&gt;You can check the time on these nodes using the date command:&lt;/P&gt;&lt;PRE&gt;date&lt;/PRE&gt;</description>
      <pubDate>Mon, 19 Mar 2018 18:11:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194960#M157019</guid>
      <dc:creator>sandeepksaini</dc:creator>
      <dc:date>2018-03-19T18:11:57Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194961#M157020</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;No activity was performed before the shutdown; it was running normally. Before it shut down, I found these logs on the primary node:&lt;/P&gt;&lt;PRE&gt;2018-03-17 06:51:06,409 INFO namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 'http://slave1.bd-ds.com:8480/getJournal?jid=bddscluster&amp;amp;segmentTxId=6051162&amp;amp;storageInfo=-63%3A1566454145%3A0%3ACID-26a3ddc9-57c2-49c8-848d-3e28567cd7e7' to transaction ID 6051162
2018-03-17 06:51:06,417 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(145)) - Edits file http://slave1.bd-ds.com:8480/getJournal?jid=bddscluster&amp;amp;segmentTxId=6051162&amp;amp;storageInfo=-63%3A1566454145%3A0%3ACID-26a3ddc9-57c2-49c8-848d-3e28567cd7e7, http://slave2.bd-ds.com:8480/getJournal?jid=bddscluster&amp;amp;segmentTxId=6051162&amp;amp;storageInfo=-63%3A1566454145%3A0%3ACID-26a3ddc9-57c2-49c8-848d-3e28567cd7e7, http://slave6.bd-ds.com:8480/getJournal?jid=bddscluster&amp;amp;segmentTxId=6051162&amp;amp;storageInfo=-63%3A1566454145%3A0%3ACID-26a3ddc9-57c2-49c8-848d-3e28567cd7e7 of size 47892 edits # 300 loaded in 0 seconds
2018-03-17 06:51:06,417 INFO ha.EditLogTailer (EditLogTailer.java:doTailEdits(275)) - Loaded 300 edits starting from txid 6051161
2018-03-17 06:53:06,436 INFO ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(323)) - Triggering log roll on remote NameNode
2018-03-17 06:53:26,458 WARN ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(339)) - Unable to trigger a roll of the active NN
java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call From slave0.bd-ds.com/10.10.20.7 to slave1.bd-ds.com:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.10.20.7:51810 remote=slave1.bd-ds.com/10.10.20.12:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:206)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:327)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:386)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:368)
Caused by: java.net.SocketTimeoutException: Call From slave0.bd-ds.com/10.10.20.7 to slave1.bd-ds.com:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.10.20.7:51810 remote=slave1.bd-ds.com/10.10.20.12:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout&lt;/PRE&gt;&lt;P&gt;Some ZooKeeper logs:&lt;/P&gt;&lt;PRE&gt;2018-03-17 06:55:19,565 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1146)) - Session 0x1622ffcd77e0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2018-03-17 06:55:21,461 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server 10.10.20.12/10.10.20.12:2181. Will not attempt to authenticate using SASL (unknown error)
2018-03-17 06:55:22,284 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1140)) - Client session timed out, have not heard from server in 2618ms for sessionid 0x1622ffcd77e0000, closing socket connection and attempting reconnect
2018-03-17 06:55:22,392 FATAL ha.ActiveStandbyElector (ActiveStandbyElector.java:fatalError(695)) - Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2018-03-17 06:55:22,393 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK connection for elector id=324457684 appData=0a0b62646473636c757374657212036e6e311a10736c617665302e62642d64732e636f6d20d43e28d33e cb=Elector callbacks for NameNode at slave0.bd-ds.com/10.10.20.7:8020
2018-03-17 06:55:23,193 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x1622ffcd77e0000 closed
2018-03-17 06:55:23,193 FATAL ha.ZKFailoverController (ZKFailoverController.java:fatalError(374)) - Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2018-03-17 06:55:23,193 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down
2018-03-17 06:55:23,194 INFO ipc.Server (Server.java:stop(2752)) - Stopping server on 8019
2018-03-17 06:55:23,194 INFO ipc.Server (Server.java:run(1069)) - Stopping IPC Server Responder
2018-03-17 06:55:23,194 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(406)) - Yielding from election
2018-03-17 06:55:23,194 INFO ipc.Server (Server.java:run(932)) - Stopping IPC Server listener on 8019
2018-03-17 06:55:23,195 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(832)) - terminateConnection, zkConnectionState = TERMINATED
2018-03-17 06:55:23,195 INFO ha.HealthMonitor (HealthMonitor.java:shutdown(151)) - Stopping HealthMonitor thread
2018-03-17 06:55:23,196 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:terminateConnection(832)) - terminateConnection, zkConnectionState = TERMINATED
2018-03-17 06:55:23,196 FATAL tools.DFSZKFailoverController (DFSZKFailoverController.java:main(193)) - Got a fatal error, exiting now
java.lang.RuntimeException: ZK Failover Controller failed: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
        at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
        at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
        at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
        at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:191)
2018-03-17 06:55:23,205 INFO tools.DFSZKFailoverController (LogAdapter.java:info(45)) - SHUTDOWN_MSG:&lt;/PRE&gt;&lt;P&gt;These are dev servers, and they hold some data.&lt;/P&gt;&lt;P&gt;The time on the QJM servers is in sync.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Mar 2018 19:16:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194961#M157020</guid>
      <dc:creator>vinayk</dc:creator>
      <dc:date>2018-03-19T19:16:54Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194962#M157021</link>
      <description>&lt;P&gt;&lt;EM&gt;@&lt;A href="https://community.hortonworks.com/users/69412/testtest12p.html"&gt;Vinay K&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;I can see 2 things, both related to connectivity:&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Can you validate that your ZKFailoverControllers are running?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;The root cause of a socket timeout is a connectivity failure between the machines.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Check the settings: is this the machine you really wanted to talk to?&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;&lt;EM&gt;- From the machine that is raising the exception, can you resolve the hostname?
- Is that resolved hostname the correct one? Can you ping the remote host?
- Is the target machine running the relevant Hadoop processes?
- Can you telnet to the target host and port?
- Can you telnet to the target host and port from any other machine?
- On the target machine, can you telnet to the port using localhost as the hostname? If this works but external network connections time out, it's usually a firewall issue.&lt;/EM&gt;&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;Please check the above, especially the firewall.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Mar 2018 19:44:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194962#M157021</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2018-03-19T19:44:41Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194963#M157022</link>
      <description>&lt;P&gt;@&lt;A href="https://community.hortonworks.com/users/69412/testtest12p.html"&gt;Vinay K&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks for confirming the time sync.&lt;/P&gt;&lt;P&gt;The probable cause is either your firewall blocking TCP communication on the network, or another process on the host using the ports designated for Hadoop.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Justification&lt;/STRONG&gt;: The first log you shared shows the same RPC connection problem, which is why the edit logs could not be rolled.&lt;/P&gt;&lt;P&gt;The NameNode and ZooKeeper logs you have now shared clearly show the error message "Connection refused". For this type of problem, increasing your timeouts will not help, no matter how high you set them.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Please check the following:&lt;/STRONG&gt;&lt;/U&gt;&lt;/P&gt;&lt;P&gt;a) Check whether firewalld/iptables is turned off. If not, turn it off with the command shared earlier:&lt;/P&gt;&lt;PRE&gt;systemctl stop firewalld&lt;/PRE&gt;&lt;P&gt;b) Check whether you can ping the slave nodes from the master node:&lt;/P&gt;&lt;PRE&gt;ping &amp;lt;ip_address&amp;gt;&lt;/PRE&gt;&lt;P&gt;c) Check whether your host can resolve the hostnames and IP addresses of the slave nodes (ping using the slaves' hostnames).&lt;/P&gt;&lt;P&gt;Run these commands from:&lt;/P&gt;&lt;P&gt;ANN -&amp;gt; JN&lt;/P&gt;&lt;P&gt;ANN -&amp;gt; ZK&lt;/P&gt;&lt;P&gt;ANN -&amp;gt; SNN&lt;/P&gt;&lt;PRE&gt;ping &amp;lt;hostname&amp;gt;
dig -x &amp;lt;ip_address_of_slave&amp;gt;&lt;/PRE&gt;&lt;P&gt;d) Have you checked that the ports are open and not being used by any other process? Use the commands shared in my previous post.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Mar 2018 21:05:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194963#M157022</guid>
      <dc:creator>sandeepksaini</dc:creator>
      <dc:date>2018-03-19T21:05:50Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194964#M157023</link>
      <description>&lt;P&gt;
	Hi,&lt;/P&gt;&lt;P&gt;
	I am able to telnet to and ping the target machine, by IP address and by hostname.&lt;/P&gt;&lt;P&gt;
	The problem may be resolved: today the HA NameNodes are in a running state. I made some changes, which I'll share with you all.&lt;/P&gt;&lt;P&gt;
	1. I have three ZooKeeper servers running. I had moved one ZooKeeper instance from server 1 to server 4, but ha.zookeeper.quorum was still server1.zk.com:2181,server2.zk.com:2181,server3.zk.com:2181, even after restarting all services.&lt;/P&gt;&lt;P&gt;
	I changed it manually to server4.zk.com:2181,server2.zk.com:2181,server3.zk.com:2181.&lt;/P&gt;&lt;P&gt;
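	As a minimal sketch of that change, assuming the property is set in core-site.xml (where Hadoop HA normally reads it) inside the file's existing configuration element, and using the hostnames quoted above:&lt;/P&gt;&lt;PRE&gt;&amp;lt;!-- core-site.xml: ZooKeeper ensemble used for automatic failover --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;ha.zookeeper.quorum&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;server4.zk.com:2181,server2.zk.com:2181,server3.zk.com:2181&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P&gt;The ZKFailoverController on each NameNode host reads this value at startup, so it would most likely need a restart to pick up the new ensemble.&lt;/P&gt;&lt;P&gt;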
	2. I modified the ZooKeeper configuration in zoo.cfg,&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;
	changing the defaults to syncLimit=15, tickTime=4000 and initLimit=30.&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;
	Followers now get 15 × 4000 ms = 60 s to sync.&lt;/P&gt;&lt;P&gt;3. I defined the suggested properties in the configuration files:&lt;/P&gt;&lt;H4&gt;hdfs-site.xml&lt;/H4&gt;&lt;PRE&gt;dfs.qjournal.start-segment.timeout.ms = 90000
dfs.qjournal.select-input-streams.timeout.ms = 90000
dfs.qjournal.write-txns.timeout.ms = 90000&lt;/PRE&gt;&lt;H4&gt;core-site.xml&lt;/H4&gt;&lt;PRE&gt;ipc.client.connect.timeout = 90000
&lt;/PRE&gt;&lt;P&gt;4. Of the three JournalNodes, two are installed on the two NameNode hosts and the third on a DataNode server.&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;Today it seems everything is working fine.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Mar 2018 12:13:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194964#M157023</guid>
      <dc:creator>vinayk</dc:creator>
      <dc:date>2018-03-20T12:13:56Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194965#M157024</link>
      <description>&lt;P&gt;&lt;EM&gt;@&lt;A href="https://community.hortonworks.com/users/69412/testtest12p.html"&gt;Vinay K&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;@&lt;A href="https://community.hortonworks.com/users/46293/sandeeprhct.html"&gt;Sandeep Kumar&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;It's great that your problem has been resolved. It isn't usual for someone to mark their own post as the correct answer when other HCC members, namely Sandeep and I, contributed to it.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;These parts of your solution were suggested by me:&lt;/EM&gt;&lt;/P&gt;&lt;PRE&gt;1. ZooKeeper quorum
3. Changes in the hdfs-site.xml/core-site.xml config
4. JournalNodes (Sandeep too)&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;So, taking the above into account, I guess someone else merited the points &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Mar 2018 17:54:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194965#M157024</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2018-03-20T17:54:12Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194966#M157025</link>
      <description>&lt;P&gt;@Geoffrey Shelton&lt;/P&gt;&lt;P&gt;If you are not comfortable with it, I can remove point 2. I found the 2nd point in the ZooKeeper documentation.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Mar 2018 17:59:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194966#M157025</guid>
      <dc:creator>vinayk</dc:creator>
      <dc:date>2018-03-20T17:59:51Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194967#M157026</link>
      <description>&lt;P&gt;&lt;EM&gt;@&lt;A href="https://community.hortonworks.com/users/69412/testtest12p.html"&gt;Vinay K&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;It's not important to me; it's just a matter of ethics. I am fine with it all &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Mar 2018 18:19:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194967#M157026</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2018-03-20T18:19:54Z</dc:date>
    </item>
    <item>
      <title>Re: NameNode and Standbynamenode auto stopped</title>
      <link>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194968#M157027</link>
      <description>&lt;P&gt;@Geoffrey S. O.&lt;/P&gt;&lt;P&gt;Maybe that point is not meaningful.&lt;/P&gt;&lt;P&gt;Thanks for helping me. You both spent your precious time on this in the community, and I appreciate it.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Mar 2018 18:27:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/NameNode-and-Standbynamenode-auto-stopped/m-p/194968#M157027</guid>
      <dc:creator>vinayk</dc:creator>
      <dc:date>2018-03-20T18:27:05Z</dc:date>
    </item>
  </channel>
</rss>

