1) zkfc log at :
Shows that the ZKFC's health check of the local NN timed out at 11:19 (45-second socket timeout) and that the connection was re-established at 11:21; but when the ZKFC then tried to transition the NN to standby, the call failed with "Connection reset by peer":

2015-10-28 11:19:50,031 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(198)) - Transport-level exception trying to monitor health of NameNode at /:8020: Call From / to :8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/:58463 remote=/:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2015-10-28 11:19:50,031 INFO ha.HealthMonitor (HealthMonitor.java:enterState(224)) - Entering state SERVICE_NOT_RESPONDING
[...]
2015-10-28 11:19:50,031 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(752)) - Quitting master election for NameNode at /:8020 and marking that fencing is necessary
[...]

----> At 11:21, the connection to the NN port is established again:

2015-10-28 11:21:26,316 INFO ha.HealthMonitor (HealthMonitor.java:enterState(224)) - Entering state SERVICE_HEALTHY
2015-10-28 11:21:26,317 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(797)) - Local service NameNode at /:8020 entered state: SERVICE_HEALTHY
2015-10-28 11:21:26,320 INFO zookeeper.ZooKeeper (ZooKeeper.java:(438)) - Initiating client connection, connectString=:2181,:2181,:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7a5c929d
[...]

-----> ZKFC attempts to put the NN into standby state, and fails:

2015-10-28 11:21:26,383 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(474)) - ZK Election indicated that NameNode at /:8020 should become standby
2015-10-28 11:21:26,887 ERROR ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(482)) - Couldn't transition NameNode at /:8020 to standby state
java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "/"; destination host is: "":8020;

2) zkfc log at :
Shows that the NN on this node became active at 11:19, probably when the NN on became unreachable to its ZKFC:

2015-10-28 11:19:58,781 INFO ha.NodeFencer (NodeFencer.java:fence(98)) - ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(sudo hcli internal namenode_fence --target-host=$target_host --target-port=$target_port --target-nameserviceid=$target_nameserviceid --target-namenodeid=$target_namenodeid) ======
2015-10-28 11:19:58,783 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(823)) - Writing znode /hadoop-ha//ActiveBreadCrumb to indicate that the local node is the most recent active...
2015-10-28 11:19:58,850 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(371)) - Trying to make NameNode at /:8020 active...
2015-10-28 11:19:59,540 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(378)) - Successfully transitioned NameNode at /:8020 to active state

--> Need to find more about the fencing here.
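For reference, the flow above can be sketched in a few lines of Java. This is only an illustration (the class, method, and variable names below are ours, not Hadoop's); the real logic lives in the org.apache.hadoop.ha.HealthMonitor and ha.ZKFailoverController classes seen in the log, and the 45-second limit corresponds to ha.health-monitor.rpc-timeout.ms, which defaults to 45000 ms (shortened here so the sketch runs quickly):

import java.util.concurrent.*;

/** Simplified, self-contained sketch of the ZKFC health-check flow described above. */
public class HealthMonitorSketch {

    enum State { SERVICE_HEALTHY, SERVICE_NOT_RESPONDING }

    // Real default: ha.health-monitor.rpc-timeout.ms = 45000 ms; shortened for the demo.
    static final long RPC_TIMEOUT_MS = 200;

    /** Stand-in for the HAServiceProtocol.monitorHealth() RPC that ZKFC sends to the local NN. */
    static State doHealthCheck(ExecutorService rpcPool, Callable<Void> monitorHealth) {
        Future<Void> call = rpcPool.submit(monitorHealth);
        try {
            call.get(RPC_TIMEOUT_MS, TimeUnit.MILLISECONDS);
            return State.SERVICE_HEALTHY;
        } catch (TimeoutException e) {
            // Mirrors "Entering state SERVICE_NOT_RESPONDING": the ZKFC then quits the master
            // election and marks that fencing is necessary, so the peer ZKFC must fence this
            // NameNode before promoting its own NameNode to active (section 2 above).
            call.cancel(true);
            return State.SERVICE_NOT_RESPONDING;
        } catch (InterruptedException | ExecutionException e) {
            return State.SERVICE_NOT_RESPONDING;
        }
    }

    public static void main(String[] args) {
        ExecutorService rpcPool = Executors.newSingleThreadExecutor();

        // A NameNode that is alive but too busy to answer within the timeout.
        Callable<Void> busyNameNode = () -> { Thread.sleep(RPC_TIMEOUT_MS * 5); return null; };
        System.out.println("Busy NN       -> " + doHealthCheck(rpcPool, busyNameNode));

        // A NameNode that answers promptly.
        Callable<Void> responsiveNameNode = () -> null;
        System.out.println("Responsive NN -> " + doHealthCheck(rpcPool, responsiveNameNode));

        rpcPool.shutdownNow();
    }
}

The point of the sketch: from the ZKFC's side, a NameNode that is merely too slow to answer monitorHealth() is indistinguishable from one that is down, which is exactly the situation sections 1) and 2) describe.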
3) The JournalNode had updated lastPromisedEpoch and lastWriterEpoch to 191, since the active NN changed to "":

2015-10-28 11:19:58,872 INFO server.Journal (Journal.java:updateLastPromisedEpoch(315)) - Updating lastPromisedEpoch from 190 to 191 for client /
2015-10-28 11:19:58,873 INFO server.Journal (Journal.java:scanStorageForLatestEdits(188)) - Scanning storage FileJournalManager(root=/data/hadoop/hdfs/journal/)
[...]
2015-10-28 11:19:58,935 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(130)) - Finalizing edits file /data/hadoop/hdfs/journal//current/edits_inprogress_0000000000125591849 -> /data/hadoop/hdfs/journal//current/edits_0000000000125591849-0000000000125592835
2015-10-28 11:19:59,262 INFO server.Journal (Journal.java:startLogSegment(532)) - Updating lastWriterEpoch from 190 to 191 for client /
2015-10-28 11:21:26,302 INFO ipc.Server (Server.java:run(2034)) - IPC Server handler 4 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from :44897 Call#45260 Retry#0
java.io.IOException: IPC's epoch 190 is less than the last promised epoch 191
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)

--> The last error indicates that the NN "" ( ) attempted to write to the JournalNode, and the write was rejected because the epoch on the JournalNode had already been updated when the active NN changed.
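For reference, the epoch rule behind this rejection (Journal.checkRequest seen above) can be sketched as follows. This is a simplified illustration, not the actual org.apache.hadoop.hdfs.qjournal.server.Journal code: each JournalNode remembers the highest epoch it has promised, a newly active NN obtains a strictly higher epoch, and any journal() write carrying an older epoch is refused, which is what fences the old writer out:

import java.io.IOException;

/** Minimal sketch of the JournalNode epoch check behind
 *  "IPC's epoch 190 is less than the last promised epoch 191". */
public class EpochCheckSketch {

    private long lastPromisedEpoch = 190;   // epoch previously granted to the old active NN

    /** A new writer (the newly active NN) must propose a strictly higher epoch. */
    synchronized void newEpoch(long proposed) throws IOException {
        if (proposed <= lastPromisedEpoch) {
            throw new IOException("Proposed epoch " + proposed
                    + " <= last promised epoch " + lastPromisedEpoch);
        }
        // Corresponds to "Updating lastPromisedEpoch from 190 to 191" in the log.
        lastPromisedEpoch = proposed;
    }

    /** Every journal() write carries the writer's epoch and is checked first. */
    synchronized void journal(long writerEpoch, long firstTxnId) throws IOException {
        if (writerEpoch < lastPromisedEpoch) {
            // The rejection the old active NN hit at 11:21:26.
            throw new IOException("IPC's epoch " + writerEpoch
                    + " is less than the last promised epoch " + lastPromisedEpoch);
        }
        // ...append transactions starting at firstTxnId...
    }

    public static void main(String[] args) throws IOException {
        EpochCheckSketch jn = new EpochCheckSketch();
        jn.newEpoch(191);                   // new active NN takes over at 11:19:58
        try {
            jn.journal(190, 125592836L);    // old active NN still writing with epoch 190
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}

Running the sketch prints the same message seen in the JournalNode and NN logs, which is the mechanism that keeps a fenced-out "active" NN from corrupting the shared edit log.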
4) From the NN logs on : grep for "active|standby"

2015-10-28 10:25:47,667 INFO namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1172)) - Stopping services started for standby state
2015-10-28 10:25:47,671 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(988)) - Starting services required for active state
2015-10-28 10:25:47,790 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(999)) - Catching up to latest edits from old active before taking over writer role in edits logs
2015-10-28 10:25:48,109 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1010)) - Reprocessing replication and invalidation queues
2015-10-28 10:25:48,109 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1021)) - Will take over writing edit logs at txnid 125574573
2015-10-28 11:20:48,401 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 15 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from :54415 Call#419748 Retry#0

---> So at 2015-10-28 11:21, the NN "" still believed it was the active NN. It only discovered otherwise when the JournalNode reported an updated epoch that no longer matched the NN's IPC epoch:

2015-10-28 11:21:26,308 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 5 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from :49835 Call#406858 Retry#0
2015-10-28 11:21:26,311 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(360)) - Remote journal :8485 failed to write txns 125592836-125592837. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 190 is less than the last promised epoch 191
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

Earlier, from the zkfc log, we found that the ZKFC lost its connection to the NN at 11:19:50. Looking at the NN log around the same time (on ):

2015-10-28 11:18:45,131 DEBUG BlockStateChange (NameNodeRpcServer.java:cacheReport(1043)) - *BLOCK* NameNode.cacheReport: from DatanodeRegistration(, datanodeUuid=92d866e6-f3f3-4940-b152-a3c8cae8cb91, infoPort=50075, ipcPort=8010, storageInfo=lv=-55;cid=CID-3e72016d-ed18-433e-bb31-4b21afc4b20b;nsid=656505735;c=0) 28464 blocks
2015-10-28 11:18:47,007 DEBUG BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1407)) - BLOCK* neededReplications = 0 pendingReplications = 0
2015-10-28 11:18:47,906 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0
2015-10-28 11:18:48,387 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 30001 milliseconds
2015-10-28 11:19:32,560 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128182 block(s) in 44172 millisecond(s).
2015-10-28 11:19:32,560 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 44172 milliseconds
2015-10-28 11:20:10,493 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128173 block(s) in 37934 millisecond(s).
2015-10-28 11:20:10,493 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 37934 milliseconds
2015-10-28 11:20:10,495 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 5 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from :58463 Call#406856 Retry#0
2015-10-28 11:20:10,496 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0
2015-10-28 11:20:48,399 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128173 block(s) in 37906 millisecond(s).
2015-10-28 11:20:48,399 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 37906 milliseconds
2015-10-28 11:20:48,400 DEBUG BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1407)) - BLOCK* neededReplications = 0 pendingReplications = 0
2015-10-28 11:20:48,401 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0

---> So the NN process was running, but it could have been too busy to respond to the ZKFC's health check for 45 seconds.
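To sanity-check that conclusion, the timestamps quoted above can be lined up against the 45-second health-check window: the ZKFC timeout fired at 11:19:50,031 after a 45000 ms read timeout, so the unanswered call must have been issued around 11:19:05. The snippet below just does that arithmetic; it is not Hadoop code, and all timestamps are copied from the log excerpts:

import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Quick timeline check: do the CacheReplicationMonitor rescans seen in the NN log
 *  cover the 45 s window in which the ZKFC health check got no answer? */
public class TimelineCheck {
    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

        // ZKFC reported the timeout at 11:19:50,031 after 45000 ms, so the health-check
        // call must have been issued around 11:19:05,031.
        LocalDateTime zkfcTimeout = LocalDateTime.parse("2015-10-28 11:19:50,031", f);
        LocalDateTime healthCheckSent = zkfcTimeout.minus(Duration.ofMillis(45_000));

        // Rescan windows reconstructed from the "Rescanning after ..." / "Scanned ... in ..." pairs.
        LocalDateTime rescan1Start = LocalDateTime.parse("2015-10-28 11:18:48,387", f);
        LocalDateTime rescan1End   = LocalDateTime.parse("2015-10-28 11:19:32,560", f); // 44172 ms
        LocalDateTime rescan2End   = LocalDateTime.parse("2015-10-28 11:20:10,493", f); // 37934 ms

        System.out.println("Health check sent  ~ " + healthCheckSent);
        System.out.println("Rescan #1          : " + rescan1Start + " -> " + rescan1End);
        System.out.println("Rescan #2          : " + rescan1End   + " -> " + rescan2End);
        System.out.println("ZKFC timeout fired : " + zkfcTimeout);
        // The two back-to-back rescans span 11:18:48 -> 11:20:10, fully covering the
        // 11:19:05 -> 11:19:50 window in which the health check went unanswered.
    }
}

The two back-to-back CacheReplicationMonitor rescans (44 s and 38 s over roughly 128k blocks) fully cover the unanswered window, which is consistent with the NN being busy with those rescans (especially if the rescan holds the namesystem lock while it runs) rather than being down.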