1) zkfc log at :
Shows that the ZKFC's health check of the local NN timed out at 11:19 (45-second socket timeout) and that the connection was re-established at 11:21; but when the ZKFC then tried to transition the NN to standby, the call failed with "Connection reset by peer":

2015-10-28 11:19:50,031 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(198)) - Transport-level exception trying to monitor health of NameNode at /:8020: Call From / to :8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/:58463 remote=/:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2015-10-28 11:19:50,031 INFO ha.HealthMonitor (HealthMonitor.java:enterState(224)) - Entering state SERVICE_NOT_RESPONDING
[...]
2015-10-28 11:19:50,031 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(752)) - Quitting master election for NameNode at /:8020 and marking that fencing is necessary
[...]

----> At 11:21, the connection to the NN port is established again:

2015-10-28 11:21:26,316 INFO ha.HealthMonitor (HealthMonitor.java:enterState(224)) - Entering state SERVICE_HEALTHY
2015-10-28 11:21:26,317 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(797)) - Local service NameNode at /:8020 entered state: SERVICE_HEALTHY
2015-10-28 11:21:26,320 INFO zookeeper.ZooKeeper (ZooKeeper.java:(438)) - Initiating client connection, connectString=:2181,:2181,:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7a5c929d
[...]

-----> ZKFC attempts to put the NN into standby state, and fails:

2015-10-28 11:21:26,383 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(474)) - ZK Election indicated that NameNode at /:8020 should become standby
2015-10-28 11:21:26,887 ERROR ha.ZKFailoverController (ZKFailoverController.java:becomeStandby(482)) - Couldn't transition NameNode at /:8020 to standby state
java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "/"; destination host is: "":8020;

2) zkfc log at :
Shows that the NN on this node became active at 11:19, probably when the NN on became unreachable to its ZKFC:

2015-10-28 11:19:58,781 INFO ha.NodeFencer (NodeFencer.java:fence(98)) - ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(sudo hcli internal namenode_fence --target-host=$target_host --target-port=$target_port --target-nameserviceid=$target_nameserviceid --target-namenodeid=$target_namenodeid) ======
2015-10-28 11:19:58,783 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:writeBreadCrumbNode(823)) - Writing znode /hadoop-ha//ActiveBreadCrumb to indicate that the local node is the most recent active...
2015-10-28 11:19:58,850 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(371)) - Trying to make NameNode at /:8020 active...
2015-10-28 11:19:59,540 INFO ha.ZKFailoverController (ZKFailoverController.java:becomeActive(378)) - Successfully transitioned NameNode at /:8020 to active state

--> Need to find more about the fencing here.
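For reference, the flow above can be sketched in a few lines of Java. This is only an illustration (the class, method, and variable names below are ours, not Hadoop's); the real logic lives in the org.apache.hadoop.ha.HealthMonitor and ha.ZKFailoverController classes seen in the log, and the 45-second limit corresponds to ha.health-monitor.rpc-timeout.ms, which defaults to 45000 ms (shortened here so the sketch runs quickly):

import java.util.concurrent.*;

/** Simplified, self-contained sketch of the ZKFC health-check flow described above. */
public class HealthMonitorSketch {

    enum State { SERVICE_HEALTHY, SERVICE_NOT_RESPONDING }

    // Real default: ha.health-monitor.rpc-timeout.ms = 45000 ms; shortened for the demo.
    static final long RPC_TIMEOUT_MS = 200;

    /** Stand-in for the HAServiceProtocol.monitorHealth() RPC that ZKFC sends to the local NN. */
    static State doHealthCheck(ExecutorService rpcPool, Callable<Void> monitorHealth) {
        Future<Void> call = rpcPool.submit(monitorHealth);
        try {
            call.get(RPC_TIMEOUT_MS, TimeUnit.MILLISECONDS);
            return State.SERVICE_HEALTHY;
        } catch (TimeoutException e) {
            // Mirrors "Entering state SERVICE_NOT_RESPONDING": the ZKFC then quits the master
            // election and marks that fencing is necessary, so the peer ZKFC must fence this
            // NameNode before promoting its own NameNode to active (section 2 above).
            call.cancel(true);
            return State.SERVICE_NOT_RESPONDING;
        } catch (InterruptedException | ExecutionException e) {
            return State.SERVICE_NOT_RESPONDING;
        }
    }

    public static void main(String[] args) {
        ExecutorService rpcPool = Executors.newSingleThreadExecutor();

        // A NameNode that is alive but too busy to answer within the timeout.
        Callable<Void> busyNameNode = () -> { Thread.sleep(RPC_TIMEOUT_MS * 5); return null; };
        System.out.println("Busy NN       -> " + doHealthCheck(rpcPool, busyNameNode));

        // A NameNode that answers promptly.
        Callable<Void> responsiveNameNode = () -> null;
        System.out.println("Responsive NN -> " + doHealthCheck(rpcPool, responsiveNameNode));

        rpcPool.shutdownNow();
    }
}

The point of the sketch: from the ZKFC's side, a NameNode that is merely too slow to answer monitorHealth() is indistinguishable from one that is down, which is exactly the situation sections 1) and 2) describe.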
3) The JournalNode had updated lastPromisedEpoch and lastWriterEpoch to 191, since the active NN changed to "":

2015-10-28 11:19:58,872 INFO server.Journal (Journal.java:updateLastPromisedEpoch(315)) - Updating lastPromisedEpoch from 190 to 191 for client /
2015-10-28 11:19:58,873 INFO server.Journal (Journal.java:scanStorageForLatestEdits(188)) - Scanning storage FileJournalManager(root=/data/hadoop/hdfs/journal/)
[...]
2015-10-28 11:19:58,935 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(130)) - Finalizing edits file /data/hadoop/hdfs/journal//current/edits_inprogress_0000000000125591849 -> /data/hadoop/hdfs/journal//current/edits_0000000000125591849-0000000000125592835
2015-10-28 11:19:59,262 INFO server.Journal (Journal.java:startLogSegment(532)) - Updating lastWriterEpoch from 190 to 191 for client /
2015-10-28 11:21:26,302 INFO ipc.Server (Server.java:run(2034)) - IPC Server handler 4 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from :44897 Call#45260 Retry#0
java.io.IOException: IPC's epoch 190 is less than the last promised epoch 191
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)

--> The last error indicates that the NN "" ( ) attempted to write to the JournalNode, and the write was rejected because the epoch on the JournalNode had already been updated when the active NN changed.
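For reference, the epoch rule behind this rejection (Journal.checkRequest seen above) can be sketched as follows. This is a simplified illustration, not the actual org.apache.hadoop.hdfs.qjournal.server.Journal code: each JournalNode remembers the highest epoch it has promised, a newly active NN obtains a strictly higher epoch, and any journal() write carrying an older epoch is refused, which is what fences the old writer out:

import java.io.IOException;

/** Minimal sketch of the JournalNode epoch check behind
 *  "IPC's epoch 190 is less than the last promised epoch 191". */
public class EpochCheckSketch {

    private long lastPromisedEpoch = 190;   // epoch previously granted to the old active NN

    /** A new writer (the newly active NN) must propose a strictly higher epoch. */
    synchronized void newEpoch(long proposed) throws IOException {
        if (proposed <= lastPromisedEpoch) {
            throw new IOException("Proposed epoch " + proposed
                    + " <= last promised epoch " + lastPromisedEpoch);
        }
        // Corresponds to "Updating lastPromisedEpoch from 190 to 191" in the log.
        lastPromisedEpoch = proposed;
    }

    /** Every journal() write carries the writer's epoch and is checked first. */
    synchronized void journal(long writerEpoch, long firstTxnId) throws IOException {
        if (writerEpoch < lastPromisedEpoch) {
            // The rejection the old active NN hit at 11:21:26.
            throw new IOException("IPC's epoch " + writerEpoch
                    + " is less than the last promised epoch " + lastPromisedEpoch);
        }
        // ...append transactions starting at firstTxnId...
    }

    public static void main(String[] args) throws IOException {
        EpochCheckSketch jn = new EpochCheckSketch();
        jn.newEpoch(191);                   // new active NN takes over at 11:19:58
        try {
            jn.journal(190, 125592836L);    // old active NN still writing with epoch 190
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}

Running the sketch prints the same message seen in the JournalNode and NN logs, which is the mechanism that keeps a fenced-out "active" NN from corrupting the shared edit log.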
4) From the NN logs on : grep for "active|standby"

2015-10-28 10:25:47,667 INFO namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1172)) - Stopping services started for standby state
2015-10-28 10:25:47,671 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(988)) - Starting services required for active state
2015-10-28 10:25:47,790 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(999)) - Catching up to latest edits from old active before taking over writer role in edits logs
2015-10-28 10:25:48,109 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1010)) - Reprocessing replication and invalidation queues
2015-10-28 10:25:48,109 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1021)) - Will take over writing edit logs at txnid 125574573
2015-10-28 11:20:48,401 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 15 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from :54415 Call#419748 Retry#0

---> So at 2015-10-28 11:21, the NN "" still believed it was the active NN. It only discovered otherwise when the JournalNode reported an updated epoch that no longer matched the NN's IPC epoch:

2015-10-28 11:21:26,308 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 5 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from :49835 Call#406858 Retry#0
2015-10-28 11:21:26,311 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(360)) - Remote journal :8485 failed to write txns 125592836-125592837. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 190 is less than the last promised epoch 191
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

Earlier, from the zkfc log, we found that the ZKFC lost its connection to the NN at 11:19:50. Looking at the NN log around the same time (on ):

2015-10-28 11:18:45,131 DEBUG BlockStateChange (NameNodeRpcServer.java:cacheReport(1043)) - *BLOCK* NameNode.cacheReport: from DatanodeRegistration(, datanodeUuid=92d866e6-f3f3-4940-b152-a3c8cae8cb91, infoPort=50075, ipcPort=8010, storageInfo=lv=-55;cid=CID-3e72016d-ed18-433e-bb31-4b21afc4b20b;nsid=656505735;c=0) 28464 blocks
2015-10-28 11:18:47,007 DEBUG BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1407)) - BLOCK* neededReplications = 0 pendingReplications = 0
2015-10-28 11:18:47,906 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0
2015-10-28 11:18:48,387 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 30001 milliseconds
2015-10-28 11:19:32,560 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128182 block(s) in 44172 millisecond(s).
2015-10-28 11:19:32,560 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 44172 milliseconds
2015-10-28 11:20:10,493 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128173 block(s) in 37934 millisecond(s).
2015-10-28 11:20:10,493 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 37934 milliseconds
2015-10-28 11:20:10,495 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 5 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from :58463 Call#406856 Retry#0
2015-10-28 11:20:10,496 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0
2015-10-28 11:20:48,399 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128173 block(s) in 37906 millisecond(s).
2015-10-28 11:20:48,399 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 37906 milliseconds
2015-10-28 11:20:48,400 DEBUG BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1407)) - BLOCK* neededReplications = 0 pendingReplications = 0
2015-10-28 11:20:48,401 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0

---> So the NN process was running, but it could have been too busy to respond to the ZKFC's health check for 45 seconds.
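To sanity-check that conclusion, the timestamps quoted above can be lined up against the 45-second health-check window: the ZKFC timeout fired at 11:19:50,031 after a 45000 ms read timeout, so the unanswered call must have been issued around 11:19:05. The snippet below just does that arithmetic; it is not Hadoop code, and all timestamps are copied from the log excerpts:

import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Quick timeline check: do the CacheReplicationMonitor rescans seen in the NN log
 *  cover the 45 s window in which the ZKFC health check got no answer? */
public class TimelineCheck {
    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

        // ZKFC reported the timeout at 11:19:50,031 after 45000 ms, so the health-check
        // call must have been issued around 11:19:05,031.
        LocalDateTime zkfcTimeout = LocalDateTime.parse("2015-10-28 11:19:50,031", f);
        LocalDateTime healthCheckSent = zkfcTimeout.minus(Duration.ofMillis(45_000));

        // Rescan windows reconstructed from the "Rescanning after ..." / "Scanned ... in ..." pairs.
        LocalDateTime rescan1Start = LocalDateTime.parse("2015-10-28 11:18:48,387", f);
        LocalDateTime rescan1End   = LocalDateTime.parse("2015-10-28 11:19:32,560", f); // 44172 ms
        LocalDateTime rescan2End   = LocalDateTime.parse("2015-10-28 11:20:10,493", f); // 37934 ms

        System.out.println("Health check sent  ~ " + healthCheckSent);
        System.out.println("Rescan #1          : " + rescan1Start + " -> " + rescan1End);
        System.out.println("Rescan #2          : " + rescan1End   + " -> " + rescan2End);
        System.out.println("ZKFC timeout fired : " + zkfcTimeout);
        // The two back-to-back rescans span 11:18:48 -> 11:20:10, fully covering the
        // 11:19:05 -> 11:19:50 window in which the health check went unanswered.
    }
}

The two back-to-back CacheReplicationMonitor rescans (44 s and 38 s over roughly 128k blocks) fully cover the unanswered window, which is consistent with the NN being busy with those rescans (especially if the rescan holds the namesystem lock while it runs) rather than being down.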