ZooKeeper Failover controller crashes when the Hadoop NameNode goes down

We are setting up a Hadoop cluster in our development environment. While testing NameNode failover, we noticed that the ZooKeeper Failover Controller would sometimes crash. In one case both ZooKeeper and the ZooKeeper Failover Controller crashed.

At one point in our testing both NameNodes were in an active state. This would cause a split brain scenario in the Hadoop Cluster.

We have not seen any useful information in the logs.

We are using the following versions:
- hadoop-2.7.3
- zookeeper-3.4.10

We have a four-server cluster. Two of the servers are dedicated to NameNodes and two are dedicated to DataNodes.

The components running on the NameNode servers are:
- NameNode
- ZooKeeper
- ZooKeeper Failover Controller
- JournalNode

The components running on the DataNode servers are:
- DataNode
- ZooKeeper
- JournalNode

The following matrix contains the test scenarios. After the matrix we have the contents of the core-site.xml and hdfs-site.xml.

NameNode Server 1                                                      | NameNode Server 2
NameNode 1 is active                                                   | NameNode 2 is standby
kill -9 <pid> (NameNode 1)                                             |
NameNode 1 is down                                                     |
                                                                       | NameNode 2 is active
Start NameNode 1                                                       |
ZooKeeper Failover Controller 1 crashes                                |
NameNode 1 is standby                                                  |
                                                                       | NameNode 2 is active
Start ZooKeeper Failover Controller 1                                  |
NameNode 1 is active                                                   |
                                                                       | NameNode 2 is standby
kill -9 <pid> (NameNode 1)                                             |
NameNode 1 is down                                                     |
                                                                       | NameNode 2 is active
Start NameNode 1                                                       |
ZooKeeper Failover Controller 1 crashes                                |
NameNode 1 is standby                                                  |
No useful information in the ZooKeeper Failover Controller logs        |
                                                                       | NameNode 2 is active
Turn on Log4j debugging                                                |
Start ZooKeeper Failover Controller 1                                  |
ZooKeeper Failover Controller does not start                           |
                                                                       | NameNode 2 is active
Turn on Log4j debugging, console                                       |
Start ZooKeeper Failover Controller 1                                  |
ZooKeeper Failover Controller does not start                           |
Logs: Unable to start failover controller. Parent znode does not exist |
Logs: Run with -formatZK to initialize ZooKeeper                       |
                                                                       | NameNode 2 is active
Run: hdfs zkfc -formatZK                                               |
Start ZooKeeper Failover Controller 1                                  |
NameNode 1 is active                                                   |
                                                                       | NameNode 2 is active
NameNode 1 is active                                                   | Stop NameNode 2
                                                                       | ZooKeeper Failover Controller 2 crashed
                                                                       | ZooKeeper crashed
NameNode 1 is standby                                                  | Start ZooKeeper
                                                                       | Start ZooKeeper Failover Controller 2
                                                                       | Start NameNode 2
                                                                       | NameNode 2 is active

core-site.xml

<configuration>
    <property>
         <name>hadoop.tmp.dir</name>
         <value>/opt/xpm/hadoop_tmp</value>
         <description>A base for other temporary directories.</description>
    </property>
    <property>
         <name>fs.defaultFS</name>
         <value>hdfs://ha-cluster</value>
    </property>
    <property>
         <name>dfs.jornalnode.edits.dir</name>
         <value>/opt/xpm/hadoop_journal</value>
    </property>
</configuration>
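
One detail in the core-site.xml above may be worth double-checking. The last property is spelled dfs.jornalnode.edits.dir, while the key documented in hdfs-default.xml is dfs.journalnode.edits.dir. An unrecognized key is not reported as an error, so with the spelling as posted the JournalNodes would presumably use their default edits directory rather than /opt/xpm/hadoop_journal. A minimal sketch of the property with the documented spelling, assuming the directory from the post is the intended one (the property is conventionally kept in hdfs-site.xml with the other dfs.* settings):

```xml
<!-- Sketch only: same value as in the post, key spelled as in hdfs-default.xml -->
<property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/opt/xpm/hadoop_journal</value>
</property>
```

Nothing in the logs posted in this thread establishes that this spelling is related to the failover problem; it is only an inconsistency in the posted configuration.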

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/xpm/hadoop_namenode</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>

    <property>
        <name>dfs.nameservices</name>
        <value>ha-cluster</value>
    </property>

    <property>
        <name>dfs.ha.namenodes.ha-cluster</name>
        <value>nn1,nn2</value>
    </property>

    <property>
        <name>dfs.namenode.rpc-address.ha-cluster.nn1</name>
        <value>r00j9rn0c.bnymellon.net:9000</value>
    </property>

    <property>
        <name>dfs.namenode.rpc-address.ha-cluster.nn2</name>
        <value>r00j9sn0c.bnymellon.net:9000</value>
    </property>

    <property>
        <name>dfs.namenode.http-address.ha-cluster.nn1</name>
        <value>r00j9rn0c.bnymellon.net:50070</value>
    </property>

    <property>
        <name>dfs.namenode.http-address.ha-cluster.nn2</name>
        <value>r00j9sn0c.bnymellon.net:50070</value>
    </property>

    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://r00j9rn0c.bnymellon.net:8485;r00j9sn0c.bnymellon.net;r00j9tn0c.bnymellon.net:8485;r00j9un0c.bnymellon.net:8485/ha-cluster</value>
    </property>

    <property>
        <name>dfs.client.failover.proxy.provider.ha-cluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>

    <property>
        <name>ha.zookeeper.quorum</name>
        <value>r00j9rn0c.bnymellon.net:2181,r00j9sn0c.bnymellon.net:2181,r00j9tn0c.bnymellon.net:2181,r00j9un0c.bnymellon.net:2181</value>
    </property>

    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>

    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/users/home/pkimd1m/.ssh/id_rsa</value>
    </property>
</configuration>
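
Two observations on the hdfs-site.xml above, offered as a sketch rather than a confirmed fix. First, the second host in dfs.namenode.shared.edits.dir (r00j9sn0c.bnymellon.net) is listed without a port. The NameNode log later in this thread shows a connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485, which suggests the default port was applied, but writing it out keeps the URI consistent. Second, with four JournalNodes the write quorum is three (the same log reports "quorum size 3/4"), so four nodes tolerate one failure, the same as three would; the HDFS HA documentation recommends running an odd number of JournalNodes. The fragment below only adds the explicit port and otherwise leaves the four-node layout as posted:

```xml
<!-- Sketch only: identical to the posted value except for the explicit :8485 on the second host -->
<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://r00j9rn0c.bnymellon.net:8485;r00j9sn0c.bnymellon.net:8485;r00j9tn0c.bnymellon.net:8485;r00j9un0c.bnymellon.net:8485/ha-cluster</value>
</property>
```

The same majority arithmetic applies to the four hosts listed in ha.zookeeper.quorum, assuming they form a single ZooKeeper ensemble.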

3 REPLIES

We see the following FATAL error messages in the ZooKeeper Failover controller Logs.

17/08/07 16:13:38 FATAL ha.ActiveStandbyElector: Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
17/08/07 16:13:38 DEBUG ha.ActiveStandbyElector: Terminating ZK connection for elector id=1435256152 appData=0a0a68612d636c757374657212036e6e311a177230306a39726e30632e626e796d656c6c6f6e2e6e657420a84628d33e cb=Elector callbacks for NameNode at r00j9rn0c.bnymellon.net/10.61.6.20:9000
17/08/07 16:13:38 DEBUG zookeeper.ZooKeeper: Closing session: 0x15dbd57822b000f
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Closing client for session: 0x15dbd57822b000f
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x15dbd57822b000f, packet:: clientPath:null serverPath:null finished:false header:: 6,-11 replyHeader:: 6,73,0 request:: null response:: null
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Disconnecting client for session: 0x15dbd57822b000f
17/08/07 16:13:38 INFO zookeeper.ZooKeeper: Session: 0x15dbd57822b000f closed
17/08/07 16:13:38 FATAL ha.ZKFailoverController: Fatal error occurred:Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
17/08/07 16:13:38 INFO zookeeper.ClientCnxn: EventThread shut down
17/08/07 16:13:38 INFO ipc.Server: Stopping server on 8019
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 0 on 8019: exiting
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 2 on 8019: exiting
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 1 on 8019: exiting
17/08/07 16:13:38 INFO ha.ActiveStandbyElector: Yielding from election
17/08/07 16:13:38 INFO ipc.Server: Stopping IPC Server listener on 8019
17/08/07 16:13:38 INFO ipc.Server: Stopping IPC Server Responder
17/08/07 16:13:38 INFO ha.HealthMonitor: Stopping HealthMonitor thread
17/08/07 16:13:38 FATAL tools.DFSZKFailoverController: Got a fatal error, exiting now
java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
    at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
    at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
    at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)

When we tried to start the ZooKeeper Failover Controller, we got this error on node 1.

17/08/07 16:24:41 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x15dbd57ff59000c, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,35,-101 request:: '/hadoop-ha/ha-cluster,F response::
17/08/07 16:24:41 FATAL ha.ZKFailoverController: Unable to start failover controller. Parent znode does not exist. Run with -formatZK flag to initialize ZooKeeper.

We were starting and stopping the NameNode and the ZooKeeper Failover Controller on the other node (node 2) to resolve an active-active NameNode scenario. For some unexplained reason the ZooKeeper Failover Controller crashed on node 1.

We reformatted the ZooKeeper Failover Controller on node 1 and restarted the NameNode, after which the ZooKeeper Failover Controller crashed on node 2. At this point we still had the active-active NameNode issue.

On node 2 we reformatted the filesystem using this command: hdfs namenode -bootstrapStandby

On node 2 we reformatted the ZooKeeper Failover Controller with this command: hadoop-daemon.sh start zkfc

On node 1 the NameNode crashed. After we brought NameNode 1 back up on node 1, we were still in the active-active NameNode scenario.

Logs from the NameNode on node 1

17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.6.20:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy10.journal(Unknown Source)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.4.25:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy10.journal(Unknown Source)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9un0c.bnymellon.net/10.61.6.21:8485 from pkimd1m got value #142
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.6.21:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy10.journal(Unknown Source)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m: starting, having connections 4
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m sending #140
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m got value #140
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.4.24:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy10.journal(Unknown Source)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.61.6.20:8485, 10.61.4.24:8485, 10.61.4.25:8485, 10.61.6.21:8485], stream=QuorumOutputStream starting at txid 175))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/4. 4 exceptions thrown:
10.61.6.21:8485: IPC's epoch 23 is less than the last promised epoch 25
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.4.24:8485: IPC's epoch 23 is less than the last promised epoch 25
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.6.20:8485: IPC's epoch 23 is less than the last promised epoch 25
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.4.25:8485: IPC's epoch 23 is less than the last promised epoch 25
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1266)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1203)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1300)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5836)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1122)
    at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
    at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
17/08/07 16:40:38 WARN client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 175
17/08/07 16:40:38 INFO util.ExitUtil: Exiting with status 1
17/08/07 16:40:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at r00j9rn0c.bnymellon.net/10.61.6.20
************************************************************/

At this point we brought down the JournalNode, NameNode, ZooKeeper Failover Controller, and ZooKeeper.

We deleted the tmp files for the JournalNode and the files for ZooKeeper.

We did the following on node 2:

hdfs namenode -bootstrapStandby

Started the NameNode on node 2.

hadoop-daemon.sh start zkfc

Started ZKFC on node 2.

This did not resolve the active-active NameNode issue.