Created 08-07-2017 08:02 PM
We are setting up a Hadoop cluster in our development environment. While we were testing failover of the NameNode we noticed that the ZooKeeper Failover Controller would sometimes crash. In one case both ZooKeeper and the ZooKeeper Failover Controller crashed.
At one point in our testing both NameNodes were in an active state, which is a split-brain scenario for the Hadoop cluster.
We have not seen any useful information in the logs.
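For context, the active/standby state that each NameNode reports can be checked with hdfs haadmin, using the service IDs nn1 and nn2 from our hdfs-site.xml below:

# Ask each NameNode for its current HA state
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2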
We are using the following versions:
- hadoop-2.7.3
- zookeeper-3.4.10
We have a four-server cluster. Two of the servers are dedicated to NameNodes and two are dedicated to DataNodes.
The components running on the NameNode servers are:
- NameNode
- ZooKeeper
- ZooKeeper Failover Controller
- JournalNode
The components running on the DataNode servers are:
- DataNode
- ZooKeeper
- JournalNode
The following matrix contains the test scenarios. After the matrix we have the contents of the core-site.xml and hdfs-site.xml.
NameNode Server 1 | NameNode Server 2
NameNode 1 is active | NameNode 2 is standby
kill -9 the NameNode 1 pid; NameNode 1 is down | NameNode 2 is active
Start NameNode 1; ZooKeeper Failover Controller 1 crashes; NameNode 1 is standby | NameNode 2 is active
Start ZooKeeper Failover Controller 1; NameNode 1 is active | NameNode 2 is standby
kill -9 the NameNode 1 pid; NameNode 1 is down | NameNode 2 is active
Start NameNode 1; ZooKeeper Failover Controller 1 crashes; NameNode 1 is standby; no useful information in the ZooKeeper Failover Controller logs | NameNode 2 is active
Turn on log4j debugging; start ZooKeeper Failover Controller 1; ZooKeeper Failover Controller does not start | NameNode 2 is active
Turn on log4j debugging (console); start ZooKeeper Failover Controller 1; ZooKeeper Failover Controller does not start; logs: "Unable to start failover controller. Parent znode does not exist. Run with -formatZK flag to initialize ZooKeeper." | NameNode 2 is active
Run hdfs zkfc -formatZK; start ZooKeeper Failover Controller 1; NameNode 1 is active | NameNode 2 is active
NameNode 1 is active | Stop NameNode 2; ZooKeeper Failover Controller 2 crashed; ZooKeeper crashed
NameNode 1 is standby | Start ZooKeeper; start ZooKeeper Failover Controller 2; start NameNode 2; NameNode 2 is active
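For reference, the commands behind the matrix steps are roughly the following (a sketch assuming a plain Apache Hadoop 2.7.3 install with the sbin scripts on the PATH; the NameNode pid is found with jps):

# Simulate a NameNode failure on NameNode server 1 (pid found with jps)
kill -9 <namenode pid>

# Bring the NameNode and the ZooKeeper Failover Controller back up
hadoop-daemon.sh start namenode
hadoop-daemon.sh start zkfc

# Re-initialize the HA state in ZooKeeper when the ZKFC refuses to start
hdfs zkfc -formatZK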
core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/xpm/hadoop_tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ha-cluster</value>
  </property>
  <property>
    <name>dfs.jornalnode.edits.dir</name>
    <value>/opt/xpm/hadoop_journal</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/xpm/hadoop_namenode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>ha-cluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.ha-cluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ha-cluster.nn1</name>
    <value>r00j9rn0c.bnymellon.net:9000</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ha-cluster.nn2</name>
    <value>r00j9sn0c.bnymellon.net:9000</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ha-cluster.nn1</name>
    <value>r00j9rn0c.bnymellon.net:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ha-cluster.nn2</name>
    <value>r00j9sn0c.bnymellon.net:50070</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://r00j9rn0c.bnymellon.net:8485;r00j9sn0c.bnymellon.net;r00j9tn0c.bnymellon.net:8485;r00j9un0c.bnymellon.net:8485/ha-cluster</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.ha-cluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>r00j9rn0c.bnymellon.net:2181,r00j9sn0c.bnymellon.net:2181,r00j9tn0c.bnymellon.net:2181,r00j9un0c.bnymellon.net:2181</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/users/home/pkimd1m/.ssh/id_rsa</value>
  </property>
</configuration>
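To rule out a configuration mismatch between the two NameNode servers, the effective value of the HA keys can be dumped on each host with hdfs getconf (only a few of the keys from the file above are shown):

# Print the effective value of selected HA-related keys on this host
hdfs getconf -confKey dfs.nameservices
hdfs getconf -confKey dfs.ha.namenodes.ha-cluster
hdfs getconf -confKey dfs.namenode.shared.edits.dir
hdfs getconf -confKey ha.zookeeper.quorum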
Created 08-07-2017 08:22 PM
We see the following FATAL error messages in the ZooKeeper Failover Controller logs.
17/08/07 16:13:38 FATAL ha.ActiveStandbyElector: Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
17/08/07 16:13:38 DEBUG ha.ActiveStandbyElector: Terminating ZK connection for elector id=1435256152 appData=0a0a68612d636c757374657212036e6e311a177230306a39726e30632e626e796d656c6c6f6e2e6e657420a84628d33e cb=Elector callbacks for NameNode at r00j9rn0c.bnymellon.net/10.61.6.20:9000
17/08/07 16:13:38 DEBUG zookeeper.ZooKeeper: Closing session: 0x15dbd57822b000f
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Closing client for session: 0x15dbd57822b000f
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x15dbd57822b000f, packet:: clientPath:null serverPath:null finished:false header:: 6,-11 replyHeader:: 6,73,0 request:: null response:: null
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Disconnecting client for session: 0x15dbd57822b000f
17/08/07 16:13:38 INFO zookeeper.ZooKeeper: Session: 0x15dbd57822b000f closed
17/08/07 16:13:38 FATAL ha.ZKFailoverController: Fatal error occurred:Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
17/08/07 16:13:38 INFO zookeeper.ClientCnxn: EventThread shut down
17/08/07 16:13:38 INFO ipc.Server: Stopping server on 8019
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 0 on 8019: exiting
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 2 on 8019: exiting
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 1 on 8019: exiting
17/08/07 16:13:38 INFO ha.ActiveStandbyElector: Yielding from election
17/08/07 16:13:38 INFO ipc.Server: Stopping IPC Server listener on 8019
17/08/07 16:13:38 INFO ipc.Server: Stopping IPC Server Responder
17/08/07 16:13:38 INFO ha.HealthMonitor: Stopping HealthMonitor thread
17/08/07 16:13:38 FATAL tools.DFSZKFailoverController: Got a fatal error, exiting now
java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
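The NONODE error refers to the parent znode that the ZKFC expects under /hadoop-ha. It can be inspected from any quorum member with the ZooKeeper CLI (the host below is just one of our ZooKeeper servers):

# Open the ZooKeeper CLI against one quorum member
zkCli.sh -server r00j9rn0c.bnymellon.net:2181
# ...then, inside the zkCli shell, list the HA znodes
ls /hadoop-ha
ls /hadoop-ha/ha-cluster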
Created 08-07-2017 08:54 PM
When we tried to start the ZooKeeper Failover Controller, we got this error on node 1.
17/08/07 16:24:41 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x15dbd57ff59000c, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,35,-101 request:: '/hadoop-ha/ha-cluster,F response::
17/08/07 16:24:41 FATAL ha.ZKFailoverController: Unable to start failover controller. Parent znode does not exist. Run with -formatZK flag to initialize ZooKeeper.
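As the message suggests, the parent znode for the nameservice can be re-created with -formatZK; as far as we understand, this only rebuilds the /hadoop-ha/ha-cluster znode in ZooKeeper and does not touch any HDFS data:

# Re-create the HA parent znode for this nameservice in ZooKeeper
# (run on one of the NameNode servers while the ZKFCs are stopped)
hdfs zkfc -formatZK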
We were starting and stopping the NameNode and ZooKeeper Failover Controller on the other node (node 2) to resolve an active-active NameNode scenario. For some unexplained reason the ZooKeeper Failover Controller crashed on node 1.
We reformatted the ZooKeeper Failover Controller state on node 1 and restarted the NameNode; then the ZooKeeper Failover Controller crashed on node 2. At this point we still had the active-active NameNode issue.
On node 2 we re-initialized the standby NameNode with this command: hdfs namenode -bootstrapStandby
On node 2 we started the ZooKeeper Failover Controller with this command: hadoop-daemon.sh start zkfc
The NameNode on node 1 crashed. After we brought NameNode 1 back up on node 1, we were still in the active-active NameNode scenario.
Logs from the NameNode on node 1:
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.6.20:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.4.25:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9un0c.bnymellon.net/10.61.6.21:8485 from pkimd1m got value #142
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.6.21:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m: starting, having connections 4
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m sending #140
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m got value #140
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.4.24:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.61.6.20:8485, 10.61.4.24:8485, 10.61.4.25:8485, 10.61.6.21:8485], stream=QuorumOutputStream starting at txid 175))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/4. 4 exceptions thrown:
10.61.6.21:8485: IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.4.24:8485: IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.6.20:8485: IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.4.25:8485: IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1266)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1203)
at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1300)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5836)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1122)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
17/08/07 16:40:38 WARN client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 175
17/08/07 16:40:38 INFO util.ExitUtil: Exiting with status 1
17/08/07 16:40:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at r00j9rn0c.bnymellon.net/10.61.6.20
************************************************************/
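Our reading of the epoch error above is that the JournalNodes had already promised a newer epoch (25) to the other NameNode, so writes from this NameNode (still on epoch 23) were fenced off and it shut itself down. As a sanity check after such an event, the HA state can be queried, and if both sides still claim to be active one of them can be manually demoted (--forcemanual bypasses the normal ZKFC coordination and fencing, so we treat it as a last resort):

# Confirm what each NameNode currently reports
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# If both report active, force one side back to standby
hdfs haadmin -transitionToStandby --forcemanual nn2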
Created 08-07-2017 09:08 PM
At this point we brought down the JournalNode, NameNode, ZooKeeper Failover Controller, and ZooKeeper.
We deleted the tmp files for the JournalNode and the data files for ZooKeeper.
We did the following on node 2 (the full command sequence is sketched after these steps):
- ran hdfs namenode -bootstrapStandby
- started the NameNode on node 2
- ran hadoop-daemon.sh start zkfc to start zkfc on node 2
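Put together, the reset we attempted looks roughly like this (a sketch only; the journal path is the one we configured in core-site.xml, and <zookeeper dataDir> is a placeholder for the real ZooKeeper data directory on each host):

# On every node: stop the HDFS daemons and ZooKeeper
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop zkfc
hadoop-daemon.sh stop journalnode
zkServer.sh stop

# Clear the JournalNode edits and the ZooKeeper data
rm -rf /opt/xpm/hadoop_journal/*
rm -rf <zookeeper dataDir>/version-2

# Restart ZooKeeper and the JournalNodes everywhere
zkServer.sh start
hadoop-daemon.sh start journalnode

# On node 2: resync the standby metadata, then start the daemons
hdfs namenode -bootstrapStandby
hadoop-daemon.sh start namenode
hadoop-daemon.sh start zkfc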
This did not resolve the active-active NameNode issue.