Member since: 08-07-2017
Posts: 5
Kudos Received: 1
Solutions: 0
08-07-2017
09:08 PM
At this point we brought down the JournalNode, NameNode, ZooKeeper Failover Controller, and ZooKeeper. We deleted the tmp files for the JournalNode and the files for ZooKeeper, then did the following:

On node 2: hdfs namenode -bootstrapStandby
Started the NameNode on node 2.
On node 2: hadoop-daemon.sh start zkfc

This did not resolve the active-active NameNode issue.
08-07-2017
08:54 PM
When we tried to start the ZooKeeper Failover Controller, we got this error on node 1:

17/08/07 16:24:41 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x15dbd57ff59000c, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,35,-101 request:: '/hadoop-ha/ha-cluster,F response::
17/08/07 16:24:41 FATAL ha.ZKFailoverController: Unable to start failover controller. Parent znode does not exist. Run with -formatZK flag to initialize ZooKeeper.

We were starting/stopping the NameNode and ZooKeeper Failover Controller on the other node (node 2) to resolve an active-active NameNode scenario. For some unexplained reason the ZooKeeper Failover Controller crashed on node 1. We reformatted the ZooKeeper Failover Controller state on node 1 and restarted the NameNode; the ZooKeeper Failover Controller then crashed on node 2. At this point we still had the active-active NameNode issue.

On node 2 we bootstrapped the standby NameNode metadata with this command: hdfs namenode -bootstrapStandby
On node 2 we started the ZooKeeper Failover Controller with this command: hadoop-daemon.sh start zkfc

The NameNode on node 1 then crashed. After we brought the NameNode back up on node 1 we were still in the active-active NameNode scenario.

Logs from the NameNode on node 1:

17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.6.20:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.4.25:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9un0c.bnymellon.net/10.61.6.21:8485 from pkimd1m got value #142
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.6.21:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m: starting, having connections 4
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m sending #140
17/08/07 16:40:38 DEBUG ipc.Client: IPC Client (583767494) connection to r00j9sn0c.bnymellon.net/10.61.4.24:8485 from pkimd1m got value #140
17/08/07 16:40:38 WARN client.QuorumJournalManager: Remote journal 10.61.4.24:8485 failed to write txns 176-176. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy10.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/07 16:40:38 FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.61.6.20:8485, 10.61.4.24:8485, 10.61.4.25:8485, 10.61.6.21:8485], stream=QuorumOutputStream starting at txid 175))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/4. 4 exceptions thrown:
10.61.6.21:8485: IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.4.24:8485: IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.6.20:8485: IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
10.61.4.25:8485: IPC's epoch 23 is less than the last promised epoch 25
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:418)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:446)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:341)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1266)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1203)
at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1300)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5836)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1122)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
17/08/07 16:40:38 WARN client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 175
17/08/07 16:40:38 INFO util.ExitUtil: Exiting with status 1
17/08/07 16:40:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at r00j9rn0c.bnymellon.net/10.61.6.20
************************************************************/
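The "IPC's epoch 23 is less than the last promised epoch 25" messages above are the JournalNodes fencing a writer they consider stale: each JournalNode remembers the highest epoch any NameNode has promised it, and rejects edit-log writes tagged with any lower epoch. This is how QJM prevents a fenced-out (formerly active) NameNode from continuing to write. A minimal Python sketch of that check (the class and method names here are illustrative, not Hadoop's actual implementation):

```python
class JournalNode:
    """Toy model of the epoch check a JournalNode applies to writes.

    Illustrative only -- names and structure are not Hadoop's real code.
    """

    def __init__(self):
        self.last_promised_epoch = 0

    def new_epoch(self, epoch):
        # A NameNode becoming active asks each JN to promise a higher epoch.
        if epoch <= self.last_promised_epoch:
            raise IOError(f"Proposed epoch {epoch} <= last promised "
                          f"{self.last_promised_epoch}")
        self.last_promised_epoch = epoch

    def journal(self, writer_epoch, txns):
        # Writes from a stale writer are rejected -- this is exactly the
        # "IPC's epoch X is less than the last promised epoch Y" error
        # seen in the log above.
        if writer_epoch < self.last_promised_epoch:
            raise IOError(f"IPC's epoch {writer_epoch} is less than the "
                          f"last promised epoch {self.last_promised_epoch}")
        return f"wrote txns {txns}"


jn = JournalNode()
jn.new_epoch(23)              # first NameNode becomes active with epoch 23
jn.new_epoch(25)              # second NameNode fences it by promising epoch 25
try:
    jn.journal(23, "176-176")  # the stale writer's edits are now rejected
except IOError as e:
    print(e)  # IPC's epoch 23 is less than the last promised epoch 25
```

So the crash above is the fencing mechanism working as designed: the NameNode on node 1 was writing with an epoch that the JournalNodes had already superseded, and a NameNode that cannot flush to a quorum of JournalNodes aborts.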
08-07-2017
08:22 PM
We see the following FATAL error messages in the ZooKeeper Failover Controller logs:

17/08/07 16:13:38 FATAL ha.ActiveStandbyElector: Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
17/08/07 16:13:38 DEBUG ha.ActiveStandbyElector: Terminating ZK connection for elector id=1435256152 appData=0a0a68612d636c757374657212036e6e311a177230306a39726e30632e626e796d656c6c6f6e2e6e657420a84628d33e cb=Elector callbacks for NameNode at r00j9rn0c.bnymellon.net/10.61.6.20:9000
17/08/07 16:13:38 DEBUG zookeeper.ZooKeeper: Closing session: 0x15dbd57822b000f
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Closing client for session: 0x15dbd57822b000f
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x15dbd57822b000f, packet:: clientPath:null serverPath:null finished:false header:: 6,-11 replyHeader:: 6,73,0 request:: null response:: null
17/08/07 16:13:38 DEBUG zookeeper.ClientCnxn: Disconnecting client for session: 0x15dbd57822b000f
17/08/07 16:13:38 INFO zookeeper.ZooKeeper: Session: 0x15dbd57822b000f closed
17/08/07 16:13:38 FATAL ha.ZKFailoverController: Fatal error occurred:Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
17/08/07 16:13:38 INFO zookeeper.ClientCnxn: EventThread shut down
17/08/07 16:13:38 INFO ipc.Server: Stopping server on 8019
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 0 on 8019: exiting
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 2 on 8019: exiting
17/08/07 16:13:38 DEBUG ipc.Server: IPC Server handler 1 on 8019: exiting
17/08/07 16:13:38 INFO ha.ActiveStandbyElector: Yielding from election
17/08/07 16:13:38 INFO ipc.Server: Stopping IPC Server listener on 8019
17/08/07 16:13:38 INFO ipc.Server: Stopping IPC Server Responder
17/08/07 16:13:38 INFO ha.HealthMonitor: Stopping HealthMonitor thread
17/08/07 16:13:38 FATAL tools.DFSZKFailoverController: Got a fatal error, exiting now
java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:NONODE for path /hadoop-ha/ha-cluster/ActiveStandbyElectorLock
at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
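ZooKeeper returns NONODE on create when the parent of the znode being created does not exist; a plain create is not recursive. So with /hadoop-ha/ha-cluster missing (deleted or never formatted), the elector cannot create .../ActiveStandbyElectorLock, which is what running hdfs zkfc -formatZK repairs by recreating the parent znode. A toy in-memory model of that create rule (illustrative only, not the ZooKeeper client API):

```python
import posixpath

class TinyZk:
    """Toy in-memory znode tree illustrating ZooKeeper's non-recursive create."""

    def __init__(self):
        self.nodes = {"/"}

    def create(self, path):
        parent = posixpath.dirname(path)
        if parent not in self.nodes:
            # This is the condition behind "code:NONODE for path ..."
            raise RuntimeError(f"NONODE: parent {parent} does not exist")
        self.nodes.add(path)


zk = TinyZk()
try:
    # Parent /hadoop-ha/ha-cluster is missing, so creating the lock znode
    # fails just like in the ZKFC log above.
    zk.create("/hadoop-ha/ha-cluster/ActiveStandbyElectorLock")
except RuntimeError as e:
    print(e)  # NONODE: parent /hadoop-ha/ha-cluster does not exist

# What `hdfs zkfc -formatZK` effectively does: create the parent path first.
zk.create("/hadoop-ha")
zk.create("/hadoop-ha/ha-cluster")
zk.create("/hadoop-ha/ha-cluster/ActiveStandbyElectorLock")  # now succeeds
```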
08-07-2017
08:02 PM
We are setting up a Hadoop cluster in our development environment. While we were testing failover of the NameNode we noticed that the ZooKeeper Failover Controller would sometimes crash. In one case both ZooKeeper and the ZooKeeper Failover Controller crashed. At one point in our testing both NameNodes were in an active state, which causes a split-brain scenario in the Hadoop cluster. We have not seen any useful information in the logs.

We are using the following versions:
- hadoop-2.7.3
- zookeeper-3.4.10

We have a four-server cluster. Two of the servers are dedicated to NameNodes and two of the servers are dedicated to DataNodes.

The components running on the NameNode servers are:
- NameNode
- ZooKeeper
- ZooKeeper Failover Controller
- JournalNode

The components running on the DataNode servers are:
- DataNode
- ZooKeeper
- JournalNode

The following matrix contains the test scenarios. After the matrix we have the contents of core-site.xml and hdfs-site.xml.
1. Initial state: NameNode 1 is active; NameNode 2 is standby.
2. kill -9 the NameNode 1 pid: NameNode 1 is down; NameNode 2 is active.
3. Start NameNode 1: ZooKeeper Failover Controller 1 crashes; NameNode 1 is standby; NameNode 2 is active.
4. Start ZooKeeper Failover Controller 1: NameNode 1 is active; NameNode 2 is standby.
5. kill -9 the NameNode 1 pid: NameNode 1 is down; NameNode 2 is active.
6. Start NameNode 1: ZooKeeper Failover Controller 1 crashes; NameNode 1 is standby; NameNode 2 is active. No useful information in the ZooKeeper Failover Controller logs.
7. Turn on log4j debugging and start ZooKeeper Failover Controller 1: the ZooKeeper Failover Controller does not start; NameNode 2 is active.
8. Turn on log4j debugging (console) and start ZooKeeper Failover Controller 1: the ZooKeeper Failover Controller does not start; NameNode 2 is active. Logs: "Unable to start failover controller. Parent znode does not exist. Run with -formatZK flag to initialize ZooKeeper."
9. Run hdfs zkfc -formatZK and start ZooKeeper Failover Controller 1: NameNode 1 is active; NameNode 2 is active (both active).
10. Stop NameNode 2: ZooKeeper Failover Controller 2 crashed; ZooKeeper crashed; NameNode 1 is standby.
11. Start ZooKeeper, ZooKeeper Failover Controller 2, and NameNode 2: NameNode 2 is active.

core-site.xml:

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/xpm/hadoop_tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ha-cluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/opt/xpm/hadoop_journal</value>
</property>
</configuration>
hdfs-site.xml:

<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/xpm/hadoop_namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ha-cluster</value>
</property>
<property>
<name>dfs.ha.namenodes.ha-cluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ha-cluster.nn1</name>
<value>r00j9rn0c.bnymellon.net:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ha-cluster.nn2</name>
<value>r00j9sn0c.bnymellon.net:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.ha-cluster.nn1</name>
<value>r00j9rn0c.bnymellon.net:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.ha-cluster.nn2</name>
<value>r00j9sn0c.bnymellon.net:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://r00j9rn0c.bnymellon.net:8485;r00j9sn0c.bnymellon.net:8485;r00j9tn0c.bnymellon.net:8485;r00j9un0c.bnymellon.net:8485/ha-cluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.ha-cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>r00j9rn0c.bnymellon.net:2181,r00j9sn0c.bnymellon.net:2181,r00j9tn0c.bnymellon.net:2181,r00j9un0c.bnymellon.net:2181</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/users/home/pkimd1m/.ssh/id_rsa</value>
</property>
</configuration>
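One observation about the config above: both dfs.namenode.shared.edits.dir and ha.zookeeper.quorum list four members. Majority-quorum systems with four nodes require 3/4 agreement (hence the "too many exceptions to achieve quorum size 3/4" error in the NameNode log) but tolerate only one node failure, the same as three nodes, which is why odd-sized ensembles are usually recommended. A quick check of that arithmetic:

```python
def quorum(n):
    """Minimum number of members that must agree in a majority quorum."""
    return n // 2 + 1

def failures_tolerated(n):
    """How many members can be lost while a quorum is still reachable."""
    return n - quorum(n)

for n in (3, 4, 5):
    print(f"{n} nodes: quorum {quorum(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
# 3 nodes: quorum 2, tolerates 1 failure(s)
# 4 nodes: quorum 3, tolerates 1 failure(s)
# 5 nodes: quorum 3, tolerates 2 failure(s)
```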
Labels:
- Apache Hadoop