Reply
Highlighted
Explorer
Posts: 16
Registered: ‎05-22-2016
Accepted Solution

Namenode failovered when a JournalNode restart

dear all.

 

My cluster is configured as below

 

2 NN

3 JN

2 Failovercontroller

20 DN

 

For maintainence reason, I had to restart all component of our hadoop cluster.

When I restarted a JN (this JN is in the same machine of NN), NN failovered.

 

below is the NN's log 

This log says it failed because it failed writing to majority.

I'd like to know why can't write to JN.

Thanks.

2017-02-17 11:18:04,266 INFO BlockStateChange: BLOCK* BlockManager: ask xxx.xxx.xx.192:50010 to delete [blk_1073833801_92991]
2017-02-17 11:18:04,267 INFO BlockStateChange: BLOCK* BlockManager: ask xxx.xxx.xx.191:50010 to delete [blk_1073833801_92991]
2017-02-17 11:18:07,267 INFO BlockStateChange: BLOCK* BlockManager: ask xxx.xxx.xx.189:50010 to delete [blk_1073833801_92991]
2017-02-17 11:18:30,896 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
2017-02-17 11:18:30,898 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(         s).
2017-02-17 11:19:00,897 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2017-02-17 11:19:00,898 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 2 millisecond(         s).
2017-02-17 11:19:01,314 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 8 Total time for transactions(ms): 4 Number of transactions batched in Syncs: 2 Number of syncs: 5 SyncTimes(ms): 23 6
2017-02-17 11:19:01,354 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal xxx.xxx.xx.188:8485 failed to write txns 679604-679604. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:354)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1409)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy18.journal(Unknown Source)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-02-17 11:19:01,776 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal xxx.xxx.xx.190:8485 failed to write txns 679604-679604. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:354)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
 
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1409)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy18.journal(Unknown Source)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-02-17 11:19:01,779 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx.xxx.xx.188:8485, xxx.xxx.xx.190:8485, xxx.xxx.xx.193:8485], stream=QuorumOutputStream starting at txid 679597))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 1 successful responses:
xxx.xxx.xx.193:8485: null [success]
2 exceptions thrown:
xxx.xxx.xx.190:8485: Can't write, no segment open
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:354)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
 
xxx.xxx.xx.188:8485: Can't write, no segment open
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:354)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
 
        at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:651)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:585)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2752)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2624)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:599)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:112)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:401)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java         )
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
2017-02-17 11:19:01,781 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 679597
2017-02-17 11:19:01,788 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-02-17 11:19:01,795 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at abc.db.co/xxx.xxx.xx.190
Explorer
Posts: 16
Registered: ‎05-22-2016

Re: Namenode failovered when a JournalNode restart

I tested this again.

"Restarting JN which is in the same machine of NN occurs NN fail" was my misunderstanding.

 

The cause of problem is below I thnik.

 

When the JN is restarted, the editlog file known to NN's QJM is no longer available (invalid)
You can not record changes to this JN until the edit log is rolled and a new log file is created.
After the JN restart, it takes about 3 minutes for the editlog to roll and transition to a usable state.

When restarting JN, restart one of them and immediately restart the other. If two JNs can not write, they will stop NN.

 

So, If I restarted JN with interval (about 5 minutes), this can not happen.

Announcements