NameNode failed over when a JournalNode was restarted

Dear all,

My cluster is configured as follows:

2 NN
3 JN
2 Failover controllers
20 DN

For maintenance reasons, I had to restart all components of our Hadoop cluster.

When I restarted a JN (one that runs on the same machine as an NN), the NN failed over.

Below is the NN's log. It says the NN shut down because it could not write to a majority of the JNs.

I'd like to know why it could not write to the JNs.

Thanks.

2017-02-17 11:18:04,266 INFO BlockStateChange: BLOCK* BlockManager: ask xxx.xxx.xx.192:50010 to delete [blk_1073833801_92991]
2017-02-17 11:18:04,267 INFO BlockStateChange: BLOCK* BlockManager: ask xxx.xxx.xx.191:50010 to delete [blk_1073833801_92991]
2017-02-17 11:18:07,267 INFO BlockStateChange: BLOCK* BlockManager: ask xxx.xxx.xx.189:50010 to delete [blk_1073833801_92991]
2017-02-17 11:18:30,896 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
2017-02-17 11:18:30,898 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2017-02-17 11:19:00,897 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2017-02-17 11:19:00,898 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 2 millisecond(s).
2017-02-17 11:19:01,314 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 8 Total time for transactions(ms): 4 Number of transactions batched in Syncs: 2 Number of syncs: 5 SyncTimes(ms): 23 6
2017-02-17 11:19:01,354 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal xxx.xxx.xx.188:8485 failed to write txns 679604-679604. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:354)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1409)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy18.journal(Unknown Source)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-02-17 11:19:01,776 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal xxx.xxx.xx.190:8485 failed to write txns 679604-679604. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:354)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
 
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1409)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy18.journal(Unknown Source)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-02-17 11:19:01,779 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx.xxx.xx.188:8485, xxx.xxx.xx.190:8485, xxx.xxx.xx.193:8485], stream=QuorumOutputStream starting at txid 679597))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 1 successful responses:
xxx.xxx.xx.193:8485: null [success]
2 exceptions thrown:
xxx.xxx.xx.190:8485: Can't write, no segment open
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:354)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
 
xxx.xxx.xx.188:8485: Can't write, no segment open
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:354)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
 
        at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:651)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:585)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2752)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2624)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:599)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:112)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:401)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java         )
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
2017-02-17 11:19:01,781 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 679597
2017-02-17 11:19:01,788 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-02-17 11:19:01,795 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at abc.db.co/xxx.xxx.xx.190
1 ACCEPTED SOLUTION

I tested this again.

My earlier statement that restarting the JN on the same machine as the NN causes the NN to fail was a misunderstanding.

I think the cause of the problem is as follows.

When a JN is restarted, the edit log segment that the NN's QJM was writing to is no longer open on that JN (hence "Can't write, no segment open").
The NN cannot write changes to this JN until the edit log is rolled and a new segment is created.
After the JN restart, it took about 3 minutes for the edit log to roll and for the JN to become writable again.

In my case, I restarted one JN and then immediately restarted another. If two of the three JNs cannot accept writes, the NN loses its write quorum and shuts down.

So if I restart the JNs one at a time with an interval (about 5 minutes) between them, this does not happen.
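
To make the timing argument concrete, here is a small Python sketch of the reasoning above. It is only a toy model, not Hadoop code: the function names are made up for illustration, and it assumes the worst case that a restarted JN stays unwritable for a full roll interval (the roughly 3 minutes observed in this thread, which may differ on other clusters).

```python
from typing import Sequence


def majority(n_journal_nodes: int) -> int:
    """Smallest number of JNs that forms a write quorum."""
    return n_journal_nodes // 2 + 1


def quorum_survives(restart_times_s: Sequence[float],
                    roll_interval_s: float = 180.0) -> bool:
    """Return True if a majority of JNs stays writable during a restart plan.

    restart_times_s holds one restart time (in seconds) per JN.
    Assumption: a JN restarted at time r rejects writes for
    r <= t < r + roll_interval_s, i.e. until the next edit log roll.
    """
    n = len(restart_times_s)
    needed = majority(n)
    # The count of unwritable JNs only goes up at a restart instant,
    # so checking those instants covers the worst moments.
    for t in restart_times_s:
        unwritable = sum(
            1 for r in restart_times_s if r <= t < r + roll_interval_s
        )
        if n - unwritable < needed:
            return False
    return True


if __name__ == "__main__":
    # Two JNs restarted 10 seconds apart: only 1 of 3 is writable.
    print(quorum_survives([0, 10, 600]))
    # One JN every 5 minutes: 2 of 3 are always writable.
    print(quorum_survives([0, 300, 600]))
```

With three JNs the quorum is two, so the plan is safe only while at most one JN is waiting for the next log roll. That matches the log above, where two JNs reported "no segment open" and only one write succeeded.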
