Standby NameNode shut down by itself during journal quorum sync (log below)

Explorer

This is the log:

2017-05-25 14:40:37,470 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: .79:50010 is added to blk_3006468944_1934832460{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-4a898333-0a0f-4197-944e-e2f499645e4c:NORMAL: .55:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-67e857f6-31e6-4d16-b079-7eac06de1f22:NORMAL: .79:50010|FINALIZED]]} size 0
2017-05-25 14:40:37,740 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: replaying edit log: 137515/140868 transactions completed. (98%)
2017-05-25 14:40:37,766 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file http:// .local:8480/getJournal?jid=nameservice1&segmentTxId=20804872533&storageInfo=-60%3A78118620%3A1469591854075%3Acluster24, http://XXX.XXX:8480/getJournal?jid=nameservice1&segmentTxId=20804872533&storageInfo=-60%3A78118620%3..., http://XXX.XXX:8480/getJournal?jid=nameservice1&segmentTxId=20804872533&storageInfo=-60%3A78118620%3... of size 28215334 edits # 140868 loaded in 3 seconds
2017-05-25 14:41:18,563 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
292 active threads
Thread 261120 (492304899@qtp-1369470652-3169):
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Thread 261102 (Logger channel (from single-thread executor) to XXX.XXX/ .4:8485):
State: WAITING
Blocked count: 8
Waited count: 10
Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5632f864
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:114)
org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:29186)
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:415)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
Thread 297 (IPC Server handler 106 on 8022):
State: WAITING
Blocked count: 5299
Waited count: 2047289
Waiting on java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@35deedf3
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:964)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)
Thread 2 (Reference Handler):
State: WAITING
Blocked count: 109242
Waited count: 108895
Waiting on java.lang.ref.Reference$Lock@5f243f31
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
Thread 1 (main):
State: WAITING
Blocked count: 9
Waited count: 6
Waiting on org.apache.hadoop.ipc.ProtobufRpcEngine$Server@20487a6c
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
org.apache.hadoop.ipc.Server.join(Server.java:2508)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.join(NameNodeRpcServer.java:443)
org.apache.hadoop.hdfs.server.namenode.NameNode.join(NameNode.java:889)
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1613)

2017-05-25 14:41:27,207 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 140868 edits starting from txid 20804872532
2017-05-25 14:41:27,207 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
2017-05-25 14:41:27,208 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Processing 15 messages from DataNodes that were previously queued during standby state
2017-05-25 14:41:27,208 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Received an RBW replica for blk_3005346895_1933976855 on .124:50010: ignoring it, since it is complete with the same genstamp
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995704659 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 155346
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995704659_1924067671(same as stored) on .148:50010
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995704659_1924067671(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995704823 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 257707
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995704823_1924067835(same as stored) on .148:50010
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995704823_1924067835(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995704820 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 129686
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995704820_1924067832(same as stored) on .148:50010
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995704820_1924067832(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995704331 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 3027
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995704331_1924067343(same as stored) on .148:50010
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995704331_1924067343(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995704571 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 2969
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995704571_1924067583(same as stored) on .148:50010
2017-05-25 14:41:27,208 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995704571_1924067583(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995704519 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 3103
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995704519_1924067531(same as stored) on .148:50010
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995704519_1924067531(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995704285 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 3349
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995704285_1924067297(same as stored) on .148:50010
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995704285_1924067297(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995703869 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 518573
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995703869_1924066881(same as stored) on .148:50010
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995703869_1924066881(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995703867 added as corrupt on .148:50010 by / because block is COMPLETE and reported length 0 does not match length in block map 36769
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995703867_1924066879(same as stored) on .148:50010
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995703867_1924066879(same as stored) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_2995701937 added as corrupt on .148:50010 by / because reported RWR replica with genstamp 1924064949 does not match COMPLETE block's genstamp in block map 1924307850
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlock: blk_2995701937_1924064949(stored=blk_2995701937_1924307850) on .148:50010
2017-05-25 14:41:27,209 INFO BlockStateChange: BLOCK* invalidateBlocks: postponing invalidation of blk_2995701937_1924064949(stored=blk_2995701937_1924307850) on .148:50010 because 4 replica(s) are located on nodes with potentially out-of-date block reports
2017-05-25 14:41:27,209 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Received an RBW replica for blk_3005532032_1934102219 on .73:50010: ignoring it, since it is complete with the same genstamp
2017-05-25 14:41:27,209 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Received an RBW replica for blk_3005613028_1934020244 on .124:50010: ignoring it, since it is complete with the same genstamp
2017-05-25 14:41:27,209 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Received an RBW replica for blk_3005491542_1933974671 on .128:50010: ignoring it, since it is complete with the same genstamp
2017-05-25 14:41:27,209 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Received an RBW replica for blk_3005491542_1933974671 on .93:50010: ignoring it, since it is complete with the same genstamp
2017-05-25 14:41:27,209 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues
2017-05-25 14:41:27,209 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing replication queues
2017-05-25 14:41:27,209 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing edit logs at txnid 20805013401
2017-05-25 14:41:27,305 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 20805013401
2017-05-25 14:41:27,346 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 20805013401 failed for required journal (JournalAndStream(mgr=QJM to [ , .4:8485, .5:8485, .6:8485, .14:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/5. 5 exceptions thrown:
: IPC's epoch 46 is less than the last promised epoch 47
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
at org.apache.hadoop.hdfs.qjournal.server.Journal.startLogSegment(Journal.java:513)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:162)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

.14:8485: IPC's epoch 46 is less than the last promised epoch 47
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
at org.apache.hadoop.hdfs.qjournal.server.Journal.startLogSegment(Journal.java:513)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:162)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

.5:8485: IPC's epoch 46 is less than the last promised epoch 47
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
at org.apache.hadoop.hdfs.qjournal.server.Journal.startLogSegment(Journal.java:513)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:162)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

.4:8485: IPC's epoch 46 is less than the last promised epoch 47
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
at org.apache.hadoop.hdfs.qjournal.server.Journal.startLogSegment(Journal.java:513)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:162)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

.6:8485: IPC's epoch 46 is less than the last promised epoch 47
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
at org.apache.hadoop.hdfs.qjournal.server.Journal.startLogSegment(Journal.java:513)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:162)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:198)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1206)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:316)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1248)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1767)
at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1640)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1375)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
2017-05-25 14:41:27,348 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-05-25 14:41:27,375 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

/usr/java/jdk1.7.0_55/bin/java -Dproc_namenode -Xmx1000m -Dhdfs.audit.logger=INFO,RFAAUDIT -Dsecurity.audit.logger=INFO,RFAS -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/agoda/log/hadoop-hdfs -Dhadoop.log.file=hadoop-cmf-hdfs-NAMENODE- .local.log.out -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xms137438953472 -Xmx137438953472 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8004 -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode

7 REPLIES

Champion

This post covers it pretty well.

The short of it is that your cluster entered an inconsistent state for NN HA because of some other issue, which the post details.

https://community.hortonworks.com/questions/41255/how-to-debug-the-issue-ipcs-epoch-x-is-less-than-t...
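If you want to confirm which NameNode ended up holding the active role (and therefore the newer journal epoch), checking both HA states is a quick sanity test. A minimal sketch, assuming nn1 and nn2 are the NameNode IDs defined in your hdfs-site.xml (substitute your own):

# query the HA state of each NameNode in the nameservice
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

If the other NameNode reports active, the "epoch 46 is less than the last promised epoch 47" message just means this node was fenced out at the JournalNodes by the newer writer.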

Expert Contributor

It looks like the issue described in https://issues.apache.org/jira/browse/HDFS-11254 (Standby NameNode may crash during failover if loading edits takes too long).

2017-05-25 14:40:37,740 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: replaying edit log: 137515/140868 transactions completed. (98%)

2017-05-25 14:41:27,207 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 140868 edits starting from txid 20804872532

It took about 50 seconds to load the edits. Edit log loading must hold the NameNode lock, so the ZKFC may fail to establish a connection with the NameNode during that time.
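To measure that delay on your own logs, you can pull the replay start/finish lines out of the NameNode log and compare their timestamps. A rough sketch; the log directory and file name pattern are taken from the java command line in the post above (the host part is redacted there, hence the wildcard):

# list the edit-replay progress and completion lines so the start/finish
# timestamps can be compared
grep -E "replaying edit log|Loaded [0-9]+ edits starting from txid" \
    /agoda/log/hadoop-hdfs/hadoop-cmf-hdfs-NAMENODE-*.log.out

In the log you posted, 14:40:37 to 14:41:27 is roughly 50 seconds, which exceeds the 45-second ZKFC timeout mentioned below.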

at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1640)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1375)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

It looks like this happened during the transition from standby to active.

It may be fixed by HDFS-8865 (Improve quota initialization performance).
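For reference, HDFS-8865 parallelizes the quota initialization that happens after edits are loaded. On releases that include it there is, as far as I recall, a tunable along these lines; treat the property name and value as something to verify against your version's hdfs-default.xml rather than as a definitive setting:

<property>
  <name>dfs.namenode.quota.init-threads</name>
  <value>8</value>
  <description>
    Number of threads used for quota initialization after loading the
    namespace/edits (only available on releases that include HDFS-8865;
    8 is just an example value).
  </description>
</property>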

I suspect the stack dump in the log is not complete. If this is the issue that HDFS-8865 addresses, you would see a stack trace like:

Thread 188 (IPC Server handler 25 on 8022):
  State: RUNNABLE
  Blocked count: 278
  Waited count: 17419
  Stack:
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:886)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
    org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuota(FSImage.java:875)
    org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:860)
    org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:827)
    org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
    org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:188)
    org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:182)
    java.security.AccessController.doPrivileged(Native Method)
    javax.security.auth.Subject.doAs(Subject.java:415)
    org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
    org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:477)
    org.apache.hadoop.security.SecurityUtil.doAsLoginUser(SecurityUtil.java:458)

A workaround is to increase the ZKFC connection timeout value. The default is 45 seconds, IIRC. Doubling this number should alleviate the problem.

Explorer

Thank you, I will try it 

Champion

I believe this ZooKeeper timeout property should be put inside conf/zoo.cfg.

Correct me if I am wrong.

Also, can you tell me which one of these properties should be doubled to avoid this issue? I am a newbie when it comes to ZooKeeper.

tickTime=2000
initLimit=5
syncLimit=2

Champion
@csguna The ZKFC is the NN HA Failover Controller that uses ZK for fencing. The timeout would be for the ZK client within ZKFC.

In my experience, I have found that I need to set all ZK client timeouts to 30 seconds, up from the default of 15 seconds. I believe that future releases have 30 seconds as the default anyway.

I am having trouble finding the exact setting name. I'll keep looking and come back when I find it, if someone else hasn't first.

Expert Contributor

The following is the HDFS NameNode configuration that can be updated to alleviate the issue:

<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>45000</value>
  <description>
    Timeout for the actual monitorHealth() calls.
  </description>
</property>

I suggest bumping ha.health-monitor.rpc-timeout.ms from 45000 (milliseconds) to 90000 and seeing if it helps.
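Applied, the bumped value would look like the sketch below. Where it lives depends on how you manage configuration (for example core-site.xml directly, or the corresponding safety valve in Cloudera Manager), so double-check that for your distribution:

<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>90000</value>
  <description>
    Timeout for the actual monitorHealth() calls (doubled from the
    45000 ms default quoted above).
  </description>
</property>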

Explorer

Thank you, I am going to verify it.