Created 09-21-2021 09:47 AM
2021-09-21 11:28:04,945 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.241.109.57:8485, 10.241.109.81:8485, 10.241.109.76:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:197)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:436)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:616)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:613)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1603)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1210)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1898)
at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1756)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1700)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
2021-09-21 11:28:04,949 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.241.109.57:8485, 10.241.109.81:8485, 10.241.109.76:8485], stream=null))
2021-09-21 11:28:04,952 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ada-3.novalocal/10.241.109.51
************************************************************/
Any ideas on how to solve it?
Created 09-22-2021 04:42 AM
Created 09-22-2021 05:11 AM
There are three JNs. The first doesn't look that bad; the other two look really bad:
JN 1:
2021-09-22 09:00:17,953 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
2021-09-22 09:00:17,963 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8485
2021-09-22 09:00:18,001 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2021-09-22 09:00:18,002 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8485: starting
2021-09-22 09:00:20,121 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: Initializing journal in directory /datos3/dfs/jn/nameservice1
2021-09-22 09:00:20,158 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /datos3/dfs/jn/nameservice1/in_use.lock acquired by nodename 26126@ada-4.novalocal
2021-09-22 09:00:20,179 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Scanning storage FileJournalManager(root=/datos3/dfs/jn/nameservice1)
2021-09-22 09:00:20,331 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=/datos3/dfs/jn/nameservice1/current/edits_inprogress_0000000000013015981,first=0000000000013015981,last=0000000000013061158,inProgress=true,hasCorruptHeader=false)
2021-09-22 09:00:20,338 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer: Could not construct Shared Edits Uri
2021-09-22 09:00:20,338 WARN org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer: Other JournalNode addresses not available. Journal Syncing cannot be done
JN 2:
2021-09-22 09:00:51,400 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /datos3/dfs/jn/nameservice1/current/edits_inprogress_0000000000010993186 while determining its valid length. Position was 1048576
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:5264)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:243)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1248)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:327)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:566)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:210)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:162)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:145)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:216)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:228)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27411)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
2021-09-22 09:00:51,401 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1048576
JN 3:
2021-09-22 09:02:47,518 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /datos3/dfs/jn/nameservice1/current/edits_inprogress_0000000000010993186 while determining its valid length. Position was 1048576
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:5264)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:243)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1248)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:327)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:566)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:210)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:162)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:145)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:216)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:228)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:27411)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
2021-09-22 09:02:47,518 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: After resync, position is 1048576
On those JNs, after that last line, nothing else was logged until I stopped the service.
Created on 09-22-2021 05:28 AM - edited 09-22-2021 05:39 AM
Hi @doncucumber,
From these JN logs we can see that edits_inprogress_0000000000010993186 was not the latest edit log: on JN1 the edits_inprogress transaction ID is 13015981, but on the problematic JN2 and JN3 it is still 10993186.
You could try the following steps (a rough command sketch follows the list):
1. Stop the whole HDFS service, including all NNs and JNs.
2. On JN2 and JN3, which both show the same error, move the edits directory (/datos3/dfs/jn/nameservice1/current/) to another location, for example /tmp.
3. Copy the good edits directory (/datos3/dfs/jn/nameservice1/current/) from JN1 to the problematic JN nodes. Now you have manually synced up all the JNs' edits directories.
4. Start HDFS.
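Roughly, steps 2 and 3 would look like the sketch below. This is only an illustration, using the /datos3/dfs/jn/nameservice1 path from your logs; "jn2" and "jn3" are placeholder hostnames, and everything must be run while HDFS is fully stopped:

# On JN2 and JN3: move the stale edits directory out of the way
mv /datos3/dfs/jn/nameservice1/current /tmp/jn_current.bak

# On JN1 (the good one): copy its edits directory over to JN2 and JN3
scp -r /datos3/dfs/jn/nameservice1/current jn2:/datos3/dfs/jn/nameservice1/
scp -r /datos3/dfs/jn/nameservice1/current jn3:/datos3/dfs/jn/nameservice1/

# Back on JN2 and JN3: make sure the copied files are owned by the HDFS user
# (hdfs:hdfs is an assumption; match whatever owns the JN directories on your nodes)
chown -R hdfs:hdfs /datos3/dfs/jn/nameservice1/current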
Please let us know if this solution helps.
Thanks,
Will
Created 09-22-2021 06:00 AM
As soon as I saw the logs you asked for, I realized where the problem was on JN 2 and 3. Yesterday I did that exact procedure between JN 2 and 3 without looking at those logs. If I had, I would have copied the files from JN 1 to 2 and 3 and the problem would have been gone long ago.
That's why you need to sleep. After so many hours working straight on this, I skipped that simple but important step.
Thanks for your help.
Created 09-22-2021 06:05 AM
Note: why did this JN show as good and running in CM if it had these problems? Now I'm positive JN 2 already had problems, and as soon as the server hosting JN 3 died, the whole cluster was doomed. If CM had flagged JN 2 as bad some time ago, I would have worked on it and fixed it, so when JN 3 died nothing would have happened to the cluster, right?
Created 09-22-2021 06:30 AM
Hi @doncucumber ,
Now you can have a good rest 🙂
Please check this article for the detailed HA concepts:
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_hag_hdfs_ha_intro.html
The active NN syncs its edit logs to a majority of the JournalNodes, so the standby NN is able to read the edits from the JNs. From your NN log we could see that the recovery failed because it exceeded the 120000ms timeout waiting for a quorum of nodes to respond; that's why I requested the JN logs, to check whether the edits sync-up had failed, and there we found JN2 and JN3's problem.
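If it helps for next time, here is a minimal sketch (not something from this thread) for spotting a lagging JN: list the newest files in the edits directory shown in your logs on each JournalNode and compare the transaction IDs in the file names; a JN whose edits_inprogress_* number is far behind the others is out of sync.

# Run on each JournalNode and compare the newest file names across nodes
ls -lrt /datos3/dfs/jn/nameservice1/current/ | tail -5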
Regards,
Will