Created 07-21-2023 04:49 AM
Hello All,
I'm troubleshooting the following issue with our Cloudera Nutch cluster and would appreciate any help the community can offer:
We have two NameNode roles and three JournalNode roles running. However, both NameNode roles are failing to start and reporting the error below (IP addresses obfuscated). This began after a restart of the underlying hosts.
Any recommendations for a recovery path from this error would be greatly appreciated.
Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [x.x.x.95:8485, x.x.x.86:8485, x.x.x.130:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 1 successful responses:
x.x.x.130:8485: null [success]
2 exceptions thrown:
10.103.28.95:8485: tried to access method com.google.common.collect.Range.<init>(Lcom/google/common/collect/Cut;Lcom/google/common/collect/Cut;)V from class com.google.common.collect.Ranges
at com.google.common.collect.Ranges.create(Ranges.java:76)
at com.google.common.collect.Ranges.closed(Ranges.java:98)
at org.apache.hadoop.hdfs.qjournal.server.Journal.txnRange(Journal.java:872)
at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:806)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:206)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:261)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
10.103.28.86:8485: tried to access method com.google.common.collect.Range.<init>(Lcom/google/common/collect/Cut;Lcom/google/common/collect/Cut;)V from class com.google.common.collect.Ranges
at com.google.common.collect.Ranges.create(Ranges.java:76)
at com.google.common.collect.Ranges.closed(Ranges.java:98)
at org.apache.hadoop.hdfs.qjournal.server.Journal.txnRange(Journal.java:872)
at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:806)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:206)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:261)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnclosedSegment(QuorumJournalManager.java:345)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:455)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:621)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1408)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1201)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1717)
at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1590)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1351)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
Created 07-21-2023 10:25 AM
@idodds Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our HDFS experts @blizano and @pajoshi who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres
Created 07-21-2023 10:52 AM
Hello @idodds ,
Your NameNode is failing to reach a quorum of JournalNodes (2 of 3 required; only 1 responded successfully).
Could you check and share any errors or warnings logged on the two remote JournalNode hosts?
Thank you
Parth Joshi
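For readers following along, the "quorum size 2/3" in the exception reads as a simple majority of the three JournalNodes. A minimal Java sketch of that arithmetic is below; the class and method names (QuorumCheck, majority, hasQuorum) are hypothetical and are not Hadoop APIs.

```java
public class QuorumCheck {
    // Simple majority among n nodes: floor(n / 2) + 1.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // True when enough nodes answered successfully to form a majority.
    static boolean hasQuorum(int successes, int n) {
        return successes >= majority(n);
    }

    public static void main(String[] args) {
        int journalNodes = 3; // three JournalNode roles, as in the original post
        int successes = 1;    // only x.x.x.130 responded successfully
        System.out.println("required=" + majority(journalNodes)
                + " reached=" + hasQuorum(successes, journalNodes));
    }
}
```

With three JournalNodes and one successful response this prints `required=2 reached=false`, which matches the failure reported above.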
Created 07-21-2023 11:08 AM
Hi, thank you for responding. I'm replying on behalf of @idodds. Both of the nodes report the same or similar errors, as shown below:
Jul 21, 8:33:30.310 AM INFO org.apache.hadoop.hdfs.qjournal.server.Journal
Updating lastPromisedEpoch from 172 to 173 for client /x.y.z.30
Jul 21, 8:33:30.312 AM INFO org.apache.hadoop.hdfs.qjournal.server.Journal
Scanning storage FileJournalManager(root=/dfs/journal-edits/nutch-nameservice1)
Jul 21, 8:33:30.329 AM INFO org.apache.hadoop.hdfs.qjournal.server.Journal
Latest log is EditLogFile(file=/dfs/journal-edits/nutch-nameservice1/current/edits_inprogress_0000000000256541217,first=0000000000256541217,last=0000000000256541842,inProgress=true,hasCorruptHeader=false)
Jul 21, 8:33:30.339 AM INFO org.apache.hadoop.hdfs.qjournal.server.Journal
getSegmentInfo(256541217): EditLogFile(file=/dfs/journal-edits/nutch-nameservice1/current/edits_inprogress_0000000000256541217,first=0000000000256541217,last=0000000000256541842,inProgress=true,hasCorruptHeader=false) -> startTxId: 256541217 endTxId: 256541842 isInProgress: true
Jul 21, 8:33:30.340 AM INFO org.apache.hadoop.hdfs.qjournal.server.Journal
Prepared recovery for segment 256541217: segmentState { startTxId: 256541217 endTxId: 256541842 isInProgress: true } lastWriterEpoch: 38 lastCommittedTxId: 256541843
Jul 21, 8:33:30.358 AM INFO org.apache.hadoop.hdfs.qjournal.server.Journal
getSegmentInfo(256541217): EditLogFile(file=/dfs/journal-edits/nutch-nameservice1/current/edits_inprogress_0000000000256541217,first=0000000000256541217,last=0000000000256541842,inProgress=true,hasCorruptHeader=false) -> startTxId: 256541217 endTxId: 256541842 isInProgress: true
Jul 21, 8:33:30.358 AM INFO org.apache.hadoop.hdfs.qjournal.server.Journal
Synchronizing log startTxId: 256541217 endTxId: 256541843 isInProgress: true: old segment startTxId: 256541217 endTxId: 256541842 isInProgress: true is not the right length
Jul 21, 8:33:30.358 AM WARN org.apache.hadoop.ipc.Server
IPC Server handler 1 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.acceptRecovery from x.y.z.30:37022 Call#17 Retry#0
java.lang.IllegalAccessError: tried to access method com.google.common.collect.Range.<init>(Lcom/google/common/collect/Cut;Lcom/google/common/collect/Cut;)V from class com.google.common.collect.Ranges
at com.google.common.collect.Ranges.create(Ranges.java:76)
at com.google.common.collect.Ranges.closed(Ranges.java:98)
at org.apache.hadoop.hdfs.qjournal.server.Journal.txnRange(Journal.java:872)
at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:806)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:206)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:261)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
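A java.lang.IllegalAccessError like the one above is raised at run time when one class calls a method it is not allowed to access. That often, though not always, points to two classes being loaded from mismatched versions of the same library. As a rough diagnostic sketch, the Java program below prints where the JVM loads the two Guava classes named in the trace. The class name JarLocator and the helper locate are hypothetical, and the result is only meaningful if it is run with the same classpath as the JournalNode process.

```java
import java.security.CodeSource;

public class JarLocator {
    // Returns the location (typically a jar URL) a class is loaded from,
    // or a short note when the class cannot be loaded or has no code source.
    static String locate(String className) {
        try {
            // Load without initializing, so no static initializers run.
            Class<?> c = Class.forName(className, false,
                    JarLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null
                    ? "(bootstrap or unknown source)"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // The two classes named in the IllegalAccessError message.
        String[] names = {
            "com.google.common.collect.Range",
            "com.google.common.collect.Ranges"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the two classes resolve to different jars, that would be worth raising in this thread along with the jar paths.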