Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

hdfs journalnode fail, can not start

avatar
New Contributor

repeat this log: 4/14, 16:53:28.739 WARN org.apache.hadoop.hdfs.server.namenode.FSImage Caught exception after scanning through 0 ops from /data01/dfs/jn/pt-nameservice/current/edits_inprogress_0000000000182289435 while determining its valid length. Position was 987136 java.io.IOException: Can't scan a pre-transactional edit log. at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4592) at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245) at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355) at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551) at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:193) at org.apache.hadoop.hdfs.qjournal.server.Journal.(Journal.java:153) at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:93) at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:102) at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.newEpoch(JournalNodeRpcServer.java:136) at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.newEpoch(QJournalProtocolServerSideTranslatorPB.java:133) at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25417) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

1 ACCEPTED SOLUTION

avatar
Master Collaborator

This check seems to be added from https://issues.apache.org/jira/browse/HDFS-8965

 

/data01/dfs/jn/pt-nameservice/current/edits_inprogress_0000000000182289435 is either corrupted, or perhaps It is a log from before an upgrade?

 

Either way, To get it back I would suggest backing up the /data01/dfs/jn/pt-nameservice/current/ directory somewhere else and copy the journalnode data from one of your other journalnodes to that location.

View solution in original post

10 REPLIES 10

avatar
Master Collaborator

This check seems to be added from https://issues.apache.org/jira/browse/HDFS-8965

 

/data01/dfs/jn/pt-nameservice/current/edits_inprogress_0000000000182289435 is either corrupted, or perhaps It is a log from before an upgrade?

 

Either way, To get it back I would suggest backing up the /data01/dfs/jn/pt-nameservice/current/ directory somewhere else and copy the journalnode data from one of your other journalnodes to that location.

avatar
New Contributor

Hi Ben,


I have configure a cluster HDFS HA using the Quorum Journal Manager with 5 Journal Node, on one of them I receive the same error as BaiTing, If I copy the journal files from a working Journal Node should resolve the issue?

 

Thanks in advance!

avatar
Master Collaborator

Yes, correct, moving out the old, corrupted files and copying in the files from a working Journal Node should allow you to start the journalnode.

avatar
New Contributor

It works thanks!

avatar
New Contributor

I am also facing the same issue but have a question: Do we need to stop HDFS ervice or put it to maintenance/read-only mode before copying the data? Otherwise, there may be data being written to the health journalnode (from which data will be copied).

avatar
Master Collaborator

you shouldn't have to, as when the journalnode starts up, it should recognize that it is behind and then sync up with the other journalnodes. (similar things happen when you stop a journalnode and then restart it later)

avatar
Explorer
You don't have to stop any instance nor service on the cluster.
The error messages stop after 30~60 seconds.

avatar
New Contributor

it works thanks

 

avatar
New Contributor

Hi,ben

it does work for me,thanks!