
Urgent! One of the NameNodes shuts down in HA, no jobs running

Expert Contributor

Hi,

This happened yesterday and again today. Yesterday the error looked like a Hive metastore connection failure, but it was actually the standby NameNode that had shut down, which I only noticed after restarting Hive. I restarted the standby and everything was fine, and then the same thing happened today. The error in the HDFS NameNode log looks like this:

2018-01-14 07:07:00,039 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal x.x.x.x:8485 failed to write txns 641590896-641590896. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 52 is less than the last promised epoch 53
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
    at org.apache.hadoop.ipc.Client.call(Client.java:1496)
    at org.apache.hadoop.ipc.Client.call(Client.java:1396)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at com.sun.proxy.$Proxy11.journal(Unknown Source)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2018-01-14 07:07:00,039 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal x.x.x.x:8485 failed to write txns 641590896-641590896. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 52 is less than the last promised epoch 53
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)

.....

2018-01-14 07:07:00,041 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [x.x.x.x:8485, x.x.x.x:8485, x.x.x.x:8485], stream=QuorumOutputStream starting at txid 641590896))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
x.x.x.x:8485: IPC's epoch 52 is less than the last promised epoch 53

2018-01-14 07:07:10,063 WARN util.ShutdownHookManager (ShutdownHookManager.java:run(70)) - ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
2018-01-14 07:07:10,063 ERROR hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(950)) - Failed to close inode 119707511
java.io.IOException: Failed to shutdown streamer

2018-01-14 07:07:10,070 INFO provider.BaseAuditHandler (BaseAuditHandler.java:logStatus(312)) - Audit Status Log: name=hdfs.async.batch.hdfs, interval=01:28.634 minutes, events=95, succcessCount=95, totalEvents=2279779, totalSuccessCount=2279779
2018-01-14 07:07:10,077 INFO queue.AuditFileSpool (AuditFileSpool.java:stop(321)) - Stop called, queueName=hdfs.async.batch, consumer=hdfs.async.batch.hdfs
2018-01-14 07:07:10,091 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(362)) - Exiting consumerThread.run() method. name=hdfs.async.batch
2018-01-14 07:07:10,091 INFO queue.AuditFileSpool (AuditFileSpool.java:runLogAudit(867)) - Caught exception in consumer thread. Shutdown might be in progress
2018-01-14 07:07:10,091 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at str1.bi.grn.hcvlny.cv.net/x.x.x.x

4 REPLIES

Explorer

@PJ

You have fewer than (nbJournalNodes/2)+1 JournalNodes online.

If you have configured 3 JournalNodes in your cluster, you need at least 2 of them working to leave safe mode.

Explanation of the JournalNode quorum: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.ht...
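
If you want to quickly confirm how many JournalNodes are actually reachable at a given moment, a small script like the sketch below can poll each JournalNode's web endpoint. This is only a sketch: it assumes the default JournalNode HTTP port 8480 and uses placeholder hostnames, so adjust both for your cluster.

# Sketch: poll each JournalNode's JMX endpoint and count how many respond.
import json
import urllib.request

JOURNAL_NODES = ["jn1.example.com", "jn2.example.com", "jn3.example.com"]  # placeholder hosts; use yours

healthy = 0
for host in JOURNAL_NODES:
    url = "http://%s:8480/jmx" % host          # 8480 is the default JournalNode HTTP port
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            json.load(resp)                    # a parseable JMX answer means the JN web server is up
        healthy += 1
        print("OK      %s" % host)
    except Exception as exc:
        print("FAILED  %s (%s)" % (host, exc))

quorum = len(JOURNAL_NODES) // 2 + 1
print("%d/%d JournalNodes reachable, quorum requires %d" % (healthy, len(JOURNAL_NODES), quorum))

If fewer than the quorum respond, the active NameNode aborts on its next edit-log flush, which matches the FATAL "flush failed for required journal" entry above.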

Master Mentor

@PJ

Here is your answer: the issue is really "IPC's epoch 52 is less than the last promised epoch 53", and to understand what that means, see this.
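
In short, the epoch is QJM's fencing token: each time a NameNode becomes (or re-asserts itself as) the writer, it obtains a new, higher epoch from the JournalNodes, and every JournalNode then rejects writes carrying an older epoch. Below is a stripped-down illustration of that check; the names echo Journal.checkRequest from your stack trace, but this is only a sketch, not the actual Hadoop code.

# Illustration only: a simplified version of the epoch check a JournalNode performs.
class Journal:
    def __init__(self):
        self.last_promised_epoch = 0            # highest epoch this JN has promised to honor

    def new_epoch(self, proposed_epoch):
        # Called when a NameNode claims the writer role (e.g. during a failover).
        if proposed_epoch <= self.last_promised_epoch:
            raise IOError("Proposed epoch %d <= last promised epoch %d"
                          % (proposed_epoch, self.last_promised_epoch))
        self.last_promised_epoch = proposed_epoch

    def journal(self, writer_epoch, txns):
        # Writes from a stale writer are rejected; this is the
        # "IPC's epoch 52 is less than the last promised epoch 53" case.
        if writer_epoch < self.last_promised_epoch:
            raise IOError("IPC's epoch %d is less than the last promised epoch %d"
                          % (writer_epoch, self.last_promised_epoch))
        # ...append txns to the local edit log...

jn = Journal()
jn.new_epoch(52)         # your NameNode became the writer with epoch 52
jn.new_epoch(53)         # another writer (after a failover) was granted epoch 53
try:
    jn.journal(52, [])   # the old writer's next flush is rejected, so it aborts
except IOError as exc:
    print(exc)           # IPC's epoch 52 is less than the last promised epoch 53

So the message means that another writer claimed epoch 53 after your NameNode had obtained epoch 52, typically because a failover was triggered (a ZKFC health-check timeout, a long GC pause, or slow JournalNode writes), and the old active then fenced itself and shut down.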

Expert Contributor

@Benoit Rousseau @Geoffrey Shelton Okot

Yes, I have 3 working JournalNodes, but this issue keeps recurring; it never happened before.

I read about the IPC epoch earlier and checked the link too, but I am not really sure how to resolve this issue. I do not see anything in the GC logs either, and there does not seem to be a network issue.

Explorer

@PJ

This might be due to an I/O issue on the JournalNode host ("Remote journal x.x.x.x:8485"). Is it always the same JN that is lagging when the failure happens? If so, you should check the I/O load on that machine, using iotop for instance.

It can also be the result of a very large volume of transactions. What is the value of dfs.namenode.accesstime.precision?
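
To check what your cluster is actually using, you can run hdfs getconf -confKey dfs.namenode.accesstime.precision on a NameNode host, or read it out of hdfs-site.xml directly. A minimal sketch of the latter is below; it assumes the common /etc/hadoop/conf/hdfs-site.xml path and falls back to the HDFS default of 3600000 ms (one hour) when the property is not set.

# Sketch: read dfs.namenode.accesstime.precision from hdfs-site.xml.
# The path below is the common /etc/hadoop/conf layout; adjust it for your distribution.
import xml.etree.ElementTree as ET

HDFS_SITE = "/etc/hadoop/conf/hdfs-site.xml"
PROP = "dfs.namenode.accesstime.precision"
DEFAULT_MS = 3600000                      # hdfs-default.xml default: one hour

value = DEFAULT_MS
for prop in ET.parse(HDFS_SITE).getroot().findall("property"):
    if prop.findtext("name") == PROP:
        value = int(prop.findtext("value"))
        break

print("%s = %d ms" % (PROP, value))

Setting this property to 0 disables access-time updates entirely, which cuts down the number of edit-log transactions the NameNode has to push to the JournalNodes.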