Created 01-14-2018 12:59 PM
Hi,
This happened yesterday and again today. Yesterday there was an error saying the Hive metastore connection failed, but the actual cause was that the standby NameNode had shut down, which I only noticed after restarting Hive. I restarted the standby and everything was fine, and then the same thing happened today. The error in the HDFS NameNode log looks like this:
2018-01-14 07:07:00,039 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal x.x.x.x:8485 failed to write txns 641590896-641590896. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 52 is less than the last promised epoch 53
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552)
    at org.apache.hadoop.ipc.Client.call(Client.java:1496)
    at org.apache.hadoop.ipc.Client.call(Client.java:1396)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at com.sun.proxy.$Proxy11.journal(Unknown Source)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
    at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2018-01-14 07:07:00,039 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal x.x.x.x:8485 failed to write txns 641590896-641590896. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 52 is less than the last promised epoch 53
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
.....
2018-01-14 07:07:00,041 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [x.x.x.x:8485, x.x.x.x:8485, x.x.x.x:8485], stream=QuorumOutputStream starting at txid 641590896))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
x.x.x.x:8485: IPC's epoch 52 is less than the last promised epoch 53
2018-01-14 07:07:10,063 WARN util.ShutdownHookManager (ShutdownHookManager.java:run(70)) - ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
2018-01-14 07:07:10,063 ERROR hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(950)) - Failed to close inode 119707511
java.io.IOException: Failed to shutdown streamer
2018-01-14 07:07:10,070 INFO provider.BaseAuditHandler (BaseAuditHandler.java:logStatus(312)) - Audit Status Log: name=hdfs.async.batch.hdfs, interval=01:28.634 minutes, events=95, succcessCount=95, totalEvents=2279779, totalSuccessCount=2279779
2018-01-14 07:07:10,077 INFO queue.AuditFileSpool (AuditFileSpool.java:stop(321)) - Stop called, queueName=hdfs.async.batch, consumer=hdfs.async.batch.hdfs
2018-01-14 07:07:10,091 INFO queue.AuditBatchQueue (AuditBatchQueue.java:runLogAudit(362)) - Exiting consumerThread.run() method. name=hdfs.async.batch
2018-01-14 07:07:10,091 INFO queue.AuditFileSpool (AuditFileSpool.java:runLogAudit(867)) - Caught exception in consumer thread. Shutdown might be in progress
2018-01-14 07:07:10,091 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at str1.bi.grn.hcvlny.cv.net/x.x.x.x
Created 01-14-2018 04:37 PM
@PJ
You have fewer than (nbJournalNodes/2)+1 JournalNodes accepting writes.
You need at least 2 working JournalNodes for the NameNode to keep writing its edit log if you have set 3 JournalNodes in your cluster.
Explanation of the JournalNode quorum: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.ht...
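To make the arithmetic concrete, here is a minimal sketch of the majority rule QJM applies (an illustration only, not Hadoop's implementation):

// Minimal sketch of the QJM majority rule (illustration, not Hadoop code).
public class QuorumMath {
    // A write to the shared edit log must be acknowledged by a majority of JNs.
    static int quorumSize(int journalNodes) {
        return journalNodes / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorumSize(3)); // 2: with 3 JNs, one may fail
        System.out.println(quorumSize(5)); // 3: with 5 JNs, two may fail
    }
}

Note that your FATAL line reports "Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown", i.e. all three JNs returned an error for that write (the one shown being the epoch check), rather than the JNs simply being offline.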
Created 01-15-2018 02:37 AM
@Benoit Rousseau @Geoffrey Shelton Okot
Yes, I have 3 working journal nodes, but this issue keeps recurring; it never happened before.
I read about the IPC epoch earlier and checked the link as well, but I am not really sure how to resolve this issue. I do not see anything in the GC logs either, and there seems to be no network issue.
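For reference, the epoch in that message is QJM's write fencing: each time a NameNode becomes active it obtains a new, higher epoch from a quorum of JournalNodes, which then promise to reject writes from any older epoch. "IPC's epoch 52 is less than the last promised epoch 53" therefore means some NameNode negotiated epoch 53 (a failover attempt) while this one was still writing with epoch 52. A minimal sketch of that check, as a simplified illustration only (not Hadoop's actual Journal class):

import java.io.IOException;

// Simplified illustration of the JournalNode epoch (fencing) check behind
// "IPC's epoch X is less than the last promised epoch Y".
class JournalSketch {
    private long lastPromisedEpoch = 0;

    // Called when a NameNode becomes active: it proposes a higher epoch,
    // and this JN promises to refuse writes from any lower epoch afterwards.
    synchronized void newEpoch(long epoch) throws IOException {
        if (epoch <= lastPromisedEpoch) {
            throw new IOException("Proposed epoch " + epoch
                + " <= last promised epoch " + lastPromisedEpoch);
        }
        lastPromisedEpoch = epoch;
    }

    // Called for every batch of edits; a fenced (stale) writer is rejected.
    synchronized void journal(long writerEpoch, byte[] txns) throws IOException {
        if (writerEpoch < lastPromisedEpoch) {
            throw new IOException("IPC's epoch " + writerEpoch
                + " is less than the last promised epoch " + lastPromisedEpoch);
        }
        // ... append txns to the local edit segment ...
    }
}

A common trigger is the active NameNode stalling long enough (GC pause, disk or network hiccup) for the failover controller to promote the standby, which bumps the epoch and fences the old writer.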
Created 01-15-2018 05:50 AM
@PJ
This might be due to an I/O issue on the JN host "Remote journal x.x.x.x:8485". Is it always the same JN that is lagging at failure? If so, you should check the I/O load on that machine, using iotop for instance.
It can also be the result of a very large volume of transactions. What is the value of dfs.namenode.accesstime.precision?
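If access-time updates turn out to account for much of that transaction volume, they can be relaxed or disabled in hdfs-site.xml; the default precision is 3600000 ms (1 hour), and 0 disables atime updates entirely. For example:

<!-- hdfs-site.xml: reduce edit-log transactions from access-time updates.
     Default is 3600000 ms (1 hour); 0 disables atime updates entirely. -->
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>0</value>
</property>

Fewer edit-log transactions means less pressure on the JournalNodes at each flush.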