Name Node going down due to QJM timeout

sgowda — Wed, 03 Aug 2016 13:45:41 GMT

I have a cluster of 10 nodes, each with 2 TB of disk space and 250 GB of RAM. While writing 1 TB of data, the NameNode (HA NameNode) goes down with the error below. I have run this multiple times, and every time it is the same issue.

2016-08-03 05:56:43,002 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 8783ms to send a batch of 4 edits (711 bytes) to remote journal 172.27.27.0:8485

2016-08-03 05:56:43,005 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 172.27.29.0:8485 failed to write txns 330736-330807. Will try to write to this JN again after the next log roll.

org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 33 is less than the last promised epoch 34
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
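
For anyone triaging a similar NameNode log, here is a minimal sketch of how the slow-write warnings could be pulled out of it. It only assumes the message format seen in the first warning above ("Took ...ms to send a batch of ... edits"). The function name `slow_journal_writes` and the 1000 ms default threshold are made up for illustration and are not Hadoop settings.

```python
import re

# Pattern for the QuorumJournalManager warning shown in the log excerpt.
# It is based only on that one message format, so other Hadoop versions
# may word the warning differently.
SLOW_WRITE = re.compile(
    r"Took (?P<ms>\d+)ms to send a batch of (?P<edits>\d+) edits "
    r"\((?P<bytes>\d+) bytes\) to remote journal (?P<journal>\S+)"
)


def slow_journal_writes(lines, threshold_ms=1000):
    """Return (journal, ms) for each slow-write warning at or above threshold_ms.

    threshold_ms is an arbitrary illustrative cutoff, not a Hadoop default.
    """
    hits = []
    for line in lines:
        match = SLOW_WRITE.search(line)
        if match and int(match.group("ms")) >= threshold_ms:
            hits.append((match.group("journal"), int(match.group("ms"))))
    return hits
```

Feeding it the lines of a NameNode log (for example `slow_journal_writes(open(path))`, with `path` pointing at your own log file) gives a quick list of which JournalNodes were slow and by how much.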

Re: Name Node going down due to QJM timeout

sgowda — Thu, 04 Aug 2016 13:58:41 GMT

Here is what I found: the DataNode and ZooKeeper were writing to the same disk, so ZooKeeper writes were slowing down, and as a result all the services that depend on ZooKeeper were going down.

Solution: I stopped the DataNodes on the ZooKeeper machines and then started the job. This has solved the problem for now.
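
To check for the same kind of disk sharing on another host, a rough sketch like the one below may help. It compares the device IDs of the ZooKeeper directories (the `dataDir` / `dataLogDir` values from `zoo.cfg`) with those of the DataNode directories (`dfs.datanode.data.dir`). The function name `dirs_sharing_device` is hypothetical, and the check has a known limitation: `st_dev` identifies a filesystem, so two partitions on the same physical disk would not be flagged.

```python
import os


def dirs_sharing_device(zk_dirs, dn_dirs):
    """Return (zk_dir, dn_dir) pairs that live on the same filesystem device.

    Compares os.stat().st_dev, which identifies the mounted filesystem,
    not the physical disk underneath it.
    """
    pairs = []
    for zk_dir in zk_dirs:
        zk_dev = os.stat(zk_dir).st_dev
        for dn_dir in dn_dirs:
            if os.stat(dn_dir).st_dev == zk_dev:
                pairs.append((zk_dir, dn_dir))
    return pairs
```

For example, `dirs_sharing_device(["/var/lib/zookeeper"], ["/data1/hdfs/dn"])` would return a non-empty list if both directories are on the same mount. Those two paths are placeholders, so substitute the values from your own configuration.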
