
NameNode going down due to QJM timeout


I have a cluster with 10 nodes; each node has 2 TB of disk space and 250 GB of RAM. While writing 1 TB of data, the NameNode [HA NameNode] goes down with the error below. I have run this multiple times, and every time it is the same issue.

2016-08-03 05:56:43,002 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 8783ms to send a batch of 4 edits (711 bytes) to remote journal 172.27.27.0:8485

2016-08-03 05:56:43,005 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 172.27.29.0:8485 failed to write txns 330736-330807. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 33 is less than the last promised epoch 34
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
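
For anyone hitting the same trace: the first WARN shows an edit batch taking 8783 ms to reach a JournalNode, and "IPC's epoch 33 is less than the last promised epoch 34" typically means another NameNode has since been promoted to writer (the slowness triggered a failover), so this NameNode gets fenced and aborts. A quick way to confirm disk contention on the JournalNode/ZooKeeper hosts is to check which device backs each service's data directory and watch its latency while the 1 TB write runs. The paths below are placeholders, not values from this cluster; substitute the directories from your own hdfs-site.xml and zoo.cfg.

# Rough diagnostic sketch, run on a JournalNode/ZooKeeper host; all paths are assumptions.
JN_EDITS_DIR=/hadoop/journal        # dfs.journalnode.edits.dir (assumed)
DN_DATA_DIR=/hadoop/hdfs/data       # dfs.datanode.data.dir (assumed)
ZK_DATA_DIR=/hadoop/zookeeper       # ZooKeeper dataDir from zoo.cfg (assumed)

# If these resolve to the same filesystem/device, the JournalNode and ZooKeeper
# are competing with the bulk HDFS write for the same disk.
df -h "$JN_EDITS_DIR" "$DN_DATA_DIR" "$ZK_DATA_DIR"

# Watch per-device latency while the job runs; sustained high await/%util on that
# device lines up with the multi-second edit batches in the NameNode log above.
iostat -dx 5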

1 ACCEPTED SOLUTION


So this is what I did: since the DataNode and ZooKeeper were writing to the same disk, the ZooKeeper writes were slowing down, which caused all the services that depend on ZooKeeper to go down.

Solution: Stopped the DataNodes on the ZooKeeper machines and started the job again. This has solved the problem for now.
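
A minimal sketch of that workaround, assuming a Hadoop 2.x cluster managed with the stock hadoop-daemon.sh script on the remote PATH (hostnames are placeholders; on an Ambari-managed cluster you would stop the DataNode components from the UI instead):

# Stop the DataNode daemon on each host that also runs ZooKeeper (hostnames assumed),
# running as the HDFS service user.
for host in zk-node1 zk-node2 zk-node3; do
  ssh "$host" "hadoop-daemon.sh --config \$HADOOP_CONF_DIR stop datanode"
done

# Run the write job, then bring the DataNodes back afterwards with
# "hadoop-daemon.sh --config \$HADOOP_CONF_DIR start datanode" on the same hosts.

Stopping the DataNodes on the ZooKeeper hosts does shrink write capacity, so a longer-term fix would likely be to give ZooKeeper (and the JournalNode edits directory) a disk that the DataNode does not write to.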

