Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

NameNode going down due to QJM timeout


I have a 10-node cluster; each node has 2 TB of disk space and 250 GB of RAM. While writing 1 TB of data, the NameNode (HA NameNode) goes down with the error below. I have run this multiple times, and every time it is the same issue.

2016-08-03 05:56:43,002 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 8783ms to send a batch of 4 edits (711 bytes) to remote journal 172.27.27.0:8485
2016-08-03 05:56:43,005 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 172.27.29.0:8485 failed to write txns 330736-330807. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 33 is less than the last promised epoch 34
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
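
For anyone hitting the same symptom, it may help to scan the NameNode log for the slow-send warning quoted above and see which JournalNodes are lagging. The following is only a rough sketch: the regular expression is modelled on the single warning line in this post, and the function name and the 1000 ms threshold are assumptions chosen for illustration, not Hadoop settings.

```python
import re

# Pattern modelled on the "Took ...ms to send a batch" warning quoted in this thread.
SLOW_SEND = re.compile(
    r"Took (?P<ms>\d+)ms to send a batch of (?P<edits>\d+) edits "
    r"\((?P<bytes>\d+) bytes\) to remote journal (?P<journal>\S+)"
)


def slow_journal_sends(lines, threshold_ms=1000):
    """Return (journal, ms) pairs for sends at or above threshold_ms.

    threshold_ms is a hypothetical cutoff for this sketch, not a Hadoop default.
    """
    hits = []
    for line in lines:
        match = SLOW_SEND.search(line)
        if match and int(match.group("ms")) >= threshold_ms:
            hits.append((match.group("journal"), int(match.group("ms"))))
    return hits


# The warning line from the question, used as sample input.
sample = [
    "2016-08-03 05:56:43,002 WARN client.QuorumJournalManager "
    "(IPCLoggerChannel.java:call(406)) - Took 8783ms to send a batch of "
    "4 edits (711 bytes) to remote journal 172.27.27.0:8485",
]

for journal, ms in slow_journal_sends(sample):
    print(f"{journal} took {ms} ms")
```

In practice the lines would come from the NameNode log file rather than a hard-coded list; a JournalNode that shows up repeatedly is a candidate for a disk or network check.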

1 ACCEPTED SOLUTION


So this is what I did: the DataNode and ZooKeeper were writing to the same disk, so ZooKeeper writes were slowing down, which took down all the services that depend on ZooKeeper.

Solution: I stopped the DataNodes on the ZooKeeper machines and then started the job. This has solved the problem for now.
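
A quick way to check for this kind of disk sharing on a host is to compare the device IDs of the two data directories. This is a minimal sketch, not a complete diagnosis: the two paths are hypothetical placeholders for the values of `dfs.datanode.data.dir` and the ZooKeeper `dataDir`, and `st_dev` identifies a filesystem, so two partitions on the same physical disk would still report different devices.

```python
import os


def same_device(path_a, path_b):
    """Return True if both paths are on the same filesystem device (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Hypothetical paths; replace with your dfs.datanode.data.dir and
    # ZooKeeper dataDir values.
    datanode_dir = "/hadoop/hdfs/data"
    zookeeper_dir = "/hadoop/zookeeper"

    if os.path.exists(datanode_dir) and os.path.exists(zookeeper_dir):
        if same_device(datanode_dir, zookeeper_dir):
            print("DataNode and ZooKeeper directories share a device")
        else:
            print("DataNode and ZooKeeper directories are on different devices")
    else:
        print("Paths not found; adjust the placeholder paths for this host")
```

If the directories do share a device, moving the ZooKeeper data directory to its own disk is an alternative to stopping the DataNodes on those machines.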
