
NameNode going down due to QJM timeout


I have a cluster with 10 nodes, each node having 2 TB of disk space and 250 GB of RAM. While writing 1 TB of data, the NameNode (HA setup) goes down with the error below. I have run this multiple times, and every time it is the same issue.

2016-08-03 05:56:43,002 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 8783ms to send a batch of 4 edits (711 bytes) to remote journal 172.27.27.0:8485

2016-08-03 05:56:43,005 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 172.27.29.0:8485 failed to write txns 330736-330807. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 33 is less than the last promised epoch 34
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
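For context on the exception: the epoch check is QJM's fencing mechanism. "IPC's epoch 33 is less than the last promised epoch 34" means the JournalNodes have already promised a newer epoch to another writer, i.e. a failover had occurred and the standby NameNode took over as epoch 34, so the old active (epoch 33) is fenced out and aborts. The 8783ms batch write in the first line is the underlying symptom: the journal disk was too slow to acknowledge edits in time.

While investigating slow journal disks, a common stop-gap is to give the QJM client more patience before it gives up on a quorum write. A minimal hdfs-site.xml sketch; the property name is the standard QJM setting, but the 60-second value is only illustrative, not a tuned recommendation:

<property>
  <!-- How long the NameNode waits for a quorum of JournalNodes to
       acknowledge a batch of edits. The default is 20000 ms; 60000
       here is an illustrative value, not a recommendation. -->
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>

This only buys headroom; if the JournalNode disks stay saturated, the active NameNode will still eventually lose its quorum and abort.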

1 ACCEPTED SOLUTION


So this is what I did: since the DataNode and ZooKeeper were writing to the same disk, the ZooKeeper writes were slowing down, which in turn took down all the services that depend on ZooKeeper.

Solution: brought down the DataNodes on the ZooKeeper machines and restarted the job. This has solved the problem for now.
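Beyond stopping the colocated DataNodes, the longer-term fix for this failure mode is to give ZooKeeper (and the JournalNodes, which are often colocated and equally latency-sensitive) a dedicated disk, so heavy HDFS write traffic cannot starve their transaction logs. A sketch, assuming a spare disk mounted at /dedicated (the mount point is hypothetical; the property and setting names are the standard ones):

hdfs-site.xml, on the JournalNode hosts:

<property>
  <!-- Keep the JournalNode edit log off the DataNode data disks. -->
  <name>dfs.journalnode.edits.dir</name>
  <value>/dedicated/jn</value>
</property>

zoo.cfg, on the ZooKeeper hosts (dataLogDir matters most for write latency, since it holds the fsync-heavy transaction log):

dataDir=/dedicated/zookeeper
dataLogDir=/dedicated/zookeeper-txnlog

With the logs on their own spindle or SSD, the DataNodes can usually be brought back onto those machines without starving the quorum writes.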

