Created 08-03-2016 06:45 AM
I have a cluster with 10 nodes, each with 2 TB of disk space and 250 GB of RAM. While writing 1 TB of data, the NameNode (HA NameNode) goes down with the error below. I have run this multiple times, and every time it is the same issue.
2016-08-03 05:56:43,002 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 8783ms to send a batch of 4 edits (711 bytes) to remote journal 172.27.27.0:8485
2016-08-03 05:56:43,005 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 172.27.29.0:8485 failed to write txns 330736-330807. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 33 is less than the last promised epoch 34
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
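From what I understand, the "IPC's epoch 33 is less than the last promised epoch 34" message is the QJM fencing check: the JournalNodes have already promised a newer epoch to another writer, meaning the standby NameNode was promoted and this NameNode lost its writer role, which usually follows a failover triggered by slow JournalNode or ZooKeeper I/O (note the 8783 ms to ship a 711-byte batch of edits). If the JournalNode disks are simply slow or shared, one knob that is sometimes raised is the QJM write timeout in hdfs-site.xml -- the snippet below is only a sketch with illustrative values, and raising timeouts does not fix the underlying disk contention:

<!-- hdfs-site.xml; the defaults for these two timeouts are 20000 ms -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>60000</value>
</property>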
Created 08-04-2016 06:58 AM
So this is what I did: since the DataNode and ZooKeeper were writing to the same disk, the ZooKeeper writes were slowing down, which in turn took down all the services that depend on ZooKeeper.
Solution: brought down the DataNodes on the ZooKeeper machines and started the job -- this has solved the problem for now.
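For anyone hitting the same issue, here is a rough sketch of that workaround, assuming a Hadoop 2.x install with the stock daemon scripts (paths are illustrative; on an Ambari-managed cluster you would stop the DataNode from the Ambari UI instead):

# On each ZooKeeper host, stop the co-located DataNode so it no longer competes for the disk
$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop datanode

# Longer term, giving ZooKeeper a dedicated disk avoids the contention without losing DataNodes,
# e.g. in zoo.cfg (illustrative paths):
#   dataDir=/zkdisk/zookeeper
#   dataLogDir=/zkdisk/zookeeper/txnlog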