02-18-2017 01:07 AM
I have 3 JNs: 2 on physical servers and the 3rd on a virtual server with 6 vCores.
Recently, from time to time, the JN on the VM gets out of sync for a few seconds. I checked the VM resources and parameters and nothing looks out of the ordinary. What I do see in the Cloudera Manager metrics is that the journal write bytes are sometimes higher than at other times.
Here is what I see:
The active NameNode was out of sync with this JournalNode.
org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: Can't write txid 1659311573 expecting nextTxId=1659311555
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:371)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
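To get a feel for how far behind the JournalNode was at that moment, the two transaction IDs in the exception message can be compared. The snippet below is only a rough sketch: the function name `txid_gap` is made up for illustration, and the regular expression assumes the message text looks exactly like the line quoted above.

```python
import re

# Matches the message text of the JournalOutOfSyncException quoted above.
# Assumption: other Hadoop versions may word this message differently.
_OUT_OF_SYNC = re.compile(r"Can't write txid (\d+) expecting nextTxId=(\d+)")


def txid_gap(message):
    """Return (received_txid, expected_txid, gap), or None if no match."""
    match = _OUT_OF_SYNC.search(message)
    if match is None:
        return None
    received = int(match.group(1))
    expected = int(match.group(2))
    # The gap is the number of transactions this JN did not record
    # before the NameNode tried to write the received txid.
    return received, expected, received - expected


line = (
    "org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: "
    "Can't write txid 1659311573 expecting nextTxId=1659311555"
)
print(txid_gap(line))
```

For the message above the gap is 18 transactions, which suggests the JN missed a short burst of edits rather than falling far behind.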
02-18-2017 10:08 AM
Even though the VM looks fine, it is probably a resource constraint on the VM that is causing this issue.
The NameNode writes each edit to its own local directory and to all of the JN edits directories. It simply sounds like the VM isn't keeping up or getting the job done in time.
Examine the contents of the JN edits directory on each node and you will find that the VM does not contain all of the necessary edits. You can manually copy the edits_* files to the VM node to get it back in sync and see if it happens again. I do recommend using the same hardware for all three master nodes that run a JN and ZK instance. Otherwise, you will often find yourself just barely maintaining the quorum needed to stay running.
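A quick way to do that comparison is to list the finalized edit segments in two JN edits directories and report what one of them lacks. This is a minimal sketch, not an official tool: the function names are made up for illustration, it assumes finalized segments are named `edits_<firstTxId>-<lastTxId>`, and it assumes both directories are readable from the machine running the script (for example after copying the listing or mounting the remote directory).

```python
import os
import re

# Finalized segments are named edits_<firstTxId>-<lastTxId>.
# In-progress segments (edits_inprogress_<firstTxId>) are skipped on
# purpose, since they legitimately differ between JournalNodes.
_SEGMENT = re.compile(r"^edits_(\d+)-(\d+)$")


def finalized_segments(edits_dir):
    """Return the set of (first_txid, last_txid) pairs found in edits_dir."""
    segments = set()
    for name in os.listdir(edits_dir):
        match = _SEGMENT.match(name)
        if match:
            segments.add((int(match.group(1)), int(match.group(2))))
    return segments


def missing_segments(reference_dir, suspect_dir):
    """Segments present in reference_dir but absent from suspect_dir."""
    missing = finalized_segments(reference_dir) - finalized_segments(suspect_dir)
    return sorted(missing)


# Example call; both paths are placeholders, not real locations:
# for first, last in missing_segments("/path/to/physical-jn/current",
#                                     "/path/to/vm-jn/current"):
#     print("VM is missing txids %d-%d" % (first, last))
```

Any segment it reports for the VM is a candidate for the manual copy described above.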
02-18-2017 10:17 AM
02-18-2017 10:21 AM
02-18-2017 10:25 AM
Is it common to run a JN on a DataNode/NodeManager server?
In my cluster, the 2 NNs are physical. The CM server and the application server that hosts MySQL and Oozie are VMs. All the DataNodes are physical.
02-18-2017 10:33 AM
02-18-2017 10:42 AM
My 3rd ZK was on the same VM, but after I ran into this issue I moved the ZK to another OpenStack server and moved the Spark history server to one of the NNs to reduce the load on the VM. I also increased the VM to 6 vCores, but I still have the same issue.
02-20-2017 06:46 PM
The interesting thing I noticed is that when this happens, some jobs that run once a day are writing a relatively large amount of data to HDFS at the same time, and they run with a large number of reducers (between 400 and 1100). This makes me suspect that the blocks written by these jobs are causing the VM to lag. I am trying to find a way to prove this.
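One simple way to test that suspicion is to check how many of the out-of-sync events fall inside the run windows of those daily jobs. The sketch below is only an illustration: the function names are made up, and the event times and job windows would have to be collected by hand from the JN log and the job history.

```python
from datetime import datetime


def events_in_windows(events, windows):
    """Return the events that fall inside any (start, end) window, inclusive."""
    return [
        event
        for event in events
        if any(start <= event <= end for start, end in windows)
    ]


def overlap_ratio(events, windows):
    """Fraction of events that occurred while one of the jobs was running."""
    if not events:
        return 0.0
    return len(events_in_windows(events, windows)) / len(events)


# Placeholder values only; replace with real timestamps from the logs.
job_windows = [(datetime(2017, 2, 18, 1, 0), datetime(2017, 2, 18, 1, 30))]
out_of_sync_events = [
    datetime(2017, 2, 18, 1, 7),
    datetime(2017, 2, 18, 5, 0),
]
print(overlap_ratio(out_of_sync_events, job_windows))
```

A ratio close to 1.0 over several days would support the idea that the heavy jobs trigger the lag, although it would not prove the cause on its own.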