Created on 02-18-2017 01:07 AM - edited 09-16-2022 04:06 AM
Hi,
I have 3 JournalNodes (JNs): 2 on physical servers and the 3rd on a virtual server with 6 vcores.
Recently, from time to time, the JN on the VM falls out of sync for a few seconds. I checked the VM's resources and parameters and nothing looks out of the ordinary. What I do see in the Cloudera Manager metrics is that the journal write bytes are sometimes higher than at other times.
Here is what I see:
The active NameNode was out of sync with this JournalNode.
===============
org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: Can't write txid 1659311573 expecting nextTxId=1659311555
    at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:371)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
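For reading the exception above: the message carries both the txid the NameNode tried to write and the txid this JournalNode expected next, so the size of the gap can be computed directly. Below is a minimal Python sketch, assuming the message wording shown in the trace. The helper name `txid_gap` is made up for illustration and is not part of Hadoop.

```python
import re

# Matches the message wording seen in the trace above; other Hadoop
# versions may phrase it differently (assumption).
_OUT_OF_SYNC = re.compile(r"Can't write txid (\d+) expecting nextTxId=(\d+)")


def txid_gap(message):
    """Parse a JournalOutOfSyncException message.

    Returns (written_txid, expected_txid, gap), or None if the
    message does not match the expected wording.
    """
    m = _OUT_OF_SYNC.search(message)
    if m is None:
        return None
    written, expected = int(m.group(1)), int(m.group(2))
    return written, expected, written - expected


line = ("org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: "
        "Can't write txid 1659311573 expecting nextTxId=1659311555")
print(txid_gap(line))  # (1659311573, 1659311555, 18)
```

For this trace the gap is 18 transactions, which suggests the JN missed a short burst of edits rather than falling far behind.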
Created 03-05-2017 02:26 PM
When I checked the jobs and queries that ran just before the alert on the JN, I found one Hive query that runs over 6 months of data and recreates the Hive table from scratch, which generated a good share of the edit log volume. I contacted the query owner and he reduced the query's window from 6 months to 2 months, which solved the issue for us.
Created 02-20-2017 11:08 PM