Expert Contributor
Posts: 277
Registered: ‎01-25-2017
Accepted Solution

Intermittently one of the journal nodes get out of Sync

Hi,

I have three JournalNodes (JNs): two on physical servers and the third on a virtual server with 6 vcores.

Recently, the JN on the VM has been going out of sync for a few seconds from time to time. I checked the VM's resources and parameters and nothing looks out of the ordinary. What I do see in the Cloudera Manager metrics is that the journal write bytes are sometimes higher than at other times.

Here is what I see:

The active NameNode was out of sync with this JournalNode.

===============

org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: Can't write txid 1659311573 expecting nextTxId=1659311555
	at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:485)
	at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:371)
	at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
	at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
	at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
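
To see how far behind the JN was at that moment, the two transaction IDs in the exception message can be subtracted. The following is only a minimal sketch, assuming the message keeps the exact wording shown in the log above; `txid_gap` is a hypothetical helper name, not part of Hadoop.

```python
import re

# Matches the message text seen in the JournalNode log above.
OUT_OF_SYNC = re.compile(r"Can't write txid (\d+) expecting nextTxId=(\d+)")

def txid_gap(log_line):
    """Return (received, expected, gap), or None if the line does not match."""
    match = OUT_OF_SYNC.search(log_line)
    if match is None:
        return None
    received = int(match.group(1))
    expected = int(match.group(2))
    return received, expected, received - expected

line = ("org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException: "
        "Can't write txid 1659311573 expecting nextTxId=1659311555")
print(txid_gap(line))  # gap is 1659311573 - 1659311555 = 18
```

For the log above the gap is 18 transactions, which suggests the JN was only slightly behind when the write was rejected.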

 

 

 

Posts: 642
Topics: 3
Kudos: 105
Solutions: 67
Registered: ‎08-16-2016

Re: Intermittently one of the journal nodes get out of Sync

Even though the VM looks fine, it is probably a resource constraint on the VM that is causing this issue.

The NameNode writes each edit to its own local directory and to all of the JN edits directories. It simply sounds like the VM isn't keeping up or getting the job done in time.

Examine the contents of the JN edits directory on each node and you will find that the VM does not contain all of the necessary edits. You can manually copy the edits_* files to the VM node to get it back in sync and see if it happens again. I do recommend using the same hardware for all three Master nodes that run the JN and ZK instances. Otherwise, you will often find yourself just barely maintaining the quorum needed to stay running.

dfs.namenode.shared.edits.dir

dfs.journalnode.edits.dir
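
The directory comparison suggested above could be sketched roughly as follows. This is an illustration only: it assumes you have already collected the file listings from each JN's edits directory (for example with `os.listdir`), and that finalized segments are named `edits_<firstTxId>-<lastTxId>`. The function names and the sample file names are hypothetical.

```python
import re

# Finalized segments look like edits_<firstTxId>-<lastTxId>;
# edits_inprogress_* files do not match and are ignored.
SEGMENT = re.compile(r"^edits_(\d+)-(\d+)$")

def finalized_segments(filenames):
    """Return the set of (first_txid, last_txid) pairs found in a listing."""
    segments = set()
    for name in filenames:
        match = SEGMENT.match(name)
        if match:
            segments.add((int(match.group(1)), int(match.group(2))))
    return segments

def missing_segments(reference_files, suspect_files):
    """Segments present on the reference JN but absent on the suspect JN."""
    missing = finalized_segments(reference_files) - finalized_segments(suspect_files)
    return sorted(missing)

# Hypothetical listings from a physical JN and from the VM JN.
physical_jn = [
    "edits_0000000000000000001-0000000000000000010",
    "edits_0000000000000000011-0000000000000000020",
    "edits_inprogress_0000000000000000021",
]
vm_jn = [
    "edits_0000000000000000001-0000000000000000010",
    "edits_inprogress_0000000000000000011",
]
print(missing_segments(physical_jn, vm_jn))
```

An empty result would mean the suspect JN holds every finalized segment the reference JN has.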

Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Intermittently one of the journal nodes get out of Sync

Indeed, it happens for a few seconds and then the VM gets back in sync. It happens from time to time, so I sometimes suspect that a job or Hive query that writes a lot of blocks and files may be causing the issue.

Do you think I should examine this again? Should I check the contents of the files themselves? Do you think migrating the JN role from the VM to a stronger node with 12 vcores could solve the issue?
Posts: 642
Topics: 3
Kudos: 105
Solutions: 67
Registered: ‎08-16-2016

Re: Intermittently one of the journal nodes get out of Sync

I do think that you need to move the JN to the same/similar hardware to what you have the others on.

You don't need to check the contents of the files themselves. Since it is happening every few seconds, the JN is just lagging behind and then catching up. So if you want to run any real loads on the cluster, the JN needs to be moved to better hardware.
Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Intermittently one of the journal nodes get out of Sync

Is it common to run a JN on a DataNode/NodeManager server?

In my cluster, the 2 NNs are physical; the CM server and the application server that hosts MySQL and Oozie are VMs; all the DataNodes are physical.

Posts: 642
Topics: 3
Kudos: 105
Solutions: 67
Registered: ‎08-16-2016

Re: Intermittently one of the journal nodes get out of Sync

No, typically Worker nodes run only the processes that do the work: DataNode, Impala daemon, NodeManager.

In theory, if you have a small cluster, you could put a JN there and keep its edits on the OS disk (not on any HDFS disks), but you will eventually run into contention between the OS, logs, and the edits.

My minimum, for a production cluster and/or HA, is three large, physical servers for the Master.

The DBs (although I prefer to have the HMS DB on the Master nodes as well), gateway roles, CM can all be on VMs.

Where is your third ZK instance? That one will also have IO contention issues on a VM or on a DataNode.
Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Intermittently one of the journal nodes get out of Sync

My 3rd ZK was on the same VM, but after I ran into this issue I moved the ZK to another OpenStack server, moved the Spark History Server to one of the NNs to reduce the load on the VM, and increased the VM to 6 vcores, but I still have the same issue.

Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Intermittently one of the journal nodes get out of Sync

The interesting thing I noticed is that when this happens, some jobs that run once a day are writing a relatively large amount of data to HDFS at the same time, and they run with a large number of reducers, between 400 and 1100. This makes me suspect that the blocks written by these jobs are what cause the VM to lag. I am trying to find a way to prove this.

Posts: 642
Topics: 3
Kudos: 105
Solutions: 67
Registered: ‎08-16-2016

Re: Intermittently one of the journal nodes get out of Sync

That is probably the source of the spike in edits being written to the JNs. You could try to address it to reduce the impact.
Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Intermittently one of the journal nodes get out of Sync

Do you think looking at the size of the edit logs when this occurs would be a good indication?
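
One rough way to look at this, besides file size, is to count the transactions in each finalized segment from its file name and see which segments stand out. This is only a sketch under the assumption that finalized segments are named `edits_<firstTxId>-<lastTxId>`; `txns_per_segment` and the sample names are hypothetical.

```python
import re

# Finalized segments look like edits_<firstTxId>-<lastTxId>.
SEGMENT = re.compile(r"^edits_(\d+)-(\d+)$")

def txns_per_segment(filenames):
    """Return (filename, transaction_count) pairs, largest count first."""
    rows = []
    for name in filenames:
        match = SEGMENT.match(name)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            rows.append((name, last - first + 1))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical listing: the second segment covers far more transactions.
listing = [
    "edits_0000000000000000001-0000000000000000100",
    "edits_0000000000000000101-0000000000000005000",
    "edits_inprogress_0000000000000005001",
]
for name, count in txns_per_segment(listing):
    print(name, count)
```

Segments with unusually high counts could then be matched against the start times of the daily jobs, using the file modification times as an approximate timestamp.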
