I have been facing frequent NameNode (NN) failover and checkpointing issues on my 6-node cluster running on VMs. Mostly, the standby NN remains in a stopped state after a failover until it is started manually. Increasing the QJM timeouts helped with the failover, but the checkpointing issue remains.
I have 3 JournalNodes (JNs), on node02, node03 and node04 (example hostnames). I found multiple edits_inprogress files on one of them (node02), while the other two (node03 and node04) each had one edits_inprogress.empty file. After taking a backup, I deleted the extra edits_inprogress files on node02 (leaving the most recent one) and the edits_inprogress.empty files on the other two, then restarted all the JNs one by one. However, when I checked the current directory on node02 later, it had two edits_inprogress files yet again.
I can't understand this behavior. What is causing this particular JournalNode (node02) to generate more than one edits_inprogress file? Kindly help.
Currently node03 is the active NN.
Last promised epoch on all 3 nodes: 591.
The number of files in /hadoop/hdfs/journal/XXXXcluster/current on each node is as follows:
node02 : 10281
node03 : 10284
node04 : 10284
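For anyone who wants to gather the same numbers on each JournalNode without counting by hand, here is a minimal Python sketch. It assumes it is run locally on each node against the journal's current directory, and that the JournalNode keeps the epoch in a plain-text file named last-promised-epoch in that directory (an assumption about the on-disk layout, so verify it on your version). The function name summarize_journal_dir is made up for illustration.

```python
from pathlib import Path


def summarize_journal_dir(current_dir):
    """Summarize one JournalNode 'current' directory.

    Counts regular files only (subdirectories are skipped, so the total may
    differ slightly from a plain 'ls | wc -l'), lists every file whose name
    starts with 'edits_inprogress' (this also catches '.empty' variants),
    and reads the last promised epoch if the file is present.
    """
    current = Path(current_dir)
    files = [p for p in current.iterdir() if p.is_file()]
    in_progress = sorted(
        p.name for p in files if p.name.startswith("edits_inprogress")
    )
    # Assumption: the epoch is stored as a plain integer in this file.
    epoch_file = current / "last-promised-epoch"
    epoch = int(epoch_file.read_text().strip()) if epoch_file.is_file() else None
    return {
        "total_files": len(files),
        "in_progress": in_progress,
        "last_promised_epoch": epoch,
    }


# Example call (path as given in the post; substitute the real cluster name):
# print(summarize_journal_dir("/hadoop/hdfs/journal/XXXXcluster/current"))
```

More than one entry in in_progress on a node is the condition described above.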
Yes, still experiencing the same problem.
Last night the standby NN (node02) was in a stopped state and I had to start it manually from Ambari. At that point:
Last checkpoint : Tue May 21 2019 17:14:59
NTP time : Wed May 22 2019 01:03:36
As for today:
Last checkpoint : Wed May 22 2019 10:18:33
NTP time : Wed May 22 2019 19:10:52
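To put a number on the checkpoint lag, here is a small Python sketch that computes the age of the last checkpoint from the two pairs of timestamps above. The 3600-second period is the stock default for dfs.namenode.checkpoint.period and is only an assumption here, since I have not confirmed what this cluster is configured with. The parsing relies on English day and month names.

```python
from datetime import datetime

# Timestamp layout used in the post, e.g. "Tue May 21 2019 17:14:59"
FMT = "%a %b %d %Y %H:%M:%S"

# Assumption: default dfs.namenode.checkpoint.period (seconds);
# the actual cluster setting may differ.
CHECKPOINT_PERIOD_S = 3600


def checkpoint_age_seconds(last_checkpoint, now):
    """Return how many seconds passed between the last checkpoint and 'now'."""
    delta = datetime.strptime(now, FMT) - datetime.strptime(last_checkpoint, FMT)
    return int(delta.total_seconds())


samples = [
    ("Tue May 21 2019 17:14:59", "Wed May 22 2019 1:03:36"),
    ("Wed May 22 2019 10:18:33", "Wed May 22 2019 19:10:52"),
]
for last, now in samples:
    age = checkpoint_age_seconds(last, now)
    # Whole checkpoint periods that elapsed without a new fsimage
    missed = age // CHECKPOINT_PERIOD_S
    print(f"{last} -> {now}: {age} s old, about {missed} periods elapsed")
```

Under that assumed period, both samples are several periods overdue, which is consistent with checkpointing not running on the standby.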
node02 is the problem. Proceed with the documentation I compiled earlier and follow it diligently.
Very important: always back up the files on all the nodes before starting the procedure! A simple zip is okay. Then copy the good files from node3 or node1 (/hadoop/hdfs/journal/XXXXcluster/current/*) to the same directory on node2 and follow the steps.
Please report back; I am sure you will be all smiles 🙂
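As an illustration of the backup step only, here is a minimal Python sketch that zips a journal current directory into a timestamped archive. The function name backup_journal_dir and the backup location are made up for the example; run it on each node before touching any files. It does not cover the copy to the problem node or the remaining steps of the procedure.

```python
import shutil
import time
from pathlib import Path


def backup_journal_dir(current_dir, backup_root):
    """Zip the given 'current' directory into backup_root and return the archive path.

    The archive keeps the directory name as its top-level entry, so the
    files can be restored in place later.
    """
    current = Path(current_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = backup_root / f"journal-backup-{stamp}"
    archive = shutil.make_archive(
        str(base_name),
        "zip",
        root_dir=str(current.parent),  # archive paths are relative to this
        base_dir=current.name,         # only this directory is included
    )
    return Path(archive)


# Example call (journal path as in the post; backup path is hypothetical):
# backup_journal_dir("/hadoop/hdfs/journal/XXXXcluster/current", "/tmp/jn-backup")
```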
Completed the steps on node02 and node03, using the healthy set of edits from node04.
(there were 2 edits_inprogress on node02 & one edits_inprogress.empty on node03)
There were some permission issues, which I got through. However, while restarting the HDFS components through Ambari, the NameNode on node03 got stuck and was not able to start. After a pretty long time the operation completed, but the standby NN on node03 was in a stopped state. The errors are as attached (I was unable to upload the file, hence the screenshot).
Both NNs are up and running for now. Observing the cluster for another 24 hours to see if the issue has resolved for the better. Will keep you posted about the checkpointing status.
Thanks a lot for your invaluable help.
Thanks and Regards,
Please keep me posted, but normally there shouldn't be an issue; I have performed these steps on production clusters. I have seen the screenshot, but that edits_inprogress file is dated May 10, while the rest have correct date stamps.
Anyway, let's keep watch.
Checkpointing issue remains. Last checkpoint was done 18 hours back. No fsimage after that.
Also, the good node, i.e. node04, has two edits_inprogress files now. I don't know what to make of this.
The initial screenshot showed a discrepancy in the edits_* files on node02 with last epoch 542, and the new screenshot is from node04. Can you provide a comparison table for the 3 nodes (edits_* files and last epoch)? I would also like to see a screenshot of the Ambari UI showing the status of the primary and standby NameNodes.
To test the failover, stop the active NameNode; it should fail over to the standby. The process can take a few seconds. Once the standby has transitioned to primary, restart the ex-primary, which should then become the standby NameNode.
What is the status of the ZKFailoverControllers?
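For the failover test described above, the state of each NameNode can also be confirmed from the command line with hdfs haadmin -getServiceState before and after stopping the active one. Here is a rough Python wrapper around that command. The service IDs nn1 and nn2 are placeholders, so use the ones from dfs.ha.namenodes.<nameservice> in your hdfs-site.xml. It does not check the ZKFailoverController processes themselves.

```python
import subprocess


def get_service_state(service_id, run=subprocess.run):
    """Return the HA state reported for one NameNode ('active', 'standby', ...).

    'run' is injectable so the function can be exercised without a cluster.
    """
    result = run(
        ["hdfs", "haadmin", "-getServiceState", service_id],
        capture_output=True,
        text=True,
        check=False,
    )
    # Empty output usually means the NameNode could not be reached.
    return result.stdout.strip() or "unknown"


def ha_summary(service_ids, run=subprocess.run):
    """Return (states, healthy); healthy means exactly one active and one standby."""
    states = {sid: get_service_state(sid, run) for sid in service_ids}
    healthy = sorted(states.values()) == ["active", "standby"]
    return states, healthy


# Example call (nn1/nn2 are hypothetical service IDs):
# print(ha_summary(["nn1", "nn2"]))
```

If the pair does not come back as one active and one standby after the ex-primary is restarted, the output of this check is worth including along with the Ambari screenshots.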