Created 04-02-2023 08:10 PM
Hi all, I have HDFS service running on my CDP 7.1.8 private cloud base cluster with Kerberos enabled.
Recently, I ran into two issues with my HDFS NameNode. Here are the screen captures:
The first one
The second one:
When looking into the role log, it shows
Could anyone help me identify the root cause of this issue and how to fix it? Thanks in advance.
Please let me know if I need to provide more information.
Created 04-03-2023 10:51 PM
Hi @BrianChan, both alerts are related. Checkpointing is done by the Standby NameNode; if it is not functioning properly, checkpoints are not performed and you will see those alerts.
Go through the Standby NameNode logs to find out why the checkpoint thread stopped. The Standby NameNode may be down, in which case you may want to restart it to fix this.
Created 04-06-2023 02:15 PM
You will need to perform the checkpoint manually on the faulty node. If the standby NameNode has been faulty for a long time, unmerged edit logs accumulate. When HDFS or the active NameNode is then restarted, the active NameNode has to read a large amount of unmerged edit logs, so the restart takes a long time and can even fail.
Is your NN setup active/standby?
For the steps below, you could also use the CM UI to perform the tasks.
Quickest solution 1
I have had occasions where a simple rolling restart of the ZooKeeper servers resolved this, but I see the checkpoint lag has gone beyond 2 days.
Solution 2
Check which NN has the most up-to-date metadata by comparing the dates of the files in the directory on both NNs.
On the active NN with the latest edit logs, as the hdfs user
Check whether the timestamp of the most recently generated fsimage is close to the current time. If so, the merge ran correctly and is complete.
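The fsimage timestamp check can also be scripted. Below is a minimal sketch in Python, assuming the NameNode metadata directory contains files named `fsimage_<transaction id>` alongside `.md5` companions. The helper names `latest_fsimage` and `is_fresh`, the 5-minute freshness threshold, and the directory you pass in are illustrative assumptions, not something taken from this thread.

```python
import os
import re
import time

# Matches only the image files themselves (fsimage_<txid>);
# .md5 companions and in-progress checkpoint files are skipped.
FSIMAGE_RE = re.compile(r"^fsimage_(\d+)$")


def latest_fsimage(current_dir):
    """Return (path, txid, mtime) for the fsimage with the highest txid, or None."""
    best = None
    for name in os.listdir(current_dir):
        match = FSIMAGE_RE.match(name)
        if not match:
            continue
        path = os.path.join(current_dir, name)
        candidate = (int(match.group(1)), os.path.getmtime(path), path)
        if best is None or candidate > best:
            best = candidate
    if best is None:
        return None
    txid, mtime, path = best
    return path, txid, mtime


def is_fresh(mtime, max_age_seconds=300, now=None):
    """True if the file was modified within max_age_seconds of 'now'."""
    now = time.time() if now is None else now
    return (now - mtime) <= max_age_seconds
```

Run it on each NN against the metadata directory (the path is whatever your cluster is configured with) and compare the reported transaction IDs and timestamps between the two nodes.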
Before restarting HDFS or the active NameNode, perform a checkpoint manually to merge the metadata of the active NameNode.
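One common way to take a manual checkpoint is the `hdfs dfsadmin` sequence of entering safe mode, saving the namespace, and leaving safe mode. The sketch below wraps that sequence in Python. It assumes it runs as the hdfs user with a valid Kerberos ticket, and that leaving safe mode again after a failed save is what you want; the `manual_checkpoint` helper and its `run` parameter are hypothetical names for illustration. Note that HDFS is read-only while in safe mode.

```python
import subprocess

# Standard dfsadmin sequence for a manual checkpoint.
ENTER_SAFEMODE = ["hdfs", "dfsadmin", "-safemode", "enter"]
SAVE_NAMESPACE = ["hdfs", "dfsadmin", "-saveNamespace"]
LEAVE_SAFEMODE = ["hdfs", "dfsadmin", "-safemode", "leave"]


def manual_checkpoint(run=subprocess.run):
    """Enter safe mode, save the namespace, then always leave safe mode.

    'run' is injectable so the sequence can be dry-run or tested
    without a cluster; it defaults to subprocess.run.
    """
    # If entering safe mode fails, nothing else is attempted.
    run(ENTER_SAFEMODE, check=True)
    try:
        # Writes a new fsimage that merges the accumulated edit logs.
        run(SAVE_NAMESPACE, check=True)
    finally:
        # Assumption: restore write access even if the save failed.
        run(LEAVE_SAFEMODE, check=True)
```

Calling `manual_checkpoint()` with no arguments runs the real commands; a failure in any step raises `subprocess.CalledProcessError`, so check the output before restarting anything.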
Then restart the standby. The newly generated files should now be shipped and synced automatically; this could take a while (< 5 minutes), after which your NNs should all be green.