Support Questions

Find answers, ask questions, and share your expertise

Namenode bad health and checkpoint status issue

Rising Star

Hi all, I have the HDFS service running on my CDP 7.1.8 Private Cloud Base cluster with Kerberos enabled.

Recently, I ran into two issues with my HDFS NameNode. Here are the screen captures:

The first one:

[Screenshot: namenode bad health.PNG]

The second one:

[Screenshot: Namenode checkpoint stauts issue.PNG]

Looking into the role log, it shows:

[Screenshot: role log.PNG]

Could anyone point out the root cause of this issue and the solution, please? Thanks in advance.

Please let me know if I need to provide more information.

2 Replies

Super Collaborator

Hi @BrianChan. Both alerts are related: checkpointing is performed by the Standby NameNode, so if it is not functioning properly, checkpoints stop being taken and you will see both of those alerts.

Go through the logs of the Standby NameNode and check why the checkpoint thread has stopped. The Standby NameNode may simply be down, in which case restarting it should fix this.
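
As a starting point, here is a minimal sketch of what to look for in the Standby NameNode's role log. The path and file name below are the usual ones for a CM-managed CDP cluster; the exact name on your host may differ:

$ cd /var/log/hadoop-hdfs
# Recent checkpoint activity (or the point where it stopped)
$ grep -i checkpoint hadoop-cmf-hdfs-NAMENODE-*.log.out | tail -20
# Any fatal errors or exceptions around that time
$ grep -iE 'fatal|error|exception' hadoop-cmf-hdfs-NAMENODE-*.log.out | tail -20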

Master Mentor

@BrianChan 

You will need to perform the checkpoint manually. If the Standby NameNode stays faulty for a long time, the generated edit logs accumulate. This makes HDFS or the Active NameNode take a long time to restart, and the restart could even fail, because on startup the Active NameNode has to read and replay the large amount of unmerged edit log.
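
You can gauge how far behind checkpointing has fallen by counting the accumulated edit segments in the NameNode's name directory (I am assuming the /dfs/nn/current path used in the steps below; adjust to your dfs.namenode.name.dir):

# Number of unmerged edit log segments still on disk
$ ls /dfs/nn/current/ | grep -c '^edits_'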

Is your NN setup active/standby?
For the steps below you could also use the CM UI to perform the tasks.
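
To confirm which NameNode is Active and which is Standby, you can query the HA state. The IDs nn1/nn2 here are placeholders; check your actual dfs.ha.namenodes.<nameservice> values first. Since your cluster is Kerberized, run this with a valid ticket for a user such as hdfs:

$ hdfs getconf -confKey dfs.nameservices
$ hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
$ hdfs haadmin -getServiceState nn2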

Quickest solution 1
I have had occasions when a simple rolling restart of the ZooKeeper quorum would resolve this, but I see your checkpoint lag has grown to more than 2 days.
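
Before and after the rolling restart, you can also ask each NameNode's health monitor directly (same placeholder IDs nn1/nn2 as above); a healthy node returns without an error:

$ hdfs haadmin -checkHealth nn1
$ hdfs haadmin -checkHealth nn2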

Solution 2
Determine which NameNode is the most up to date by comparing the timestamps of the files in the name directory on both nodes. The file names also encode transaction IDs, so the node whose newest edits_*/fsimage_* files carry the highest IDs is the most current.

$ ls -lrt /dfs/nn/current/

On the Active NN (the one with the latest edit logs), run the following as the hdfs user:

$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -saveNamespace

Check whether the timestamp of the newly generated fsimage is the current time. If it is, the merge ran correctly and is complete.
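
A quick way to verify, using the same name directory as above:

# The newest fsimage should carry the current timestamp
$ ls -lt /dfs/nn/current/fsimage_* | head -2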

$ hdfs dfsadmin -safemode leave

Before restarting HDFS or the Active NameNode, always perform this manual checkpoint so the metadata of the Active NameNode is merged.
Then restart the Standby NameNode; the newly generated files should now be shipped and synced automatically. This can take a little while (under 5 minutes), and your NameNodes should then all be green.
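
To confirm the Standby has caught up, compare the newest fsimage on both NameNode hosts (same assumed name directory as above); the transaction IDs in the file names should match once the sync finishes:

# Run on each NameNode host
$ ls -lrt /dfs/nn/current/fsimage_* | tail -1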