Hi @BrianChan, both alerts are related. Checkpointing is done by the Standby NameNode, so if it is not functioning properly, checkpoints are not performed and you will see those alerts.
You can go through the Standby NameNode logs and check why the checkpoint thread has stopped. Maybe the Standby NameNode is down? If so, you may want to restart it to fix this.
You will need to perform the checkpoint manually on the faulty node. If the Standby NameNode is faulty for a long time, the generated edit logs accumulate without being merged. HDFS or the active NN will then take a long time to restart, and could even fail to restart, because on restart the active NameNode has to read a large amount of unmerged edit logs.
Is your NN setup active/standby? For the steps below, you could also use the CM UI to perform the tasks.
Solution 1 (quickest): I have had occasions when a simple rolling restart of the ZKs resolved this, but I see the checkpoint lag has gone to > 2 days.
Solution 2: Check which NN is the most up to date by comparing the dates of the files in the directory on both NNs.
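To make that comparison less error-prone, here is a minimal Python sketch that reports the newest checkpoint image in a directory. It assumes the usual `fsimage_<transaction id>` file naming found under the `current` subdirectory of `dfs.namenode.name.dir`; the function name `latest_fsimage` and the path in the usage note are placeholders of mine, not something from your cluster.

```python
import os
import re

# Assumption: checkpoint images are named fsimage_<transaction id>.
# Requiring a full match skips the companion fsimage_<txid>.md5 files.
FSIMAGE_RE = re.compile(r"fsimage_(\d+)")


def latest_fsimage(name_dir):
    """Return (txid, mtime, filename) of the newest fsimage in name_dir, or None."""
    newest = None
    for entry in os.listdir(name_dir):
        match = FSIMAGE_RE.fullmatch(entry)
        if match is None:
            continue
        txid = int(match.group(1))
        # Modification time in seconds since the epoch
        mtime = os.path.getmtime(os.path.join(name_dir, entry))
        # The highest transaction id is the most recent checkpoint
        if newest is None or txid > newest[0]:
            newest = (txid, mtime, entry)
    return newest
```

Run it on each NN, for example `print(latest_fsimage("/dfs/nn/current"))` with your own path, and compare the results. The NN with the higher transaction id and the later timestamp holds the more recent checkpoint.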
Before restarting HDFS or the active NameNode, perform a checkpoint manually to merge the metadata of the active NameNode. Then restart the standby. The newly generated files should now be shipped and synced automatically; this could take a while (< 5 minutes), after which your NNs should all be green.
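If you prefer the command line over the CM UI, the manual checkpoint is usually done with three `hdfs dfsadmin` commands: enter safe mode, save the namespace, leave safe mode. The Python sketch below only wraps those commands; the wrapper name `manual_checkpoint` and its `dry_run` switch are my own illustration. I am assuming the commands are run as the HDFS superuser on the active NN. HDFS is read-only while in safe mode, so plan for a short write outage.

```python
import subprocess

# dfsadmin sequence for a manual checkpoint: -saveNamespace
# requires the NameNode to be in safe mode.
CHECKPOINT_COMMANDS = [
    ["hdfs", "dfsadmin", "-safemode", "enter"],
    ["hdfs", "dfsadmin", "-saveNamespace"],
    ["hdfs", "dfsadmin", "-safemode", "leave"],
]


def manual_checkpoint(dry_run=True):
    """Print the checkpoint commands; execute them only when dry_run is False."""
    for command in CHECKPOINT_COMMANDS:
        print(" ".join(command))
        if not dry_run:
            # check=True aborts the sequence if a step fails. In that case
            # the NameNode may still be in safe mode and has to be taken
            # out of it by hand once the problem is understood.
            subprocess.run(command, check=True)
    return CHECKPOINT_COMMANDS
```

Calling `manual_checkpoint()` only prints the commands so you can review them or paste them into a shell; `manual_checkpoint(dry_run=False)` actually runs them.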