In my setup, I have to run the checkpoint manually every time from the NameNode terminal.
I am following this article for the same.
The checkpoint delay as shown in the attachment is 3600 seconds.
I keep getting alerts that checkpointing is not done. Clearly, it is not happening automatically.
Am I missing anything here? How do we automate checkpointing?
If you keep getting this "NameNode Last Checkpoint" alert, it would be good to check whether the NameNode is healthy. For example, do you see long GC pause messages or any other warnings in the NameNode logs? Was there heavy load on the system when the scheduled checkpoint was supposed to happen?
Normally Ambari reports alerts when the underlying system is not healthy, so it is better to check the NameNode logs, heap usage, GC pauses, etc.
What is the heap size of your NameNode? Is it in line with the calculation listed here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_command-line-installation/content/config...
NameNode heap size depends on many factors, such as the number of files, the number of blocks, and the load on the system.
Do you see any warning / error while running the following commands manually on your own?
# su - hdfs
# hdfs dfsadmin -safemode enter
# hdfs dfsadmin -saveNamespace
# hdfs dfsadmin -safemode leave
We can run the above commands as a cron job; however, it is better to first check whether you get any warning / error while running them manually.
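If the manual run is clean, the same three commands could be wrapped for cron along these lines. This is only a sketch: the function name force_checkpoint and the HDFS_BIN override are my own placeholders (not part of HDFS), and it assumes it runs as the hdfs user on the NameNode host.

```shell
#!/usr/bin/env bash
# Sketch: force an HDFS checkpoint and always try to leave safe mode afterwards.
# HDFS_BIN is a hypothetical override so the function can be dry-run with a stub.
HDFS_BIN="${HDFS_BIN:-hdfs}"

force_checkpoint() {
    # Enter safe mode; give up early if that fails.
    "$HDFS_BIN" dfsadmin -safemode enter || return 1

    # Save the namespace (writes a new fsimage) and remember any failure.
    local rc=0
    "$HDFS_BIN" dfsadmin -saveNamespace || rc=$?

    # Leave safe mode even if saveNamespace failed, so HDFS is not left read-only.
    "$HDFS_BIN" dfsadmin -safemode leave || rc=$?

    return "$rc"
}
```

A crontab entry for the hdfs user could then source this file and call force_checkpoint at a quiet hour. Keep in mind that HDFS rejects writes while safe mode is on, so the schedule matters.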
You can also try reducing the "dfs.namenode.checkpoint.txns" value to something a little lower, like 100000, and then check whether that helps clear the alerts. The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every 'dfs.namenode.checkpoint.txns' transactions, regardless of whether 'dfs.namenode.checkpoint.period' has expired.
However, such tuning depends on your use case: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
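Before changing anything, it may help to confirm which values are actually in effect for the two properties mentioned above. A minimal sketch using hdfs getconf; the function name and the HDFS_BIN override are placeholders of mine, and the values printed are whatever the local HDFS configuration on that host resolves to.

```shell
# Sketch: print the checkpoint settings as resolved by this host's HDFS config.
# HDFS_BIN is a hypothetical override, used only to allow a dry run with a stub.
HDFS_BIN="${HDFS_BIN:-hdfs}"

show_checkpoint_settings() {
    for key in dfs.namenode.checkpoint.period dfs.namenode.checkpoint.txns; do
        # "hdfs getconf -confKey <key>" prints the value of one configuration key.
        printf '%s=%s\n' "$key" "$("$HDFS_BIN" getconf -confKey "$key")"
    done
}
```

Running show_checkpoint_settings on the NameNode host and on the node that performs the checkpoints shows whether both see the same period and transaction count.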
Thanks @Jay Kumar SenSharma
If you keep getting this "NameNode Last Checkpoint" alert then it will be good to check if the NameNode is healthy? Like do you see long GC pause messages in the NameNode logs or any other warning? Was there a heavy load on the system when the scheduled checkpoint was supposed to be happening?
==> No. I have been seeing this problem since day 1 of the setup (a few months now), and the system looks stable too.
I checked the heap anyway, and the setup follows the standard mentioned in the link.
The problem is Ambari itself is not doing the checkpoint (Assuming Ambari to do it). I can do it with manual commands successfully.
Am I missing anything else here?
Regarding your query: "The problem is Ambari itself is not doing the checkpoint (Assuming Ambari to do it)."
>>>> Ambari is not responsible for performing HDFS checkpointing; it only raises an alert if a checkpoint did not happen. The alert you are getting simply checks the time of the last HDFS checkpoint and reports on it.
The "NameNode Last Checkpoint" alert is triggered when too much time has elapsed since the last NameNode checkpoint. We see this alert if the last NameNode checkpoint was performed too long ago or if the number of uncommitted transactions is beyond a certain threshold.
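To make the two conditions concrete, here is a rough sketch of that kind of check. It is not Ambari's actual alert code; the function name and all five arguments (including both limits) are hypothetical values supplied by the caller.

```shell
# Sketch of the two alert conditions described above; not Ambari's implementation.
# Returns 0 (true) when a checkpoint looks overdue, 1 otherwise.
checkpoint_overdue() {
    last_checkpoint_epoch="$1"   # time of the last checkpoint, seconds since epoch
    now_epoch="$2"               # current time, seconds since epoch
    max_age_seconds="$3"         # hypothetical limit on time since last checkpoint
    uncommitted_txns="$4"        # transactions since the last checkpoint
    max_txns="$5"                # hypothetical limit on uncommitted transactions

    # Condition 1: too much time has elapsed since the last checkpoint.
    if [ $((now_epoch - last_checkpoint_epoch)) -gt "$max_age_seconds" ]; then
        return 0
    fi
    # Condition 2: too many uncommitted transactions have accumulated.
    if [ "$uncommitted_txns" -gt "$max_txns" ]; then
        return 0
    fi
    return 1
}
```

Either condition alone is enough, which is why the alert can fire on a busy cluster even when the last checkpoint is fairly recent.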
Checkpointing is controlled by HDFS configuration properties such as dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns, so if it is not happening at regular intervals we will have to look at the NN logs, GC logs, and settings.
You could go to the NameNode current folder and check when the last fsimage was created.
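One way to do that check from the shell is sketched below. The function names are mine, the directory you pass in is whatever dfs.namenode.name.dir points to on your NameNode plus /current, and stat -c %Y assumes GNU coreutils (Linux).

```shell
# Sketch: find the newest fsimage in a NameNode "current" directory and its age.
latest_fsimage() {
    dir="$1"
    # List fsimage_<txid> files newest first, skipping the .md5 companion files.
    ls -t "$dir"/fsimage_* 2>/dev/null | grep -v '\.md5$' | head -n 1
}

fsimage_age_seconds() {
    # Age = now minus the file's modification time (GNU stat syntax).
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $((now - mtime))
}
```

If the age reported for the newest fsimage is much larger than the configured checkpoint period, checkpoints are not being produced on schedule and the NameNode and checkpointing node logs are the next place to look.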