How to automate manual checkpointing on namenode

New Contributor

In my setup, I have to run the checkpoint manually from the NameNode terminal every time.

I am following this article to do so.

The checkpoint delay as shown in the attachment is 3600 seconds.

I keep getting alerts that checkpointing has not been done. Clearly, it is not happening automatically.

Am I missing anything here? How do we automate checkpointing?

77798-screen-shot-2018-06-21-at-25553-pm.png

77797-screen-shot-2018-06-21-at-24136-pm.png

4 REPLIES 4

Master Mentor

@Utkarsh Jadhav

If you keep getting this "NameNode Last Checkpoint" alert, it would be good to check whether the NameNode is healthy. For example, do you see long GC pause messages or any other warnings in the NameNode logs? Was there a heavy load on the system when the scheduled checkpoint was supposed to happen?

Normally Ambari reports alerts when the underlying system is not healthy, so it is better to check the NameNode logs, heap usage, GC pauses, etc.

What is the heap size of your NameNode? Is it in line with the calculation listed here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_command-line-installation/content/config...

NameNode heap size depends on many factors, such as the number of files, the number of blocks, and the load on the system.

Do you see any warning or error when you run the following commands manually?

# su - hdfs
# hdfs dfsadmin -safemode enter  
# hdfs dfsadmin -saveNamespace  
# hdfs dfsadmin -safemode leave 

We can run the above commands as a cron job; however, it is better to first check whether you get any warning or error when running them manually.
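As a rough sketch of the cron idea, the three commands could be wrapped in a small script like the one below. The function name `force_checkpoint`, the `HDFS_CMD` variable, and the `--run` flag are hypothetical names chosen for this illustration; only the `hdfs dfsadmin` subcommands come from the reply above. Keep in mind that HDFS is read-only while the NameNode is in safe mode, so this is a workaround under those assumptions rather than a substitute for fixing automatic checkpointing.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual checkpoint commands.
# HDFS_CMD is an illustrative override; it defaults to the real "hdfs" CLI.
HDFS_CMD="${HDFS_CMD:-hdfs}"

force_checkpoint() {
    # Enter safe mode; give up if that fails.
    "$HDFS_CMD" dfsadmin -safemode enter || return 1

    # Write a new fsimage and remember whether it succeeded.
    "$HDFS_CMD" dfsadmin -saveNamespace
    status=$?

    # Always try to leave safe mode, even if saveNamespace failed.
    "$HDFS_CMD" dfsadmin -safemode leave || return 1
    return "$status"
}

# Only run when invoked with the (hypothetical) --run flag.
if [ "${1:-}" = "--run" ]; then
    force_checkpoint
fi
```

Saved under a placeholder path such as `/path/to/force_checkpoint.sh`, it could be scheduled from the hdfs user's crontab with an entry like `0 * * * * /path/to/force_checkpoint.sh --run`.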

You can also try reducing the "dfs.namenode.checkpoint.txns" value to a somewhat lower value, such as 100000, and then check whether that helps clear the alerts. The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every 'dfs.namenode.checkpoint.txns' transactions, regardless of whether 'dfs.namenode.checkpoint.period' has expired.

However, such tuning depends on your use case: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

New Contributor

This suggestion works for me.

 

New Contributor

Thanks @Jay Kumar SenSharma


If you keep getting this "NameNode Last Checkpoint" alert then it will be good to check if the NameNode is healthy? Like do you see long GC pause messages in the NameNode logs or any other warning? Was there a heavy load on the system when the scheduled checkpoint was supposed to be happening?

==> No. I have been seeing this problem since day 1 of the setup (a few months ago), and the system looks stable too.

I checked the heap anyway, and the setup follows the standard mentioned in the link.

The problem is Ambari itself is not doing the checkpoint (Assuming Ambari to do it). I can do it with manual commands successfully.

Am I missing anything else here?

Master Mentor

@Utkarsh Jadhav

Regarding your query: "The problem is Ambari itself is not doing the checkpoint (Assuming Ambari to do it)."

>>>> Ambari is not responsible for HDFS checkpointing; it can only alert when a checkpoint has not happened. The alert you are getting simply checks the time of the last HDFS checkpoint and reports on it.

The "NameNode Last Checkpoint" alert is triggered when too much time has elapsed since the last NameNode checkpoint, or when the number of uncommitted transactions is beyond a certain threshold.

Checkpointing is controlled by the following HDFS configuration properties, so if it is not happening at regular intervals we will have to look at the NN logs, GC logs, and settings.

  • dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two consecutive checkpoints
  • dfs.namenode.checkpoint.txns, set to 1 million by default, defines the number of uncheckpointed transactions on the NameNode which will force an urgent checkpoint, even if the checkpoint period has not been reached.
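To make the interaction of these two properties concrete, here is a minimal sketch of the "whichever comes first" rule. The function `checkpoint_due` and its argument order are made up for illustration and are not part of Hadoop; the defaults mirror the 1 hour and 1 million values listed above.

```shell
# Hypothetical helper: succeeds (exit 0) when a checkpoint would be due.
checkpoint_due() {
    elapsed_secs=$1            # seconds since the last checkpoint
    pending_txns=$2            # transactions since the last checkpoint
    period_secs=${3:-3600}     # dfs.namenode.checkpoint.period
    txns_limit=${4:-1000000}   # dfs.namenode.checkpoint.txns

    # Due when either the period or the transaction limit is reached.
    [ "$elapsed_secs" -ge "$period_secs" ] || [ "$pending_txns" -ge "$txns_limit" ]
}
```

For example, `checkpoint_due 600 1500000` succeeds because the transaction limit is exceeded even though only ten minutes have passed. The values actually in effect on a cluster can be read with `hdfs getconf -confKey dfs.namenode.checkpoint.period` and `hdfs getconf -confKey dfs.namenode.checkpoint.txns`.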

You could go to the NameNode current folder and check when the last fsimage was created.
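One possible way to do that check from the shell is sketched below. The helper name `latest_fsimage` is hypothetical, and it assumes you pass it the `current` subdirectory of one of the directories configured in `dfs.namenode.name.dir`.

```shell
# Hypothetical helper: print the most recently modified fsimage file
# in the given NameNode "current" directory.
latest_fsimage() {
    dir=$1
    # ls -t lists newest first; the .md5 checksum files are skipped.
    ls -t "$dir"/fsimage_* 2>/dev/null | grep -v '\.md5$' | head -n 1
}
```

Running `ls -l "$(latest_fsimage /path/to/namenode/dir/current)"` (with a placeholder path) then shows the modification time of the newest fsimage, which can be compared against the configured checkpoint period.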