Created 03-20-2018 10:31 AM
Hi,
I'm seeing a checkpoint alert in a NameNode HA environment. It says the last checkpoint was completed 22 hours ago.
I'm currently running the checkpoint manually from the command line. How can I make it happen automatically, and how can we suppress these checkpoint alerts in the UI?
Created 03-22-2018 06:20 AM
It's working now. The checkpointing period is 6-7 hours, and the NN was down during that period.
Thanks
Created 03-20-2018 11:04 AM
Ambari relies on a NameNode JMX call to find out the "LastCheckpointTime".
The alert script does something like this: https://github.com/apache/ambari/blob/trunk/ambari-server/src/main/resources/common-services/HDFS/2....
You can run the same query yourself (the -s flag only hides curl's progress meter):
# curl -s "http://hdfcluster1.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" | grep 'LastCheckpointTime'
For example, if the above JMX call returns the epoch time '1521523640579' (a value in milliseconds), drop the last three digits to get seconds and convert that to a human-readable time to find out when the last checkpoint actually happened on the NameNode.
# date -d @1521523640
NOTE-1: If the clocks on your Ambari cluster hosts are not in sync, the last checkpoint computation may come out wrong.
NOTE-2: Every cluster node (including the Ambari Server host) should be able to resolve the NameNode JMX URL. Otherwise, the host where the alert is executed may not be able to make the JMX call to the NN, and the alert may report unknown results.
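The manual check above can be wrapped in two small helpers. This is only a sketch, assuming bash and GNU date (for `-d @...`); the function names `last_checkpoint_ms` and `ms_to_utc` are made up for illustration and are not part of Ambari or Hadoop.

```shell
#!/usr/bin/env bash

# Read NameNode JMX JSON on stdin and print the LastCheckpointTime value
# (epoch milliseconds). Hypothetical helper name.
last_checkpoint_ms() {
  grep -Eo '"LastCheckpointTime"[[:space:]]*:[[:space:]]*[0-9]+' | grep -Eo '[0-9]+$'
}

# Convert epoch milliseconds to a UTC timestamp. Integer division by 1000
# drops the last three digits, because date expects seconds.
ms_to_utc() {
  date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

# Example against a live NameNode (host name taken from the post above):
# ms=$(curl -s "http://hdfcluster1.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" | last_checkpoint_ms)
# ms_to_utc "$ms"
```

Printing in UTC avoids confusion when the hosts are configured with different time zones.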
Created 03-20-2018 11:12 AM
Below is the output I got:
[root@slave0 centos]# curl "http://slave1.dl.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" | grep 'LastCheckpointTime'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1427    0  1427    0     0   183k      0 --:--:-- --:--:-- --:--:--  199k
"LastCheckpointTime" : 1521460953000,
[root@slave0 centos]# date -d @1521460953000
Fri Mar 7 00:00:00 IST 50183
It may be that the automatic checkpoint is not happening, even though time is in sync between the servers.
Created 03-20-2018 11:38 AM
In your epoch time command, please remove the last 3 digits to get the accurate date (the JMX value is in milliseconds, while date expects seconds):
# date -d @1521460953
Mon Mar 19 12:02:33 UTC 2018
Created 03-20-2018 11:41 AM
So if your NameNode reports a LastCheckpointTime of around "Mon Mar 19 12:02:33 UTC 2018", then Ambari is probably showing the right alert: "Last Checkpoint: [22 hours, 19 minutes, 45507 transactions]".
So you should check on the NameNode side whether checkpointing is failing to happen at regular intervals. Also, please check the following property value and the NameNode log for any checkpointing-related warnings or errors.
Specifies the number of seconds between two periodic checkpoints.
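To compare the alert text with what the NameNode reports, the elapsed time can be computed from LastCheckpointTime directly. This is a rough sketch, assuming bash; `since_checkpoint` is a hypothetical helper name. The property name in the commented-out command is also an assumption: the post does not name the property, and `dfs.namenode.checkpoint.period` is only the one whose description matches.

```shell
#!/usr/bin/env bash

# Print the time elapsed since a checkpoint.
#   $1: LastCheckpointTime in epoch milliseconds (as returned by JMX)
#   $2: "now" in epoch seconds (optional, defaults to the current time)
since_checkpoint() {
  local last_ms=$1
  local now_s=${2:-$(date +%s)}
  local elapsed=$(( now_s - last_ms / 1000 ))
  printf '%d hours, %d minutes\n' $(( elapsed / 3600 )) $(( (elapsed % 3600) / 60 ))
}

# Assumption: the checkpoint interval property is dfs.namenode.checkpoint.period.
# On a host with the HDFS client installed, its value in seconds can be read with:
# hdfs getconf -confKey dfs.namenode.checkpoint.period
```

If the elapsed time is much larger than the configured period, the NameNode that should be doing the checkpoints (and its log) is the place to look.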