Where can I find the time since the last checkpoint in a HA setup?

Expert Contributor

Hey guys,

I have a question about monitoring/operating HDFS:

The time since the last checkpoint is one of the metrics I want to keep an eye on, i.e. the time since the edits and fsimage were last consolidated into a new fsimage.

In a non-HA environment, you can easily find the time on the Secondary NameNode web UI at snn-address:50090.

My question is where to find this information in an HA environment. Neither the Active NameNode nor the Standby NameNode seems to show similar information.

Best,

Benjamin

1 ACCEPTED SOLUTION

Guru

You are right that it is not on the NN UI.

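But you can get this from JMX: the FSNamesystem bean exposes LastCheckpointTime as epoch milliseconds. In the output below, taken from the Standby NameNode, it shows 1462223090012, i.e. 2016-05-02 21:04:50 UTC.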

{
    "name" : "Hadoop:service=NameNode,name=FSNamesystem",
    "modelerType" : "FSNamesystem",
    "tag.Context" : "dfs",
    "tag.HAState" : "standby",
    "tag.TotalSyncTimes" : "",
    "tag.Hostname" : "demo2.cloud.hortonworks.com",
    "MissingBlocks" : 0,
    "MissingReplOneBlocks" : 0,
    "ExpiredHeartbeats" : 0,
    "TransactionsSinceLastCheckpoint" : -8853,
    "TransactionsSinceLastLogRoll" : 0,
    "LastWrittenTransactionId" : 50372,
    "LastCheckpointTime" : 1462223090012,
    "CapacityTotal" : 44338987008,
    "CapacityTotalGB" : 41.0,
    "CapacityUsed" : 4014164154,
    "CapacityUsedGB" : 4.0,
    "CapacityRemaining" : 13009718052,
    "CapacityRemainingGB" : 12.0,
    "CapacityUsedNonDFS" : 27315104802,
    "TotalLoad" : 22,
    "SnapshottableDirectories" : 0,
    "Snapshots" : 0,
    "LockQueueLength" : 0,
    "BlocksTotal" : 1145,
    "NumFilesUnderConstruction" : 4,
    "NumActiveClients" : 4,
    "FilesTotal" : 1385,
    "PendingReplicationBlocks" : 0,
    "UnderReplicatedBlocks" : 0,
    "CorruptBlocks" : 0,
    "ScheduledReplicationBlocks" : 0,
    "PendingDeletionBlocks" : 0,
    "ExcessBlocks" : 0,
    "PostponedMisreplicatedBlocks" : 0,
    "PendingDataNodeMessageCount" : 4,
    "MillisSinceLastLoadedEdits" : 39983,
    "BlockCapacity" : 2097152,
    "StaleDataNodes" : 0,
    "TotalFiles" : 1385,
    "TotalSyncCount" : 0
  }
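
If you want to poll this value from a script, here is a minimal sketch in Python. It assumes the NameNode serves its metrics through the /jmx servlet on the web UI port and that the response wraps the bean in a "beans" list; port 50070 is taken from the upload URL in the Standby log in this answer, and the function names are only illustrative. Adjust host, port and scheme (HTTPS, Kerberos) for your cluster.

```python
import json
import time
import urllib.request

# Bean name as it appears in the "name" field of the JMX output.
JMX_QUERY = "Hadoop:service=NameNode,name=FSNamesystem"


def fetch_fsnamesystem(host, port=50070, timeout=10):
    """Fetch the FSNamesystem bean from a NameNode's /jmx servlet.

    Assumption: plain HTTP without authentication on the web UI port.
    """
    url = "http://{}:{}/jmx?qry={}".format(host, port, JMX_QUERY)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    # The servlet returns {"beans": [ {...} ]}; the query matches one bean.
    return payload["beans"][0]


def seconds_since_checkpoint(bean, now_ms=None):
    """Return the age of the last checkpoint in seconds.

    LastCheckpointTime is in epoch milliseconds, so it is compared
    against the current time in milliseconds.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - bean["LastCheckpointTime"]) / 1000.0
```

Calling fetch_fsnamesystem("demo2.cloud.hortonworks.com") and passing the result to seconds_since_checkpoint() gives the value to alert on. The "tag.HAState" field in the same bean tells you which NameNode answered, so you can run the check against both NameNodes.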

You could also monitor the Standby NameNode log for this. Below is the log from the Standby NN during a checkpoint.

2016-05-02 17:04:49,810 INFO  ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(336)) - Triggering checkpoint because it has been 21600 seconds since the last checkpoint, which exceeds the configured interval 21600
2016-05-02 17:04:49,810 INFO  namenode.FSImage (FSImage.java:saveNamespace(1090)) - Save namespace ...
2016-05-02 17:04:50,014 INFO  namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(203)) - Going to retain 2 images with txid >= 50371
2016-05-02 17:04:50,187 INFO  namenode.TransferFsImage (TransferFsImage.java:setTimeout(443)) - Image Transfer timeout configured to 60000 milliseconds
2016-05-02 17:04:50,339 WARN  namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2016-05-02 17:04:50,363 INFO  namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(237)) - Uploaded image with txid 59225 to namenode at http://demo1.cloud.hortonworks.com:50070 in 0.222 seconds
2016-05-02 17:04:53,644 WARN  namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2016-05-02 17:05:19,677 INFO  ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(271)) - Triggering log roll on remote NameNode xomdemo1.cloud.hortonworks.com/172.24.64.97:8020
2016-05-02 17:05:21,042 INFO  namenode.FSImage (FSImage.java:loadEdits(834)) - Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6fba8d60 expecting start txid #59226 
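
If you go the log route, a rough sketch of how to pull the last completed checkpoint out of the log could look like this. It assumes the "Uploaded image with txid ..." line keeps the format shown above (the wording may differ between Hadoop versions), and the function name is made up for illustration. Note that the timestamps in the log are in the NameNode's local time zone.

```python
import re
from datetime import datetime

# Matches the TransferFsImage line that reports a finished image upload,
# in the format shown in the log excerpt above.
UPLOAD = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) .*"
    r"Uploaded image with txid (\d+) to namenode at (\S+)"
)


def last_uploaded_checkpoint(lines):
    """Return (timestamp, txid, target) for the last upload line, or None.

    The timestamp is a naive datetime in the log's local time zone.
    """
    result = None
    for line in lines:
        m = UPLOAD.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            ts = ts.replace(microsecond=int(m.group(2)) * 1000)
            result = (ts, int(m.group(3)), m.group(4))
    return result
```

You would feed it the lines of the Standby NN log file, for example by opening the file and passing the file object directly.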

Expert Contributor

Thanks, that clears it up!

A follow-up question: is there a way to create an Ambari widget that tells me how long ago the last checkpoint happened? I see that I can create a widget showing the LastCheckpointTime from JMX, but that number is not really intuitive to our administrators.

Guru

As far as I know, you cannot add custom conversion logic (like converting from epoch to date). But you might get more eyes watching and better results if you open a new question for this and tag it with ambari and ambari-metrics.
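
Outside of Ambari, the conversion itself is simple. As a sketch only (this is not an Ambari feature, and the function name is hypothetical), a small Python helper could turn the raw LastCheckpointTime value into something an administrator can read:

```python
from datetime import datetime, timezone


def describe_checkpoint(last_checkpoint_ms, now_ms):
    """Format an epoch-millisecond checkpoint time as a UTC date plus age."""
    when = datetime.fromtimestamp(last_checkpoint_ms / 1000.0, tz=timezone.utc)
    # Whole seconds since the checkpoint; clamp at 0 in case of clock skew.
    age_s = max(0, (now_ms - last_checkpoint_ms) // 1000)
    hours, rem = divmod(age_s, 3600)
    minutes = rem // 60
    return "{} UTC ({}h {}m ago)".format(
        when.strftime("%Y-%m-%d %H:%M:%S"), hours, minutes
    )
```

Here now_ms is the current time in epoch milliseconds, e.g. int(time.time() * 1000). The output could then be pushed into whatever dashboard or alerting tool you use next to Ambari.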