Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Where can I find the time since the last checkpoint in an HA setup?

Expert Contributor

Hey guys,

I have a question about monitoring and operating HDFS:

The time since last checkpoint is one of the metrics I want to keep an eye on, i.e. the time since the edits and fsimage were last consolidated to a new fsimage.

In a non-HA environment, you can easily find the time on the Secondary Namenode WebUI at snn-address:50090.

My question is where to find this information in an HA environment. Neither the Active NameNode nor the Standby NameNode seems to show anything similar.

Best,

Benjamin

1 ACCEPTED SOLUTION

Guru

You are right that it is not on the NN UI.

But you can get this from JMX: the LastCheckpointTime attribute of the FSNamesystem bean, which shows 1462223090012 (milliseconds since the Unix epoch) in the output below.

{
    "name" : "Hadoop:service=NameNode,name=FSNamesystem",
    "modelerType" : "FSNamesystem",
    "tag.Context" : "dfs",
    "tag.HAState" : "standby",
    "tag.TotalSyncTimes" : "",
    "tag.Hostname" : "demo2.cloud.hortonworks.com",
    "MissingBlocks" : 0,
    "MissingReplOneBlocks" : 0,
    "ExpiredHeartbeats" : 0,
    "TransactionsSinceLastCheckpoint" : -8853,
    "TransactionsSinceLastLogRoll" : 0,
    "LastWrittenTransactionId" : 50372,
    "LastCheckpointTime" : 1462223090012,
    "CapacityTotal" : 44338987008,
    "CapacityTotalGB" : 41.0,
    "CapacityUsed" : 4014164154,
    "CapacityUsedGB" : 4.0,
    "CapacityRemaining" : 13009718052,
    "CapacityRemainingGB" : 12.0,
    "CapacityUsedNonDFS" : 27315104802,
    "TotalLoad" : 22,
    "SnapshottableDirectories" : 0,
    "Snapshots" : 0,
    "LockQueueLength" : 0,
    "BlocksTotal" : 1145,
    "NumFilesUnderConstruction" : 4,
    "NumActiveClients" : 4,
    "FilesTotal" : 1385,
    "PendingReplicationBlocks" : 0,
    "UnderReplicatedBlocks" : 0,
    "CorruptBlocks" : 0,
    "ScheduledReplicationBlocks" : 0,
    "PendingDeletionBlocks" : 0,
    "ExcessBlocks" : 0,
    "PostponedMisreplicatedBlocks" : 0,
    "PendingDataNodeMessageCount" : 4,
    "MillisSinceLastLoadedEdits" : 39983,
    "BlockCapacity" : 2097152,
    "StaleDataNodes" : 0,
    "TotalFiles" : 1385,
    "TotalSyncCount" : 0
  }
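To turn that attribute into "time since last checkpoint", something like the following might work. It is only a sketch: it assumes the NameNode web port serves the standard /jmx servlet with a qry parameter and wraps results in a "beans" list, and the host and port you pass in (for example http://nn-host:50070) are placeholders for your own cluster. The function names are made up for illustration.

```python
import json
import time
import urllib.request

# Bean name copied from the JMX output above.
JMX_QUERY = "Hadoop:service=NameNode,name=FSNamesystem"


def seconds_since_checkpoint(bean, now_ms=None):
    """Return seconds elapsed since the bean's LastCheckpointTime (epoch ms)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - bean["LastCheckpointTime"]) / 1000.0


def fetch_fsnamesystem_bean(namenode_http, timeout=10):
    """Fetch the FSNamesystem bean from a NameNode's /jmx servlet.

    namenode_http is a placeholder such as "http://nn-host:50070".
    """
    url = f"{namenode_http}/jmx?qry={JMX_QUERY}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumption: the servlet returns {"beans": [ {...} ]} for a matching query.
    return payload["beans"][0]


# Example use against a live cluster (hypothetical host):
#   bean = fetch_fsnamesystem_bean("http://nn-host:50070")
#   print(bean["tag.HAState"], seconds_since_checkpoint(bean))
```

In an HA pair you would probably query both NameNodes and alert if the value grows well beyond the configured checkpoint interval (21600 seconds in the log further down).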

You could also monitor the Standby NN log for this. Below is the log from the Standby NN during a checkpoint.

2016-05-02 17:04:49,810 INFO  ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(336)) - Triggering checkpoint because it has been 21600 seconds since the last checkpoint, which exceeds the configured interval 21600
2016-05-02 17:04:49,810 INFO  namenode.FSImage (FSImage.java:saveNamespace(1090)) - Save namespace ...
2016-05-02 17:04:50,014 INFO  namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(203)) - Going to retain 2 images with txid >= 50371
2016-05-02 17:04:50,187 INFO  namenode.TransferFsImage (TransferFsImage.java:setTimeout(443)) - Image Transfer timeout configured to 60000 milliseconds
2016-05-02 17:04:50,339 WARN  namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2016-05-02 17:04:50,363 INFO  namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(237)) - Uploaded image with txid 59225 to namenode at http://demo1.cloud.hortonworks.com:50070 in 0.222 seconds
2016-05-02 17:04:53,644 WARN  namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2016-05-02 17:05:19,677 INFO  ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(271)) - Triggering log roll on remote NameNode xomdemo1.cloud.hortonworks.com/172.24.64.97:8020
2016-05-02 17:05:21,042 INFO  namenode.FSImage (FSImage.java:loadEdits(834)) - Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6fba8d60 expecting start txid #59226 
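If you prefer watching the log, a rough sketch like this could pull out the most recent completed checkpoint. It assumes the log line layout shown above (timestamp, level, logger, message) and keys on the "Uploaded image with txid" message; your log4j pattern or Hadoop version may differ, so treat the regex and the function name as illustrative only.

```python
import re
from datetime import datetime

# Matches the TransferFsImage upload line from the Standby NN log above.
UPLOAD_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO\s+"
    r"namenode\.TransferFsImage .*Uploaded image with txid (\d+)"
)


def last_checkpoint_from_log(lines):
    """Return (timestamp, txid) of the last 'Uploaded image' line, or None."""
    last = None
    for line in lines:
        m = UPLOAD_RE.match(line)
        if m:
            # Timestamps are in the NameNode's local time zone.
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            last = (ts, int(m.group(2)))
    return last
```

You would feed it the lines of the Standby NN log file, e.g. `last_checkpoint_from_log(open(path))`, where path is wherever your distribution writes the NameNode log.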

Expert Contributor

Thanks, that clears it up!

A follow-up question: is there a way to create an Ambari widget that tells me how long ago the last checkpoint happened? I can create a widget showing LastCheckpointTime from JMX, but that number is not really intuitive to our administrators.

Guru

As far as I know, you cannot add custom conversion logic (like converting from epoch to date). But you might get more eyes watching and better results if you open a new question for this and tag it with ambari and ambari-metrics.
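Outside of Ambari (say, in a small monitoring script), the conversion itself is short. This is just a sketch of turning the epoch-millisecond value into something readable; the function name and output format are my own choices, not anything Ambari provides.

```python
from datetime import datetime, timezone


def describe_checkpoint(last_checkpoint_ms, now_ms):
    """Render an epoch-millisecond checkpoint time as a UTC date plus its age."""
    when = datetime.fromtimestamp(last_checkpoint_ms // 1000, tz=timezone.utc)
    # Clamp at zero in case of clock skew between hosts.
    age_s = max(0, (now_ms - last_checkpoint_ms) // 1000)
    hours, rem = divmod(age_s, 3600)
    minutes = rem // 60
    return f"{when:%Y-%m-%d %H:%M:%S} UTC ({hours}h {minutes}m ago)"


# Value from the JMX output earlier in the thread, checked 90 minutes later.
print(describe_checkpoint(1462223090012, 1462223090012 + 90 * 60 * 1000))
# prints "2016-05-02 21:04:50 UTC (1h 30m ago)"
```

The UTC date lines up with the 17:04:50 local timestamps in the Standby NN log, which suggests that host was running at UTC-4.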