Created 05-02-2016 09:52 AM
Hey guys,
I have a question concerning monitoring/operating HDFS:
The time since last checkpoint is one of the metrics I want to keep an eye on, i.e. the time since the edits and fsimage were last consolidated to a new fsimage.
In a non-HA environment, you can easily find the time on the Secondary Namenode WebUI at snn-address:50090.
My question is where to find this information in an HA environment. Neither the Active NameNode nor the Standby NameNode seems to show similar information.
Best,
Benjamin
Created 05-03-2016 02:11 AM
You are right that it is not on the NN UI.
But you can get this from JMX: the LastCheckpointTime metric, which here shows 1462223090012 (milliseconds since the Unix epoch).
{
  "name" : "Hadoop:service=NameNode,name=FSNamesystem",
  "modelerType" : "FSNamesystem",
  "tag.Context" : "dfs",
  "tag.HAState" : "standby",
  "tag.TotalSyncTimes" : "",
  "tag.Hostname" : "demo2.cloud.hortonworks.com",
  "MissingBlocks" : 0,
  "MissingReplOneBlocks" : 0,
  "ExpiredHeartbeats" : 0,
  "TransactionsSinceLastCheckpoint" : -8853,
  "TransactionsSinceLastLogRoll" : 0,
  "LastWrittenTransactionId" : 50372,
  "LastCheckpointTime" : 1462223090012,
  "CapacityTotal" : 44338987008,
  "CapacityTotalGB" : 41.0,
  "CapacityUsed" : 4014164154,
  "CapacityUsedGB" : 4.0,
  "CapacityRemaining" : 13009718052,
  "CapacityRemainingGB" : 12.0,
  "CapacityUsedNonDFS" : 27315104802,
  "TotalLoad" : 22,
  "SnapshottableDirectories" : 0,
  "Snapshots" : 0,
  "LockQueueLength" : 0,
  "BlocksTotal" : 1145,
  "NumFilesUnderConstruction" : 4,
  "NumActiveClients" : 4,
  "FilesTotal" : 1385,
  "PendingReplicationBlocks" : 0,
  "UnderReplicatedBlocks" : 0,
  "CorruptBlocks" : 0,
  "ScheduledReplicationBlocks" : 0,
  "PendingDeletionBlocks" : 0,
  "ExcessBlocks" : 0,
  "PostponedMisreplicatedBlocks" : 0,
  "PendingDataNodeMessageCount" : 4,
  "MillisSinceLastLoadedEdits" : 39983,
  "BlockCapacity" : 2097152,
  "StaleDataNodes" : 0,
  "TotalFiles" : 1385,
  "TotalSyncCount" : 0
}
You could also monitor the Standby NN log for this. Below is the log from the Standby NN during a checkpoint.
2016-05-02 17:04:49,810 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(336)) - Triggering checkpoint because it has been 21600 seconds since the last checkpoint, which exceeds the configured interval 21600
2016-05-02 17:04:49,810 INFO namenode.FSImage (FSImage.java:saveNamespace(1090)) - Save namespace ...
2016-05-02 17:04:50,014 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(203)) - Going to retain 2 images with txid >= 50371
2016-05-02 17:04:50,187 INFO namenode.TransferFsImage (TransferFsImage.java:setTimeout(443)) - Image Transfer timeout configured to 60000 milliseconds
2016-05-02 17:04:50,339 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2016-05-02 17:04:50,363 INFO namenode.TransferFsImage (TransferFsImage.java:uploadImageFromStorage(237)) - Uploaded image with txid 59225 to namenode at http://demo1.cloud.hortonworks.com:50070 in 0.222 seconds
2016-05-02 17:04:53,644 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2016-05-02 17:05:19,677 INFO ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(271)) - Triggering log roll on remote NameNode xomdemo1.cloud.hortonworks.com/172.24.64.97:8020
2016-05-02 17:05:21,042 INFO namenode.FSImage (FSImage.java:loadEdits(834)) - Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6fba8d60 expecting start txid #59226
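If you want to poll this from a monitoring script rather than read it by hand, something along these lines might work. It is only a rough sketch: it assumes the NameNode's /jmx servlet is reachable over plain HTTP on the web UI port (50070 in the log above), that it accepts a qry parameter to filter by bean name, and that the response wraps the beans in a top-level "beans" list. The function names fetch_jmx and checkpoint_age_seconds are made up for this example.

```python
import json
import time
from urllib.request import urlopen

# Bean name as it appears in the JMX output above.
FSNAMESYSTEM_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"


def fetch_jmx(namenode_http_address):
    """Fetch the FSNamesystem bean from a NameNode, e.g. 'host:50070'.

    Assumes an unsecured HTTP endpoint; a Kerberized or HTTPS cluster
    would need a different client setup.
    """
    url = "http://%s/jmx?qry=%s" % (namenode_http_address, FSNAMESYSTEM_BEAN)
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def checkpoint_age_seconds(jmx_payload, now_ms=None):
    """Return seconds elapsed since LastCheckpointTime in a parsed /jmx response.

    LastCheckpointTime is treated as milliseconds since the Unix epoch.
    now_ms can be passed explicitly to make the result reproducible.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") == FSNAMESYSTEM_BEAN and "LastCheckpointTime" in bean:
            return (now_ms - bean["LastCheckpointTime"]) / 1000.0
    raise KeyError("LastCheckpointTime not found in JMX payload")
```

For example, checkpoint_age_seconds(fetch_jmx("demo2.cloud.hortonworks.com:50070")) would give the age in seconds, which you could compare against your configured checkpoint interval (21600 seconds in the log above) to raise an alert.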
Created 05-03-2016 05:15 PM
Thanks, that clears it up!
A follow-up question: is there a way to create a widget for Ambari that tells me how long ago the last checkpoint happened? I see that I can create a widget showing the LastCheckpointTime from JMX, but that number is not really intuitive to our administrators.
Created 05-03-2016 05:25 PM
As far as I know, you cannot add custom conversion logic (like converting from epoch to date). But you might get more eyes watching and better results if you open a new question for this and tag it with ambari and ambari-metrics.
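Outside of Ambari, the conversion itself is simple if you are willing to run a small script. The sketch below is not an Ambari widget and does not use any Ambari API; it only shows how the raw LastCheckpointTime value (assumed to be milliseconds since the Unix epoch) could be turned into a readable UTC timestamp and an elapsed duration. The helper names epoch_ms_to_utc and format_age are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(epoch_ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(int(epoch_ms), 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


def format_age(seconds):
    """Render a non-negative duration in seconds as 'Hh MMm SSs'."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "%dh %02dm %02ds" % (hours, minutes, secs)


if __name__ == "__main__":
    last_checkpoint_ms = 1462223090012  # value from the JMX output in this thread
    last_checkpoint = epoch_ms_to_utc(last_checkpoint_ms)
    age = datetime.now(timezone.utc) - last_checkpoint
    print("Last checkpoint (UTC):", last_checkpoint.isoformat())
    print("Time since last checkpoint:", format_age(age.total_seconds()))
```

The value from this thread converts to 2016-05-02 21:04:50 UTC, which lines up with the 17:04:49 checkpoint in the Standby NN log if that host logs in a UTC-4 timezone.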