Member since: 02-15-2017
Posts: 5
Kudos Received: 1
Solutions: 0
02-16-2017
11:27 PM
@Jonathan Hurley Is this correct? Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal

I just tried the query from my browser before creating a new alert, and it returns an empty MBean:

{
  "beans" : [ ]
}

Meanwhile, if I use just Hadoop:service=NameNode,name=FSNamesystemState, it shows the expected MBean:

{
"beans" : [ {
"name" : "Hadoop:service=NameNode,name=FSNamesystemState",
"modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
"CapacityTotal" : xxxxxxxxxx,
"CapacityUsed" : xxxxxxxxxx,
"CapacityRemaining" : xxxxxxxxxx,
"TotalLoad" : xxxxxxxxxx,
"SnapshotStats" : "{\"SnapshottableDirectories\":0,\"Snapshots\":0}",
"FsLockQueueLength" : xxxxxxxxxx,
"BlocksTotal" : xxxxxxxxxx,
"MaxObjects" : xxxxxxxxxx,
"FilesTotal" : xxxxxxxxxx,
"PendingReplicationBlocks" : xxxxxxxxxx,
"UnderReplicatedBlocks" : xxxxxxxxxx,
"ScheduledReplicationBlocks" : xxxxxxxxxx,
"PendingDeletionBlocks" : xxxxxxxxxx,
"BlockDeletionStartTime" : xxxxxxxxxx,
"FSState" : "Operational",
"NumLiveDataNodes" : xxxxxxxxxx,
"NumDeadDataNodes" : xxxxxxxxxx,
"NumDecomLiveDataNodes" : xxxxxxxxxx,
"NumDecomDeadDataNodes" : xxxxxxxxxx,
"VolumeFailuresTotal" : 2,
"EstimatedCapacityLostTotal" : xxxxxxxxxx,
"NumDecommissioningDataNodes" : xxxxxxxxxx,
"NumStaleDataNodes" : xxxxxxxxxx,
"NumStaleStorages" : xxxxxxxxxx,
"TopUserOpCounts" : "{\"timestamp\":\"2017-02-16T23:20:22+0000\",\"windows\":[{\"windowLenMs\":300000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":11},{\"user\":\"mapred\",\"count\":5},{\"user\":\"yarn\",\"count\":4}],\"totalCount\":20},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":8},{\"user\":\"mapred\",\"count\":5},{\"user\":\"yarn\",\"count\":4}],\"totalCount\":17}]},{\"windowLenMs\":1500000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":28},{\"user\":\"mapred\",\"count\":20},{\"user\":\"yarn\",\"count\":11}],\"totalCount\":59},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":26},{\"user\":\"mapred\",\"count\":20},{\"user\":\"yarn\",\"count\":11}],\"totalCount\":57},{\"opType\":\"getfileinfo\",\"topUsers\":[{\"user\":\"hbase\",\"count\":6}],\"totalCount\":6}]},{\"windowLenMs\":60000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":1}],\"totalCount\":1},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":1}],\"totalCount\":1}]}]}",
"TotalSyncCount" : 1,
"TotalSyncTimes" : "3 8 8 "
} ]
}
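A side note for anyone hitting the same empty result: "/VolumeFailuresTotal" appears to be Ambari's property_list notation for selecting an attribute, not part of the MBean name, so the ?qry= filter matches no bean. A minimal sketch of pulling the value out of the servlet response instead, assuming the NameNode web UI is plain HTTP on nn-host:50070 (host, port, and the lack of SSL/Kerberos are assumptions here):

# Sketch: read VolumeFailuresTotal from the NameNode JMX servlet.
# Assumptions: NameNode HTTP UI at nn-host:50070, no SSL/Kerberos.
import json
import urllib.request

NN_JMX = "http://nn-host:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"

with urllib.request.urlopen(NN_JMX) as resp:
    beans = json.load(resp)["beans"]

if not beans:
    raise SystemExit("FSNamesystemState MBean not found")

# The bean carries the cluster-wide count of failed DataNode volumes.
print("VolumeFailuresTotal =", beans[0]["VolumeFailuresTotal"])

(The JMX servlet also supports a ?get=<MBean>::<Attribute> form for fetching a single attribute, which may be what was intended with the slash-suffixed query.)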
02-16-2017
06:35 PM
Yes, monitoring VolumeFailuresTotal does sound feasible to me. Can you guide me on how an alert for it would work?
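For future readers, one way such an alert could be wired up is as an Ambari METRIC alert definition that reads exactly that property path and is POSTed to the alert_definitions REST endpoint. The sketch below only illustrates the shape of such a definition; the Ambari host, cluster name, credentials, thresholds, and reporting text are placeholders and assumptions, not a tested definition:

# Sketch: create an Ambari METRIC alert on VolumeFailuresTotal via the REST API.
# Assumptions: Ambari at ambari-host:8080, cluster "mycluster", admin/admin
# credentials; thresholds and messages are illustrative only.
import json
import requests  # third-party; pip install requests

definition = {
    "AlertDefinition": {
        "name": "namenode_volume_failures",
        "label": "NameNode Volume Failures",
        "service_name": "HDFS",
        "component_name": "NAMENODE",
        "interval": 5,          # minutes between checks
        "scope": "ANY",
        "enabled": True,
        "source": {
            "type": "METRIC",
            "uri": {
                "http": "{{hdfs-site/dfs.namenode.http-address}}",
                "https": "{{hdfs-site/dfs.namenode.https-address}}",
                "https_property": "{{hdfs-site/dfs.http.policy}}",
                "https_property_value": "HTTPS_ONLY",
                "default_port": 50070
            },
            "jmx": {
                "property_list": [
                    "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
                ],
                "value": "{0}"
            },
            "reporting": {
                "ok": {"text": "No failed volumes"},
                "warning": {"text": "{0} failed volume(s)", "value": 1},
                "critical": {"text": "{0} failed volume(s)", "value": 3}
            }
        }
    }
}

resp = requests.post(
    "http://ambari-host:8080/api/v1/clusters/mycluster/alert_definitions",
    auth=("admin", "admin"),
    headers={"X-Requested-By": "ambari"},
    data=json.dumps(definition),
)
resp.raise_for_status()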
02-15-2017
10:19 PM
@Constantin Stanca I do see the failed volume count in the NameNode UI (under the DataNodes information page). If not through Ambari, is there a dfsadmin command that can report it, so that we can go the custom-script route? I have concerns with "DataNode Health Summary", mainly because it only triggers once the DataNode is down, at which point we are facing not only network strain (for rebalancing and re-replication) but also multiple disk replacements at the same time, a situation that can be avoided if we are notified at the very first volume/disk failure.
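On the custom-script route mentioned above: Ambari SCRIPT alerts are plain Python modules exposing get_tokens() and execute(). A rough skeleton, assuming the agent runs it with its bundled Python 2 interpreter and that the NameNode HTTP address token resolves from hdfs-site; the messages and the choice to go straight to CRITICAL are illustrative, not from this thread:

# Sketch of an Ambari SCRIPT alert that reads VolumeFailuresTotal from the
# NameNode JMX servlet. Written for the Python 2 interpreter the Ambari agent
# ships with; the config token and messages below are assumptions.
import json
import urllib2

NN_HTTP_ADDRESS_KEY = '{{hdfs-site/dfs.namenode.http-address}}'

def get_tokens():
    # Config properties Ambari should resolve and pass into execute().
    return (NN_HTTP_ADDRESS_KEY,)

def execute(configurations={}, parameters=[], host_name=None):
    address = configurations.get(NN_HTTP_ADDRESS_KEY)
    if not address:
        return ('UNKNOWN', ['NameNode HTTP address not found in configuration'])

    url = 'http://{0}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'.format(address)
    try:
        beans = json.load(urllib2.urlopen(url, timeout=10))['beans']
        failed = int(beans[0]['VolumeFailuresTotal'])
    except Exception as e:
        return ('UNKNOWN', ['Unable to query NameNode JMX: {0}'.format(e)])

    if failed == 0:
        return ('OK', ['No failed volumes reported'])
    return ('CRITICAL', ['{0} failed volume(s) reported by the NameNode'.format(failed)])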
02-15-2017
08:52 PM
1 Kudo
We have multiple disk mounts (24) on each DataNode and have set dfs.datanode.failed.volumes.tolerated to 2 so that the DataNode process does not go down in case it encounters 1-2 disk failures. However, the problem is that we get no alert when a disk failure does occur, and we need one so that we can take action and replace the disk. How do I configure an alert in Ambari to do this?
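As an interim measure, the same number can also be polled from outside Ambari and compared with the configured tolerance. A small cron-style sketch, assuming the hdfs CLI is on the PATH and the NameNode UI is reachable at nn-host:50070 (both assumptions):

# Sketch: cron-style check that flags the first failed volume, well before
# any single DataNode reaches dfs.datanode.failed.volumes.tolerated.
# Assumes the hdfs CLI is on PATH and the NameNode UI is http://nn-host:50070.
import json
import subprocess
import urllib.request

tolerated = int(subprocess.check_output(
    ["hdfs", "getconf", "-confKey", "dfs.datanode.failed.volumes.tolerated"]
).strip())

url = "http://nn-host:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
with urllib.request.urlopen(url) as resp:
    failed = json.load(resp)["beans"][0]["VolumeFailuresTotal"]

print("tolerated per DataNode: %d, failed volumes cluster-wide: %d" % (tolerated, failed))
if failed > 0:
    raise SystemExit(1)  # non-zero exit so cron/monitoring can flag it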
Labels:
- Apache Ambari
- Apache Hadoop