Member since: 02-15-2017
Posts: 5
Kudos Received: 1
Solutions: 0
02-16-2017
11:27 PM
@Jonathan Hurley Is this correct? Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal

I just tried the query from my browser before creating a new alert, and it returns an empty MBean:

{
  "beans" : [ ]
}

Meanwhile, if I use just Hadoop:service=NameNode,name=FSNamesystemState, it shows the expected MBean:

{
"beans" : [ {
"name" : "Hadoop:service=NameNode,name=FSNamesystemState",
"modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
"CapacityTotal" : xxxxxxxxxx,
"CapacityUsed" : xxxxxxxxxx,
"CapacityRemaining" : xxxxxxxxxx,
"TotalLoad" : xxxxxxxxxx,
"SnapshotStats" : "{\"SnapshottableDirectories\":0,\"Snapshots\":0}",
"FsLockQueueLength" : xxxxxxxxxx,
"BlocksTotal" : xxxxxxxxxx,
"MaxObjects" : xxxxxxxxxx,
"FilesTotal" : xxxxxxxxxx,
"PendingReplicationBlocks" : xxxxxxxxxx,
"UnderReplicatedBlocks" : xxxxxxxxxx,
"ScheduledReplicationBlocks" : xxxxxxxxxx,
"PendingDeletionBlocks" : xxxxxxxxxx,
"BlockDeletionStartTime" : xxxxxxxxxx,
"FSState" : "Operational",
"NumLiveDataNodes" : xxxxxxxxxx,
"NumDeadDataNodes" : xxxxxxxxxx,
"NumDecomLiveDataNodes" : xxxxxxxxxx,
"NumDecomDeadDataNodes" : xxxxxxxxxx,
"VolumeFailuresTotal" : 2,
"EstimatedCapacityLostTotal" : xxxxxxxxxx,
"NumDecommissioningDataNodes" : xxxxxxxxxx,
"NumStaleDataNodes" : xxxxxxxxxx,
"NumStaleStorages" : xxxxxxxxxx,
"TopUserOpCounts" : "{\"timestamp\":\"2017-02-16T23:20:22+0000\",\"windows\":[{\"windowLenMs\":300000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":11},{\"user\":\"mapred\",\"count\":5},{\"user\":\"yarn\",\"count\":4}],\"totalCount\":20},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":8},{\"user\":\"mapred\",\"count\":5},{\"user\":\"yarn\",\"count\":4}],\"totalCount\":17}]},{\"windowLenMs\":1500000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":28},{\"user\":\"mapred\",\"count\":20},{\"user\":\"yarn\",\"count\":11}],\"totalCount\":59},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":26},{\"user\":\"mapred\",\"count\":20},{\"user\":\"yarn\",\"count\":11}],\"totalCount\":57},{\"opType\":\"getfileinfo\",\"topUsers\":[{\"user\":\"hbase\",\"count\":6}],\"totalCount\":6}]},{\"windowLenMs\":60000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":1}],\"totalCount\":1},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":1}],\"totalCount\":1}]}]}",
"TotalSyncCount" : 1,
"TotalSyncTimes" : "3 8 8 "
} ]
}
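A side note for anyone hitting the same empty result: "/VolumeFailuresTotal" appears to be Ambari's property_list notation for selecting an attribute, not part of the MBean name, so the ?qry= filter matches no bean. A minimal sketch of pulling the value out of the servlet response instead, assuming the NameNode web UI is plain HTTP on nn-host:50070 (host, port, and the lack of SSL/Kerberos are assumptions here):

# Sketch: read VolumeFailuresTotal from the NameNode JMX servlet.
# Assumptions: NameNode HTTP UI at nn-host:50070, no SSL/Kerberos.
import json
import urllib.request

NN_JMX = "http://nn-host:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"

with urllib.request.urlopen(NN_JMX) as resp:
    beans = json.load(resp)["beans"]

if not beans:
    raise SystemExit("FSNamesystemState MBean not found")

# The bean carries the cluster-wide count of failed DataNode volumes.
print("VolumeFailuresTotal =", beans[0]["VolumeFailuresTotal"])

(The JMX servlet also supports a ?get=<MBean>::<Attribute> form for fetching a single attribute, which may be what was intended with the slash-suffixed query.)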
02-16-2017
06:35 PM
Yes, monitoring VolumeFailuresTotal does sound feasible to me. Can you guide me on how an alert for it would work?
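For future readers, one way such an alert could be wired up is as an Ambari METRIC alert definition that reads exactly that property path and is POSTed to the alert_definitions REST endpoint. The sketch below only illustrates the shape of such a definition; the Ambari host, cluster name, credentials, thresholds, and reporting text are placeholders and assumptions, not a tested definition:

# Sketch: create an Ambari METRIC alert on VolumeFailuresTotal via the REST API.
# Assumptions: Ambari at ambari-host:8080, cluster "mycluster", admin/admin
# credentials; thresholds and messages are illustrative only.
import json
import requests  # third-party; pip install requests

definition = {
    "AlertDefinition": {
        "name": "namenode_volume_failures",
        "label": "NameNode Volume Failures",
        "service_name": "HDFS",
        "component_name": "NAMENODE",
        "interval": 5,          # minutes between checks
        "scope": "ANY",
        "enabled": True,
        "source": {
            "type": "METRIC",
            "uri": {
                "http": "{{hdfs-site/dfs.namenode.http-address}}",
                "https": "{{hdfs-site/dfs.namenode.https-address}}",
                "https_property": "{{hdfs-site/dfs.http.policy}}",
                "https_property_value": "HTTPS_ONLY",
                "default_port": 50070
            },
            "jmx": {
                "property_list": [
                    "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
                ],
                "value": "{0}"
            },
            "reporting": {
                "ok": {"text": "No failed volumes"},
                "warning": {"text": "{0} failed volume(s)", "value": 1},
                "critical": {"text": "{0} failed volume(s)", "value": 3}
            }
        }
    }
}

resp = requests.post(
    "http://ambari-host:8080/api/v1/clusters/mycluster/alert_definitions",
    auth=("admin", "admin"),
    headers={"X-Requested-By": "ambari"},
    data=json.dumps(definition),
)
resp.raise_for_status()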
02-15-2017
10:19 PM
@Constantin Stanca I do see the failed volume count in the NameNode UI (under the DataNodes information page). If not through Ambari, is there a dfsadmin command that can report it, so that we can go the custom-script route? I have concerns with "DataNode Health Summary", mainly because it only triggers once the DataNode is down, at which point we are facing not only network strain (for rebalancing and re-replication) but also multiple disk replacements at the same time, a situation that can be avoided if we are notified at the very first volume/disk failure.
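On the custom-script route mentioned above: Ambari SCRIPT alerts are plain Python modules exposing get_tokens() and execute(). A rough skeleton, assuming the agent runs it with its bundled Python 2 interpreter and that the NameNode HTTP address token resolves from hdfs-site; the messages and the choice to go straight to CRITICAL are illustrative, not from this thread:

# Sketch of an Ambari SCRIPT alert that reads VolumeFailuresTotal from the
# NameNode JMX servlet. Written for the Python 2 interpreter the Ambari agent
# ships with; the config token and messages below are assumptions.
import json
import urllib2

NN_HTTP_ADDRESS_KEY = '{{hdfs-site/dfs.namenode.http-address}}'

def get_tokens():
    # Config properties Ambari should resolve and pass into execute().
    return (NN_HTTP_ADDRESS_KEY,)

def execute(configurations={}, parameters=[], host_name=None):
    address = configurations.get(NN_HTTP_ADDRESS_KEY)
    if not address:
        return ('UNKNOWN', ['NameNode HTTP address not found in configuration'])

    url = 'http://{0}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'.format(address)
    try:
        beans = json.load(urllib2.urlopen(url, timeout=10))['beans']
        failed = int(beans[0]['VolumeFailuresTotal'])
    except Exception as e:
        return ('UNKNOWN', ['Unable to query NameNode JMX: {0}'.format(e)])

    if failed == 0:
        return ('OK', ['No failed volumes reported'])
    return ('CRITICAL', ['{0} failed volume(s) reported by the NameNode'.format(failed)])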
02-15-2017
08:52 PM
1 Kudo
We have multiple disk mounts (24) on each DataNode and have set dfs.datanode.failed.volumes.tolerated to 2 so that the DataNode process does not go down in case it encounters 1-2 disk failures. However, the problem is that we get no alert when a disk failure does occur, and we need one so that we can take action and replace the disk. How do I configure an alert in Ambari to do this?
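As an interim measure, the same number can also be polled from outside Ambari and compared with the configured tolerance. A small cron-style sketch, assuming the hdfs CLI is on the PATH and the NameNode UI is reachable at nn-host:50070 (both assumptions):

# Sketch: cron-style check that flags the first failed volume, well before
# any single DataNode reaches dfs.datanode.failed.volumes.tolerated.
# Assumes the hdfs CLI is on PATH and the NameNode UI is http://nn-host:50070.
import json
import subprocess
import urllib.request

tolerated = int(subprocess.check_output(
    ["hdfs", "getconf", "-confKey", "dfs.datanode.failed.volumes.tolerated"]
).strip())

url = "http://nn-host:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
with urllib.request.urlopen(url) as resp:
    failed = json.load(resp)["beans"][0]["VolumeFailuresTotal"]

print("tolerated per DataNode: %d, failed volumes cluster-wide: %d" % (tolerated, failed))
if failed > 0:
    raise SystemExit(1)  # non-zero exit so cron/monitoring can flag it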
Labels:
- Apache Ambari
- Apache Hadoop