How to alert when datanode has a disk/volume failure?

We have multiple disk mounts (24) on each of our DataNodes and have set dfs.datanode.failed.volumes.tolerated to 2 so that the DataNode process does not go down in case it encounters 1-2 disk failures. However, the problem is that we have no alert when a disk failure does occur (which we need so that we can take action and replace the disk). How do I configure an alert in Ambari to do this?

8 REPLIES

Super Guru

@Vedant Biyani

Unfortunately, there is nothing in Ambari to help monitor disk failures in the way you described. Usually this is done with separate enterprise monitoring software, e.g. OpenView, BMC, etc.

As you already mentioned, the failure tolerance for disks is configurable via dfs.datanode.failed.volumes.tolerated, but once that tolerance is exceeded the whole node is marked as failed, and rebalancing its data wastes space and time. It is good to know as soon as a single drive has failed.

If you can't use one of those specialized tools to monitor disks, one workaround would be to set the "DataNode Health Summary" alert threshold so that it alerts you on the first failed DataNode.

=====

If this response is helpful, please vote and accept.

@Constantin Stanca

I do see the failed volume count in the NameNode UI (under the DataNodes information page). If not through Ambari, is there a dfsadmin command that can report on it, so that we can go the custom-script route?

I have concerns about "DataNode Health Summary", especially because it is only triggered once a DataNode is down, at which point we are looking not only at network congestion (from rebalancing and re-replication) but also at multiple disk replacements at the same time, a situation that can be avoided if we are notified at the first volume/disk failure.

Super Collaborator (Accepted Solution)

It depends on how you want to monitor the failed disks. You can always write your own script alert in Python to check the various disks. However, if the NameNode exposes a JMX metric for this, you can create a much simpler metric alert.

It seems that Hadoop:service=NameNode,name=NameNodeInfo/LiveNodes contains escaped JSON for every DataNode. A metric alert can't parse that, but there is a simpler global failed-volume metric:

Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal

You could use that metric to monitor failures. If either of these approaches sounds feasible, I can try to point you in the right direction for creating the alert.
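
For reference, the script-alert route could start from a rough sketch like the following (the NameNode host below is a placeholder, and 50070 assumes the default non-SSL NameNode HTTP port; Python 3 standard library only):

#!/usr/bin/env python
# Rough sketch of the script-alert route (not production code): poll the
# NameNode JMX servlet and exit nonzero when any volume has failed.
import json
import sys
import urllib.request

# Placeholders: adjust the host/port for your cluster.
NN_JMX = "http://nn-host.example.com:50070/jmx"
BEAN = "Hadoop:service=NameNode,name=FSNamesystemState"

def failed_volumes():
    # The servlet's qry= parameter filters by bean name only; the caller
    # picks the individual property out of the returned JSON.
    url = "%s?qry=%s" % (NN_JMX, BEAN)
    with urllib.request.urlopen(url, timeout=5) as resp:
        beans = json.load(resp)["beans"]
    return beans[0]["VolumeFailuresTotal"]

if __name__ == "__main__":
    count = failed_volumes()
    print("VolumeFailuresTotal = %d" % count)
    sys.exit(1 if count > 0 else 0)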

Yes, monitoring VolumeFailuresTotal does sound feasible to me. Can you guide me on how an alert for it would work?

Super Collaborator

Sure, you'd need to execute a POST to create the new alert:

POST api/v1/clusters/<cluster-name>/alert_definitions

{
  "AlertDefinition": {
    "component_name": "NAMENODE",
    "description": "This service-level alert is triggered if the total number of volume failures across the cluster is greater than the configured critical threshold.",
    "enabled": true,
    "help_url": null,
    "ignore_host": false,
    "interval": 2,
    "label": "NameNode Volume Failures",
    "name": "namenode_volume_failures",
    "scope": "ANY",
    "service_name": "HDFS",
    "source": {
      "jmx": {
        "property_list": [
          "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
        ],
        "value": "{0}"
      },
      "reporting": {
        "ok": {
          "text": "There are {0} volume failures"
        },
        "warning": {
          "text": "There are {0} volume failures",
          "value": 1
        },
        "critical": {
          "text": "There are {0} volume failures",
          "value": 1
        },
        "units": "Volume(s)"
      },
      "type": "METRIC",
      "uri": {
        "http": "{{hdfs-site/dfs.namenode.http-address}}",
        "https": "{{hdfs-site/dfs.namenode.https-address}}",
        "https_property": "{{hdfs-site/dfs.http.policy}}",
        "https_property_value": "HTTPS_ONLY",
        "kerberos_keytab": "{{hdfs-site/dfs.web.authentication.kerberos.keytab}}",
        "kerberos_principal": "{{hdfs-site/dfs.web.authentication.kerberos.principal}}",
        "default_port": 0,
        "connection_timeout": 5,
        "high_availability": {
          "nameservice": "{{hdfs-site/dfs.internal.nameservices}}",
          "alias_key": "{{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}",
          "http_pattern": "{{hdfs-site/dfs.namenode.http-address.{{ha-nameservice}}.{{alias}}}}",
          "https_pattern": "{{hdfs-site/dfs.namenode.https-address.{{ha-nameservice}}.{{alias}}}}"
        }
      }
    }
  }
}

This will create a new METRIC alert which runs every 2 minutes.
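
If it helps, here is one way that POST could be issued (a sketch only; the Ambari host, credentials, and cluster name are placeholders, and note that Ambari rejects write requests that lack the X-Requested-By header):

# Sketch: create the alert definition through the Ambari REST API.
import base64
import json
import urllib.request

AMBARI = "http://ambari-host.example.com:8080"  # placeholder host
CLUSTER = "mycluster"                           # placeholder cluster name

# The AlertDefinition payload shown above, saved to a local file.
with open("namenode_volume_failures.json") as f:
    definition = json.load(f)

req = urllib.request.Request(
    "%s/api/v1/clusters/%s/alert_definitions" % (AMBARI, CLUSTER),
    data=json.dumps(definition).encode("utf-8"),
    method="POST",
)
req.add_header("X-Requested-By", "ambari")  # required on Ambari write calls
req.add_header("Authorization",
               "Basic " + base64.b64encode(b"admin:admin").decode("ascii"))
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # expect 201 (Created)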

@Jonathan Hurley

Is this correct?

Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal

I just tried the query from my browser before creating a new alert, and it's returning an empty MBean list:

{
  "beans" : [ ]
}

Meanwhile, if I use just Hadoop:service=NameNode,name=FSNamesystemState, it shows the expected MBean:

{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=FSNamesystemState",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
    "CapacityTotal" : xxxxxxxxxx,
    "CapacityUsed" : xxxxxxxxxx,
    "CapacityRemaining" : xxxxxxxxxx,
    "TotalLoad" : xxxxxxxxxx,
    "SnapshotStats" : "{\"SnapshottableDirectories\":0,\"Snapshots\":0}",
    "FsLockQueueLength" : xxxxxxxxxx,
    "BlocksTotal" : xxxxxxxxxx,
    "MaxObjects" : xxxxxxxxxx,
    "FilesTotal" : xxxxxxxxxx,
    "PendingReplicationBlocks" : xxxxxxxxxx,
    "UnderReplicatedBlocks" : xxxxxxxxxx,
    "ScheduledReplicationBlocks" : xxxxxxxxxx,
    "PendingDeletionBlocks" : xxxxxxxxxx,
    "BlockDeletionStartTime" : xxxxxxxxxx,
    "FSState" : "Operational",
    "NumLiveDataNodes" : xxxxxxxxxx,
    "NumDeadDataNodes" : xxxxxxxxxx,
    "NumDecomLiveDataNodes" : xxxxxxxxxx,
    "NumDecomDeadDataNodes" : xxxxxxxxxx,
    "VolumeFailuresTotal" : 2,
    "EstimatedCapacityLostTotal" : xxxxxxxxxx,
    "NumDecommissioningDataNodes" : xxxxxxxxxx,
    "NumStaleDataNodes" : xxxxxxxxxx,
    "NumStaleStorages" : xxxxxxxxxx,
    "TopUserOpCounts" : "{\"timestamp\":\"2017-02-16T23:20:22+0000\",\"windows\":[{\"windowLenMs\":300000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":11},{\"user\":\"mapred\",\"count\":5},{\"user\":\"yarn\",\"count\":4}],\"totalCount\":20},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":8},{\"user\":\"mapred\",\"count\":5},{\"user\":\"yarn\",\"count\":4}],\"totalCount\":17}]},{\"windowLenMs\":1500000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":28},{\"user\":\"mapred\",\"count\":20},{\"user\":\"yarn\",\"count\":11}],\"totalCount\":59},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":26},{\"user\":\"mapred\",\"count\":20},{\"user\":\"yarn\",\"count\":11}],\"totalCount\":57},{\"opType\":\"getfileinfo\",\"topUsers\":[{\"user\":\"hbase\",\"count\":6}],\"totalCount\":6}]},{\"windowLenMs\":60000,\"ops\":[{\"opType\":\"*\",\"topUsers\":[{\"user\":\"hbase\",\"count\":1}],\"totalCount\":1},{\"opType\":\"listStatus\",\"topUsers\":[{\"user\":\"hbase\",\"count\":1}],\"totalCount\":1}]}]}",
    "TotalSyncCount" : 1,
    "TotalSyncTimes" : "3 8 8 "
  } ]
}

Super Collaborator

Yes, my example is correct. There is no way to query the servlet directly for a specific property; you can only query by bean name. However, for alerts, we use a slash as a delimiter: the metric alert strips off "VolumeFailuresTotal", retrieves the "Hadoop:service=NameNode,name=FSNamesystemState" bean, and then extracts the "VolumeFailuresTotal" property from it.
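
Put differently, the framework splits the name on the last slash and does the extraction client-side, roughly like this (illustrative only, not Ambari's actual code):

# Illustrative only: how the slash-delimited metric name is interpreted.
metric = "Hadoop:service=NameNode,name=FSNamesystemState/VolumeFailuresTotal"
bean_name, prop = metric.rsplit("/", 1)
print(bean_name)  # sent to the servlet as ?qry=...
print(prop)       # read from the returned bean's JSON, e.g. 2 in your output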

Explorer

You could use Telegraf as a metrics collector and sink the DataNode volume metrics (and all master-service metrics, if need be) to Graphite. Then you can use Graphite as a data source in Grafana and alert on volume failures. This is an end-to-end enterprise solution if you want to start monitoring your cluster. Telegraf also provides an out-of-the-box way to monitor host-level services.
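
If a full Telegraf deployment is more than you need, the same idea can be sketched in a few lines of Python using Graphite's plaintext protocol (hostnames below are placeholders; 2003 is Carbon's default plaintext port):

# Sketch: read VolumeFailuresTotal from the NameNode JMX servlet and push
# it to Graphite's plaintext (Carbon) listener. Hostnames are placeholders.
import json
import socket
import time
import urllib.request

NN_JMX = ("http://nn-host.example.com:50070/jmx"
          "?qry=Hadoop:service=NameNode,name=FSNamesystemState")
GRAPHITE = ("graphite-host.example.com", 2003)

with urllib.request.urlopen(NN_JMX, timeout=5) as resp:
    failures = json.load(resp)["beans"][0]["VolumeFailuresTotal"]

# Graphite plaintext protocol: "<metric.path> <value> <unix-timestamp>\n"
line = "hdfs.namenode.volume_failures_total %d %d\n" % (failures, int(time.time()))
with socket.create_connection(GRAPHITE, timeout=5) as sock:
    sock.sendall(line.encode("ascii"))

Run it from cron (or adapt it into an Ambari script alert) and point a Grafana alert at the resulting series.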