Device Behavior Analytics lets administrators detect straggler nodes and disks that degrade cluster performance. Finding and fixing them can improve overall cluster throughput. The feature is available in HDP-2.6.1 and later. It has two parts: Slow Datanode Detection and Slow Disk Detection.
SLOW DATANODE (PEER) DETECTION
Datanodes collect latency statistics about their peers during normal operation of the Datanode write pipeline. These statistics are used to detect outliers among the peers. Slow peer detection is not performed unless the Datanode has statistics for at least 10 peers. The Namenode maintains the list of slow peers, and administrators can read it via Namenode JMX. Datanodes also expose the average write latency of their peers through Datanode JMX.
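The outlier check described above can be sketched as follows. This is a minimal illustration, not the detector shipped with HDFS: the function name find_slow_peers and the "median plus three median absolute deviations" threshold are assumptions chosen for the example. Only the 10-peer minimum comes from the text.

```python
from statistics import median

MIN_PEERS = 10  # from the text: no detection with statistics for fewer than 10 peers


def find_slow_peers(avg_latency_ms, min_peers=MIN_PEERS, mad_multiplier=3.0):
    """Return the peers whose average write latency looks like an outlier.

    avg_latency_ms maps a peer name to its average write latency in ms.
    The median + k * MAD threshold is an assumption made for illustration.
    """
    if len(avg_latency_ms) < min_peers:
        return {}  # too few peers to judge outliers
    values = list(avg_latency_ms.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that one slow peer cannot skew.
    mad = median([abs(v - med) for v in values])
    threshold = med + mad_multiplier * mad
    return {peer: lat for peer, lat in avg_latency_ms.items() if lat > threshold}
```

A median-based threshold is used here because a single very slow peer would inflate a mean-based threshold and could hide itself.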
SLOW DISK DETECTION
Each Datanode collects I/O statistics from all its disks. To limit the performance impact of I/O profiling, administrators can configure the percentage of file I/O events that is sampled. Slow disk detection is not performed unless the Datanode has at least 5 disks. Slow disk information is available via Datanode JMX. The Namenode also exposes the slowest disks in the cluster via Namenode JMX.
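Percentage-based sampling of I/O events can be sketched like this. The helper name should_sample is hypothetical, and treating 0 as "profiling disabled" is an assumption inferred from the 1 to 100 range mentioned later in this article; the Datanode's actual sampling code may differ.

```python
import random


def should_sample(sampling_percentage, rng=random):
    """Decide whether to profile a single file I/O event.

    sampling_percentage: 100 profiles every event; 0 (assumed to mean
    "disabled") profiles none.
    """
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling percentage must be between 0 and 100")
    # rng.random() is in [0.0, 1.0), so 100 always samples and 0 never does.
    return rng.random() * 100 < sampling_percentage
```

Lower percentages trade statistical detail for less profiling overhead on the I/O path.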
ENABLING DEVICE BEHAVIOR ANALYTICS
To enable Device Behavior Analytics, the following configurations must be set.
Set to true to enable Slow Datanode Detection.
Set to a value between 1 and 100 to enable Slow Disk Detection. This setting controls the percentage of file I/O events that are sampled for profiling disk statistics. A value of 100 samples all disk I/O. Sampling a large percentage of disk I/O events might have a small performance impact.
This setting controls how frequently Datanodes report their peer latencies to the Namenode via heartbeat, and how frequently each Datanode runs disk outlier detection. The default value is 30 minutes.
Add these settings to hdfs-site.xml. On an Ambari-installed cluster, add them under custom hdfs-site.xml.
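As a rough aid, the snippet below renders the three settings as property entries that could be pasted inside the configuration element of hdfs-site.xml. The property names are not preserved in this article; the ones used here are assumptions taken from Apache Hadoop's hdfs-default.xml and should be verified against your HDP version. The sampling value of 25 is an arbitrary example, and render_properties is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed property names (from Apache Hadoop's hdfs-default.xml); verify
# them against your HDP release before use. Values are examples only.
SETTINGS = {
    "dfs.datanode.peer.stats.enabled": "true",                  # slow Datanode detection
    "dfs.datanode.fileio.profiling.sampling.percentage": "25",  # slow disk detection
    "dfs.datanode.outliers.report.interval": "30m",             # reporting interval
}


def render_properties(settings):
    """Return one <property> element per setting, as text."""
    chunks = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        chunks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(chunks)


print(render_properties(SETTINGS))
```

On an Ambari-managed cluster the same name and value pairs would be entered under custom hdfs-site.xml rather than edited by hand.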
SAMPLE JMX OUTPUTS
Sample Namenode JMX output reporting slow nodes
Sample Datanode JMX output showing average write latency of peers
Sample Datanode JMX output showing slow disks
The Datanode JMX output above reports disk3 and disk4 as outliers for the Datanode. The JMX also reports the latencies and number of operations per volume in another metric. The sample JMX output below for DataNodeVolume information for disk3 shows high average latencies for metadata and write IO operations.
Sample Namenode JMX output reporting slow disks and their latencies
Please follow the blog post link here for a detailed explanation of Device Behavior Analytics.