Created on 04-15-2018 11:36 AM
In part 1 and part 2 of this series, I explained various parameters that can be tuned to achieve optimized performance from HBase. In part 3, we discussed some scenarios and aspects to focus on while investigating performance issues. Continuing the series, part 4 covers system- and network-level investigations.
Along with investigating potential issues at the HBase and HDFS layers, we must not ignore the system side of things: OS, network, and disks. We see several cases every day in which severe issues at this layer are identified. A detailed investigation is beyond the scope of this article, but we should know where to look. The trigger to investigate the system side is messages such as the following in DataNode logs at the time of a performance issue:
WARN datanode.DataNode (BlockReceiver.java:receivePacket(694)) - Slow BlockReceiver write data to disk cost:317ms (threshold=300ms)
Following are some of the tests we can run to ascertain disk performance:
- Run a dd test to check read and write throughput and latencies.
For checking write throughput:
dd bs=1M count=10000 if=/dev/zero of=/data01/test.img conv=fdatasync
For checking read throughput:
dd if=/data01/test.img of=/dev/null bs=1M count=10000
Where /data01 is one of your DataNode data disks.
- To check latencies during read or write, prepend the “time” command to the commands above; it tells you how long the operations took and whether the delay was on the user or system side. Compare these results with the throughput agreed upon with your storage vendor or cloud service provider.
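The steps above can be sketched as a single script. This is an illustration only: it writes to a scratch file under /tmp so it is safe to run anywhere, but on a real cluster you should point TESTFILE at the disk under test (e.g. /data01), as in the commands above.

```shell
# Create a scratch file; replace /tmp with a DataNode data dir (e.g. /data01)
# to measure the actual data disk rather than the root/tmp filesystem.
TESTFILE=$(mktemp /tmp/ddtest.XXXXXX)

# Write test: conv=fdatasync forces the data to disk before dd exits,
# so the measured time reflects real disk writes, not the page cache.
time dd bs=1M count=50 if=/dev/zero of="$TESTFILE" conv=fdatasync

# Read test: read the file back and discard the data.
time dd if="$TESTFILE" of=/dev/null bs=1M

rm -f "$TESTFILE"
```

The "real" time reported includes queueing and device latency end to end; a large gap between "real" and "sys" time usually points at the device or its queues rather than the kernel.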
- Another important tool is the Linux “iostat” command, which provides a great deal of advanced diagnostic information, such as how long an I/O request waited in the I/O scheduler queue or the disk controller queue, how many requests were waiting in those queues, and how long the disk took to complete an I/O operation.
- This command can tell you whether your workload is far beyond your disks' capacity, or whether your disks have issues at the hardware or driver/firmware level.
A separate article could be written to explain every field this command reports, which is beyond the scope of this one, but some fields deserve highlighting:
A. await: the average time, in milliseconds, to complete each I/O request, covering the scheduler, driver, controller, transport (for example, a fibre SAN), and the storage itself. It includes both the time requests spend in the scheduler queue and the time storage spends servicing them.
B. avgqu-sz: the average number of I/O requests queued in both the I/O scheduler queue and the storage controller queue.
C. svctm: the actual service time the storage/disk took to serve an I/O request, excluding all queue latencies.
D. %util: the percentage utilization of each disk.
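As an illustrative invocation (assuming the sysstat package, which provides iostat, is installed), an extended per-device report covering the fields above:

```shell
# -d: device report only, -x: extended stats (await, avgqu-sz, svctm, %util)
# 5 3: three samples at 5-second intervals -- discard the first sample,
# which shows averages since boot rather than current load.
if command -v iostat >/dev/null; then
    iostat -dx 5 3
else
    echo "iostat not found: install the sysstat package"
fi
```

A sustained %util near 100 together with a growing avgqu-sz suggests the workload exceeds the disk's capacity; a high await with a low avgqu-sz points more toward device or firmware trouble.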
- Needless to say, you should also check commands like top, vmstat, and mpstat to identify issues related to CPU, memory, and swapping.
- Last but most important, the “iotop” command shows a live stream of what is happening at the I/O layer. It gives you real-time details of which command, process, and user is actually clogging your disks.
Some general tuning tips:
- Selecting the right I/O scheduler is critical for latency-sensitive workloads. The “deadline” scheduler has proven to be the best choice for such use cases. Check which scheduler your I/O is being processed with, and correct it if needed:
# cat /sys/block/sd*/queue/scheduler
noop [deadline] cfq
- Choose the right mount options for your data disks. Options such as “noatime” save a great deal of I/O overhead on data disks and in turn improve their performance.
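As a hedged illustration, a data-disk entry in /etc/fstab with noatime might look like the following (the device name, mount point, and filesystem type here are assumptions; substitute your own):

```
# device     mount point  fs    options            dump fsck
/dev/sdb1    /data01      ext4  defaults,noatime   0    0
```

noatime stops the kernel from issuing a metadata write to update the access time on every file read, which matters on disks serving constant HDFS block reads.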
- Check which mode your CPU cores are running in. We recommend running them in performance mode. Virtual machines and cloud instances may not expose this file:
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- Flushing a large pool of accumulated “dirty pages” to disk has been seen to cause significant I/O overhead. Tune the following kernel parameters, which control this behavior. There is no single value suited to everyone, and trial and error is the best resort here; but on systems with a large amount of memory, keep these ratios smaller than their defaults so that you do not accumulate a huge pool of dirty pages in memory that must eventually be burst-synced to disks of limited capacity, degrading application performance.
vm.dirty_background_ratio (default is 10)
vm.dirty_ratio (default is 40)
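For example, on a node with plenty of RAM, an /etc/sysctl.conf fragment lowering both ratios might look like this (the exact values are illustrative assumptions; tune them for your hardware):

```
# Start asynchronous background writeback earlier (default 10)
vm.dirty_background_ratio = 5
# Force synchronous flushes (blocking writers) well before a huge
# dirty pool can build up (default 40)
vm.dirty_ratio = 10
```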
Network bandwidth across the nodes of an HDP cluster plays a critical role in heavy read/write use cases. It is all the more critical in distributed computing, because a single limping node can degrade the performance of the entire cluster. We are already in the world of gigabit networks, and things are generally stable on this front; however, we continue to see issues here. Messages such as the following in DataNode logs can be the trigger to investigate the network side:
WARN datanode.DataNode (BlockReceiver.java:receivePacket(571)) - Slow BlockReceiver write packet to mirror took 319ms (threshold=300ms)
Following are some of the tools and commands we can use to find out if something is wrong here:
- Use “iperf” to test network bandwidth between nodes.
- Use Linux commands like ping, ifconfig, and “netstat -s” to find out whether there are any significant packet drops or socket buffer overruns, and whether that number is increasing over time.
- The “ethtool ethX” command reports the negotiated network bandwidth.
- “ethtool -S ethX” collects NIC and driver statistics.
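A minimal sketch that pulls the relevant error counters out of “netstat -s” (the exact counter names vary slightly between kernel versions; run it twice a few minutes apart and compare the numbers):

```shell
# A steadily increasing count across repeated runs is the signal to
# care about, not a one-off non-zero value.
if command -v netstat >/dev/null; then
    netstat -s | grep -iE 'prune|collaps|overflow|drop|error'
else
    echo "netstat not found: install the net-tools package"
fi
```

“Pruned” and “collapsed” counters rising over time are the classic symptom of undersized socket receive buffers.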
- Generally, NIC-level receive acceleration does not work well with our use cases and proves to be a performance bottleneck in most scenarios. Disable any acceleration enabled on your NICs (after consultation with your platform teams, of course):
- Check if receive offloading is enabled:
$ ethtool -k eth0 | grep 'receive-offload' | grep ': on'
- Disable it using the following commands:
# ethtool -K eth0 gro off
# ethtool -K eth0 lro off
- Increase socket buffer sizes if you observe consistent overruns, prunes, or collapses of packets as explained above. Consult your network and platform teams on how to tweak these values.
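As an illustrative /etc/sysctl.conf fragment (the values below are common starting points for 10 GbE links, not recommendations; agree on final numbers with your network team):

```
# Maximum socket receive/send buffer sizes, in bytes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# TCP autotuning: min, default, and max buffer per socket, in bytes
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```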
A tuned kernel is a mandatory requirement for the workload you expect a node to process; however, kernel tuning is often ignored when such infrastructures are designed. Although this is a vast topic and beyond the scope of this article, I will mention some important kernel parameters related to memory management that must be tuned on HDP cluster nodes. These parameters live in /etc/sysctl.conf.
- vm.min_free_kbytes: the kernel tries to ensure that min_free_kbytes of memory is always available on the system, reclaiming memory to achieve this. Setting this parameter to about 2-5% of the total memory on the node helps ensure your applications do not suffer from prevailing memory fragmentation.
The first symptom of memory fragmentation is the appearance of messages such as “page allocation failure” in /var/log/messages or “dmesg” (the kernel ring buffer), or worse, the kernel killing processes to free up memory via the OOM killer.
- vm.swappiness: reflects the system's tendency to swap; the default value is 60. We do not want the system to swap at will, so keep this value at about 0-5 to keep the system's swap tendency minimal.
- Transparent huge pages (THP) have been seen to work poorly with the kind of workload we have. It is thus recommended to disable THP on our cluster nodes:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
- On modern NUMA systems, we strongly recommend disabling zone reclaim mode. The performance penalty incurred by reclaiming pages from within a zone is far worse than having the requested page served from another zone. Applications that depend heavily on the page cache generally prefer this parameter disabled.
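Putting the memory-management parameters above together, a sample /etc/sysctl.conf fragment for a node with 256 GB of RAM might look like this (the min_free_kbytes value is simply ~2% of 256 GB; adapt it to your nodes' memory):

```
# ~2% of 256 GB, expressed in kilobytes; scale for your node's RAM
vm.min_free_kbytes = 5242880
# Minimize the system's tendency to swap
vm.swappiness = 1
# Prefer serving pages from a remote NUMA zone over in-zone reclaim
vm.zone_reclaim_mode = 0
```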
All these kernel-level changes can be made on running systems by editing /etc/sysctl.conf and running the “sysctl -p” command to bring them into effect.
In the last part (part 5) of this article series, I will discuss Phoenix performance tuning in detail.