Having covered the important HBase tuning parameters in part 1 and part 2 of this article series, this article focuses on the various areas that should be investigated when handling any HBase performance issue.

Locality

By locality we mean that the physical HDFS blocks backing HBase HFiles should be local to the region server node where the respective region is online. Locality is important because HBase prefers short-circuit reads, which read the block files directly from local disk and bypass the DataNode; if a region's HFiles are not local to it, reads pay an extra latency cost because the data has to be fetched from other nodes over the network.

One can monitor the locality of a table or region from the HBase Master UI, specifically on the table's page (click the listed table name). A value of "1" in the locality column means 100% block locality. Overall HBase locality is visible, as a percentage, in the HBase section of Ambari metrics.
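
If you prefer to check this from a script rather than the UI, the same numbers can be read through the client API. Below is a minimal sketch assuming the HBase 2.x Java client (Admin#getClusterMetrics, Admin#getRegionMetrics and RegionMetrics#getDataLocality); older releases expose the same information through ClusterStatus and RegionLoad.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class LocalityReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Walk every live region server and print the data locality of each region
            // it hosts (1.0 means all HFile blocks are local to that server).
            for (ServerName server : admin.getClusterMetrics().getLiveServerMetrics().keySet()) {
                for (RegionMetrics region : admin.getRegionMetrics(server)) {
                    System.out.printf("%s on %s locality=%.2f%n",
                        Bytes.toStringBinary(region.getRegionName()),
                        server.getServerName(),
                        region.getDataLocality());
                }
            }
        }
    }
}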

Here, major compaction helps: it rewrites all HFiles of a region locally on the region server hosting that region, thus restoring locality to a great extent.

Locality generally gets disturbed either by the HDFS balancer, which moves blocks around to even out disk usage across DataNodes, or by the HBase balancer, which moves regions across region server nodes to even out the number of regions on each server.

The HBase balancer (the Stochastic Load Balancer by default) can be tuned by adjusting the weights of the various costs it takes into account (region load, table load, data locality, MemStore size, store file size) so that it runs according to our requirements. For example, to make the balancer weigh locality above everything else, add the following parameter to the HBase configuration and give it a value higher than its default of 25 (this is an advanced parameter; an expert must be consulted before such a change):

hbase.master.balancer.stochastic.localityCost

To overcome the locality harm done by the HDFS balancer, we have no solution as of today except running a major compaction immediately after the HDFS balancer has run. There are some unfinished JIRAs which, once implemented, will bring in features such as block pinning and favored nodes; once they are available, HBase will be able to configure its favored nodes, writes will be dedicated to those nodes only, and the HDFS balancer will not be able to touch the corresponding blocks. (Refer to HBASE-15531 to see all the unfinished work on this feature.)
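
Where an immediate fix is needed, that compaction can be triggered from the HBase shell (major_compact 'tablename') or through the Java client. A minimal sketch with the HBase 2.x API; the table name "usertable" is only a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RestoreLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Queue a major compaction for the whole table; it runs asynchronously on the
            // region servers and rewrites each region's HFiles locally.
            admin.majorCompact(TableName.valueOf("usertable")); // placeholder table name
        }
    }
}

The call only queues the compaction; progress can be watched on the region server UI or via admin.getCompactionState(TableName).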

Hotspotting

Hotspotting has been discussed quite a lot already, but it is important to mention here because it is a crucial aspect to investigate during performance issues. Hotspotting occurs when all of your write traffic hits one particular region server. This usually happens because the row key design is sequential in nature, so all writes land on the single node that has the hot region online.

We can get around this problem in three ways (at least, I know of three):

  1. Use random keys - not an ideal solution, as it defeats range scans with start and stop keys.
  2. Use salt buckets - if you have Phoenix tables on top of your HBase tables, use this feature. Read more about it here.
  3. Use pre-splitting - if you know the start and end keys of your sequential key space, you can pre-split the table by supplying the split points at creation time. This distributes the (initially empty) regions across the nodes, so writes for a particular key land on the responsible node and the write traffic is eventually spread across the cluster (see the sketch after this list). Read more about it here.
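
Here is a minimal sketch of option 3 with the HBase 2.x Java client; the table name, column family and split points are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Split points chosen from the known key range; four empty regions are
            // created up front: before "1000", [1000,2000), [2000,3000), and after "3000".
            byte[][] splitKeys = new byte[][] {
                Bytes.toBytes("1000"),
                Bytes.toBytes("2000"),
                Bytes.toBytes("3000")
            };
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("events")) // placeholder names
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build(),
                splitKeys);
        }
    }
}

For option 2, Phoenix does the equivalent automatically when the table is declared with the SALT_BUCKETS property, which prepends a hash-derived byte to every row key.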

 

HDFS

The HDFS layer is very important: no matter how well optimized your HBase is, you will not get the expected performance if the DataNodes are not responding as expected. Whenever you see latencies on HBase / Phoenix queries, check whether the region server logs contain a large number of messages like this:

2018-03-29 13:49:20,903 INFO [regionserver/example.com/10.9.2.35:16020] wal.FSHLog: Slow sync cost: 18546 ms, current pipeline: [xyz] 

OR

the following messages in the DataNode logs at the time the queries run:

2018-04-12 08:22:51,910 WARN datanode.DataNode (BlockReceiver.java:receivePacket(571)) - Slow BlockReceiver write packet to mirror took 34548ms (threshold=300ms)

OR

2018-04-12 09:20:57,423 WARN datanode.DataNode (BlockReceiver.java:receivePacket(703)) - Slow BlockReceiver write data to disk cost:3440 ms (threshold=300ms) 

If you see such messages in your logs, it is time to investigate from the HDFS side: check whether you have sufficient DataNode transfer threads, heap and file descriptors, and check the logs further for GC or non-GC pauses. Once the HDFS side is confirmed to be healthy, we must also look at the underlying infrastructure (network, disk, OS), because these messages mostly mean that HDFS is having a hard time receiving or transferring blocks from or to another node, or syncing data to disk. We will discuss the system side of things in part 4 of this article series.
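
As a small aid while checking the HDFS side, the sketch below reads two of the DataNode settings mentioned above straight from the cluster configuration. It assumes the Hadoop client libraries are on the classpath and that hdfs-site.xml lives under /etc/hadoop/conf; adjust the path and defaults for your own environment:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class DataNodeSettingsCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumed location of the cluster's HDFS configuration.
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        // Number of threads a DataNode uses to serve block transfers.
        System.out.println("dfs.datanode.max.transfer.threads = "
            + conf.getInt("dfs.datanode.max.transfer.threads", 4096));
        // Number of RPC handler threads on each DataNode.
        System.out.println("dfs.datanode.handler.count = "
            + conf.getInt("dfs.datanode.handler.count", 10));
    }
}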

BlockCache Utilization and hitRatio

When investigating performance issues for read traffic, it is worth checking how much your BlockCache and BucketCache are helping and whether they are being utilized at all. Both caches log their statistics periodically in the region server log. In the samples below, the caches are barely filled (their used size is tiny compared to the maximum) and almost all cached reads are hits; the signal to look for is the opposite pattern of a full cache, a low hit ratio and frequent evictions, which means the cache is too small for the working set:

2017-06-12 19:01:48,453 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=33.67 MB, freeSize=31.33 GB, max=31.36 GB, blockCount=7, accesses=4626646, hits=4626646, hitRatio=100.00%, , cachingAccesses=4625995, cachingHits=4625995, cachingHitsRatio=100.00%, evictions=24749, evicted=662, evictedPerRun=0.026748554781079292 
2017-06-12 19:02:07,429 INFO [BucketCacheStatsExecutor] bucket.BucketCache: failedBlockAdditions=0, totalSize=46.00 GB, freeSize=45.90 GB, usedSize=106.77 MB, cacheSize=93.21 MB, accesses=9018587, hits=4350242, IOhitsPerSecond=2, IOTimePerHit=0.03, hitRatio=48.24%, cachingAccesses=4354489, cachingHits=4350242, cachingHitsRatio=99.90%, evictions=0, evicted=234, evictedPerRun=Infinity 

 

Flush Queue / Compaction queue

During crisis hours, when you are facing severe write latencies, it is very important to check what the MemStore flush queue and the compaction queue look like. Let us discuss a couple of scenarios and some possible remedies (expert consultation is advised before applying them).

A. Flush queue not reducing: this leads us to three possibilities:

A.1 Flushes have been suspended for some reason. One such reason is the "too many store files" condition, reported further down in the region server logs and dictated by hbase.hstore.blockingStoreFiles. Check my part 2 article for more about this parameter and how to tune it. Simply put, it temporarily blocks flushing until minor compaction has caught up with the existing HFiles. Increasing this number a few fold during heavy write load should help.

Here we can also help minor compaction by assigning it more threads, so that it finishes compacting these files faster:

hbase.regionserver.thread.compaction.small (default value is 1; we can raise it to, say, 3)

A.2 Another possibility is that the flush operation itself is slow and cannot keep up with the write traffic. We can help the flusher by allocating a few more handler threads:

hbase.hstore.flusher.count (default value is 2; we can bump it to, say, 4)

A.3 A third possibility seen in such cases is a "flush storm". This behavior is triggered when the number of write-ahead log (WAL) files reaches its configured limit (hbase.regionserver.maxlogs) and the region server is forced to flush memstores until enough WAL files are archived and there is room to resume write operations. You would see messages like:

2017-09-23 17:43:49,356 INFO[regionserver//10.22.100.5:16020.logRoller] wal.FSHLog:Too many wals: logs=35, maxlogs=32; forcing flush of 20 regions(s): d4kjnfnkf34335666d03cb1f

Such behavior can be controlled by bumping up:

hbase.regionserver.maxlogs (default value is 32; double or triple it if you know you have a heavy write load)

B. Compaction queue growing: compaction_queue=0:30 means 0 major compactions and 30 minor compactions waiting in the queue. Please note that compaction, whether minor or major, is additional IO overhead on the system, so whenever you try to fix a performance problem by making compactions faster or by packing more HFiles into a single compaction run, remember that this medicine has side effects of its own. Nevertheless, we can make compaction more efficient by tuning the following parameters (a combined example follows the list):

hbase.hstore.compactionThreshold - lower bound on the number of files in any minor compaction (default: 3)

hbase.hstore.compaction.max - upper bound on the number of files in any minor compaction (default: 10)

hbase.regionserver.thread.compaction.small - number of threads handling minor compactions (default: 1)

hbase.regionserver.thread.compaction.large - number of threads handling major compactions (default: 1)
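
All of the knobs discussed in sections A and B are server-side settings that normally go into hbase-site.xml (or are changed through Ambari) and take effect after a region server restart. The sketch below only illustrates the property names and some example values using the Hadoop Configuration API; the numbers are starting points for experimentation, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WritePathTuningExample {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Flush path (section A): tolerate more store files before flushes are blocked,
        // give the flusher and the small-compaction pool more threads, and allow more
        // WAL files before a forced "flush storm" kicks in.
        conf.setInt("hbase.hstore.blockingStoreFiles", 30);            // example value, see part 2
        conf.setInt("hbase.hstore.flusher.count", 4);                  // default 2
        conf.setInt("hbase.regionserver.maxlogs", 64);                 // default 32
        conf.setInt("hbase.regionserver.thread.compaction.small", 3);  // default 1

        // Compaction path (section B): how many HFiles a minor compaction picks up
        // and how many threads serve major compactions.
        conf.setInt("hbase.hstore.compactionThreshold", 4);            // default 3
        conf.setInt("hbase.hstore.compaction.max", 12);                // default 10
        conf.setInt("hbase.regionserver.thread.compaction.large", 2);  // default 1

        // Print what was set, as a sanity check.
        String[] keys = {
            "hbase.hstore.blockingStoreFiles", "hbase.hstore.flusher.count",
            "hbase.regionserver.maxlogs", "hbase.regionserver.thread.compaction.small",
            "hbase.hstore.compactionThreshold", "hbase.hstore.compaction.max",
            "hbase.regionserver.thread.compaction.large"
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}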

 

JVM metrics

Various performance issues drill down to JVM-level issues, most specifically garbage collection stop-the-world (STW) pauses, which degrade application / region server performance or sometimes bring the process to a halt or crash.

There are several situations that can lead to long GC pauses and I will not be able to cover them all here. But the least you can do is have your region server's gc.log file analyzed by one of the many online GC log analysis tools; such a tool will help you work out what is going wrong in your JVM and in its garbage collection behavior specifically.

Also, if you have a huge number of regions on every region server (500+), consider moving to the G1 GC algorithm. Even though Hortonworks does not officially support it yet, we are in the process of doing so, and many of our customers have implemented it successfully.

In your spare time, also go through this GC tuning presentation.

Without going into too many details, I would like to mention a few basic rules of thumb:

  1. With the CMS GC algorithm, never set the region server heap to more than about 36-38 GB; if you need more than that, switch to G1GC.
  2. The young generation should never be more than 1/8th to 1/10th of the total region server heap.
  3. Start your GC-pause tuning by adjusting -XX:ParallelGCThreads, which is 8 by default; you can safely take it up to 16 (but keep an eye on the number of CPU cores you have).
  4. Check whether "user", "sys" or "real" time contributed to the GC pause:
2017-10-11T14:06:17.492+0530: 646872.162: [GC [1 CMS-initial-mark: 17454832K(29458432K)] 20202988K(32871808K), 74.6880980 secs] [Times: user=37.61 sys=1.96, real=74.67 secs] 

'real' time is the total elapsed time of the GC event, i.e. the wall-clock time.

'user' time is the CPU time spent in user mode (outside the kernel).

'sys' time is the CPU time spent in the kernel, i.e. in system calls.

Thus in the scenario above, "sys" time is very small, yet "real" time (74.67 s) is roughly double the "user" time (37.61 s). With several parallel GC threads running, "user" time would normally be a multiple of "real" time, so this pattern indicates that the GC did not get the CPU cycles it needed, pointing to a system heavily clogged on the CPU side.

There is another category of pause which we see regularly:

2017-10-20 14:03:42,327 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9056ms 
No GCs detected 

This category of pause indicates that the JVM was frozen for about 9 seconds without any GC event. It usually points to a problem on the physical machine side: an issue with memory, CPU or the OS that causes the whole JVM to freeze momentarily.

In part 4 of this series, I will cover some infrastructure level investigation of performance issues.

 

Also see: PART 1, PART 2, PART 4, PART 5

 
