Created 04-29-2016 05:04 PM
I notice the following line in my region server logs:
2016-04-27 12:11:11,924 WARN [MemStoreFlusher.1] regionserver.CompactSplitThread: Total number of regions is approaching the upper limit 1000. Please consider taking a look at http://hbase.apache.org/book.html#ops.regionmgt
And also -
2016-04-27 16:31:47,799 INFO [regionserver54130] regionserver.HRegionServer: Waiting on 4007 regions to close
This is surprising because I do not have that much data. Given that the default value of hbase.hregion.max.filesize is 10G, this would imply about 40TB of data, which is more than the capacity of all my disks put together.
Does this mean that many empty regions are being created? If so, why? Is there any performance implication to carrying these empty regions around? One cost, presumably, is that they use up many file descriptors. Can I get rid of them?
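For reference, the 40TB figure in the question can be reproduced with a quick calculation. This is only a sketch of the arithmetic: it assumes every one of the 4007 regions is filled to the 10G limit, which is the worst case and not a measurement of actual data.

```python
# Back-of-the-envelope check of the estimate in the question.
GIB = 1024 ** 3
TIB = 1024 ** 4

region_count = 4007          # from the "Waiting on 4007 regions to close" log line
max_filesize = 10 * GIB      # default hbase.hregion.max.filesize, as stated in the post

# Upper bound only: assumes every region is completely full.
implied_bytes = region_count * max_filesize
implied_tib = implied_bytes / TIB

print(f"{implied_tib:.1f} TiB if every region were full")
```

The result is a ceiling on the data size, so a much smaller real footprint is entirely compatible with this region count.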
Created 04-29-2016 08:35 PM
Let's add that regions are not a fixed 10GB in size, since the default split policy is IncreasingToUpperBoundRegionSplitPolicy and not ConstantSizeRegionSplitPolicy (but you can set the latter by altering the table from the HBase shell, for example). This means that you can't estimate the HDFS size with simple math from the number of regions and the region size parameter.
Regardless of the policy being used, a 10GB region that has just split doesn't give you two 10GB regions.
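To illustrate why the split size is not constant under IncreasingToUpperBoundRegionSplitPolicy, here is a rough sketch of the threshold as I understand it for 0.98-era releases: the smaller of the max file size and twice the memstore flush size multiplied by the cube of the number of regions of that table on the region server. The function name, the 128MB flush size and the 10GB maximum are assumptions for illustration (older releases reportedly used a square instead of a cube), so check the source of your HBase version before relying on the exact numbers.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def split_threshold(regions_of_table_on_server,
                    flush_size=128 * MIB,      # assumed hbase.hregion.memstore.flush.size
                    max_filesize=10 * GIB):    # assumed hbase.hregion.max.filesize
    """Approximate store size at which a region becomes eligible to split."""
    n = regions_of_table_on_server
    # With no regions yet, or with many regions, the policy falls back to the upper bound.
    if n == 0 or n > 100:
        return max_filesize
    initial_size = 2 * flush_size
    return min(max_filesize, initial_size * n ** 3)

for n in (1, 2, 3, 4):
    print(n, "region(s):", split_threshold(n) // MIB, "MiB")
```

Under these assumptions, the first few splits of a table happen long before a region reaches 10GB, which is one reason a region count multiplied by 10GB overestimates the data size.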
Created 04-29-2016 06:26 PM
You can check how many regions you have from the HBase Master's web UI. A good rule of thumb is to keep the number of regions per region server under 1000. You can also inspect the start and end keys of regions and the region sizes from the Master web UI, or by going to a RegionServer's web UI and checking the various tabs.
HBase splits regions based on range boundaries of the keyspace. HBase ALWAYS does range-splitting, not hash-based splitting. This means that, depending on your key design, you may be temporarily hotspotting some parts of the keyspace, causing excessive region splits. It is likely that you have a time-series-based key design that you have to revisit. You can check out the HBase book, and there are also presentations available that talk about row key and schema design.
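The effect of range-based splitting on a time-series key can be sketched with a toy model. Everything here is hypothetical and for illustration only: the four region start keys, the timestamp-style row keys, and the one-character hash prefix used as a salt. It is not how HBase itself locates regions, only a picture of how sorted key ranges behave.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical table with four regions, identified by their (sorted) start keys.
region_start_keys = ["", "4", "8", "c"]

def region_for(row_key):
    """Index of the region whose [start key, next start key) range holds row_key."""
    return bisect.bisect_right(region_start_keys, row_key) - 1

def salted(row_key):
    """Prefix one hex character taken from a hash of the key (one possible salting scheme)."""
    return hashlib.md5(row_key.encode()).hexdigest()[0] + "-" + row_key

# Monotonically increasing, timestamp-like keys (made up for this example).
keys = [f"2016-04-27T12:{i // 60:02d}:{i % 60:02d}" for i in range(1000)]

plain_hits = Counter(region_for(k) for k in keys)
salted_hits = Counter(region_for(salted(k)) for k in keys)

print("unsalted:", dict(plain_hits))
print("salted:  ", dict(salted_hits))
```

In this model all unsalted keys fall into a single region, while the salted keys are spread over all four, which is the behavior a salted row key design is meant to achieve.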
Created 04-30-2016 07:00 AM
@Enis - I have salted row keys, so I am hopeful that the region servers are not hotspotting.
Created 04-30-2016 07:02 AM
@Laurent Edel - Thanks, I did not think about the fact that splitting does not always create two 10G regions.
I am using HBase 0.98. So, if I set ConstantSizeRegionSplitPolicy through the HBase shell, can I assume that the regions will always be 10G in size?
Created 05-02-2016 07:41 AM
@Sumit Nigam yes, well, it will split at 10G. That doesn't mean you will end up with 10G regions; it just means that regions will split when they reach 10G.
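A small simulation makes the point concrete. This is a deliberately simplified model, not HBase behavior in detail: it assumes writes are spread perfectly evenly across regions, that a region splits into two equal halves the moment it reaches the limit, and it ignores compaction, compression and real split points. The function name and parameters are made up for the example.

```python
def simulate_constant_size_splits(total_gib, max_gib=10.0, step_gib=1.0):
    """Return region sizes (GiB) after writing total_gib with a fixed split size."""
    regions = [0.0]
    written = 0.0
    while written < total_gib:
        # Spread each batch of writes evenly over the current regions.
        share = step_gib / len(regions)
        regions = [size + share for size in regions]
        written += step_gib
        # Any region that has reached the limit splits into two equal daughters.
        after_split = []
        for size in regions:
            if size >= max_gib:
                after_split.extend([size / 2, size / 2])
            else:
                after_split.append(size)
        regions = after_split
    return regions

sizes = simulate_constant_size_splits(100)
print(len(sizes), "regions, each", sizes[0], "GiB")
print("naive estimate:", len(sizes) * 10, "GiB; actual:", sum(sizes), "GiB")
```

In this model, regions end up somewhere between half the limit and the limit, so multiplying the region count by 10G gives an upper bound on the data size and not an estimate of it.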