HBase creating empty regions
Created 02-03-2016 09:27 PM
Our table has one column family with a TTL of 15 days, so rows age out at a consistent rate. We are seeing the number of regions keep going up; somehow the regions are not getting reused. We currently have over 41k regions, of which 19k are empty. Any pointers as to why regions are not being reused? Our row key design is similar to what you mentioned: a 2-digit salt (0 to 63), followed by the hashcode of the reverse timestamp in seconds, followed by a service id, and finally a counter.
Our cluster has 41 nodes and we are writing rows at a rate of 4k to 6k TPS. The average row size is about 35 KB.
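For illustration, a table along these lines could be created in the HBase shell roughly as follows (the 'events' table and 'd' column family names are placeholders, not our real names):

```
# Illustrative sketch only: 'events' and column family 'd' are placeholder names.
# TTL of 15 days = 1,296,000 seconds; one split point per 2-digit salt bucket (00-63).
create 'events', {NAME => 'd', TTL => 1296000},
  SPLITS => (1..63).map { |i| '%02d' % i }
```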
Created 02-03-2016 09:42 PM
HBase regions are defined by a start and end rowkey. Only data that falls within that range is stored in that region.
I would guess that you are constantly inserting new data into the beginning of each "bucket" (defined by your 2 salt characters). Eventually, a region will get so large that it splits into two daughters: one contains the "top" half of the data and the other the "bottom". After that, all of the new data is inserted into the "upper" daughter region, and the process repeats. (The same reasoning applies if you're appending data to the end of the "buckets" instead of inserting at the head.)
You will likely want to use the `merge_region` HBase shell command to merge away some of the older regions. You could also set the split threshold very high to prevent any future splits, but this would likely have a negative impact on performance (or you would need to drastically increase the number of buckets, i.e. more salt digits, to spread the load out evenly).
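For example, roughly (the encoded region names below are placeholders; pull real ones from the master web UI or from scanning 'hbase:meta'):

```
# Merge two adjacent regions of the table by their encoded names (placeholders here).
merge_region 'd1f8a3b2c4e5f6a7b8c9d0e1f2a3b4c5', 'a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4'
# To raise the split threshold instead, increase hbase.hregion.max.filesize
# cluster-wide, or the table-level MAX_FILESIZE attribute.
```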
Created 02-03-2016 10:14 PM
Merging 19k regions manually is a PITA @Josh Elser
Created 02-03-2016 10:25 PM
There is a new feature called the "region normalizer" that runs inside the master and can do the merges automatically. It is available (but turned off by default) in HDP 2.3+. However, it is a very new feature, so I would say it is still in tech-preview mode.
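If your build includes it, the related shell commands look roughly like this (availability depends on the exact HBase version shipped in your HDP release):

```
# Region normalizer controls; these commands may not exist in older HBase shells.
normalizer_switch true    # enable the region normalizer in the master
normalizer_enabled        # check whether it is currently enabled
normalize                 # ask the master to run a normalization pass now
```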
Created 02-03-2016 10:26 PM
Yep looking forward to trying it @Enis
Created 02-04-2016 06:58 PM
Would there be a negative impact if the salt was, say, 5 digits?
How does increasing the split file size from 10 GB to 40 GB, or even 100 GB, affect performance?
If you have 12 disks (4 TB each) per node in a 40-node cluster and you want to store 400 TB, would you say (400*1024)/(40*10) ≈ 1024 regions per node with 10 GB files is better, or ~103 regions per node with 100 GB files?
Created 02-04-2016 08:02 PM
You should aim for regions of at most 10-20 GB; anything above that causes compactions to increase the write amplification. You should also keep fewer than 1000 regions per server.
I suggest looking into your data model and making the necessary adjustments to make sure you are not doing the anti-pattern of incremental writes. See https://hbase.apache.org/book.html#table_schema_rules_of_thumb.
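As a rough sketch of how those numbers work out, plus a table-level way to cap region size (the 'events' table name is a placeholder):

```
# Back-of-the-envelope for the numbers above (illustrative):
#   400 TB = 409,600 GB; at 20 GB per region  => ~20,480 regions total
#   20,480 regions / 40 region servers        => ~512 regions per server (under ~1000)
# Table-level cap of 20 GB per region:
alter 'events', MAX_FILESIZE => '21474836480'
status 'simple'   # per-regionserver load, including the number of online regions
```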
