07-06-2017 02:33 AM - edited 07-06-2017 03:32 AM
I'm using CDH 5.9 (hbase-1.2.0+cdh5.9.0+205) and think I've found a heap leak when creating and deleting a table.
We're running an automated test that creates and deletes an hbase table every minute (and does other things too). We noticed that when running for a long time the regionserver java heap usage trends upwards at 190MB per day. The garbage collection never quite collects all the memory and the overall trend is up.
If I disable the part of the test that creates and deletes the table, and just use one that is already created, then the memory usage doesn't trend up any more.
If you look at the picture, you can see where I changed the test to stop the repeated table creation - where the up trend flattens out. And then the big drop off is where I restarted the regionserver.
The regionserver java heap contains an awful lot of counter objects:
1: 7770499 1056787864 org.apache.hadoop.hbase.util.Counter$Cell
2: 7770499 186491976 [Lorg.apache.hadoop.hbase.util.Counter$Cell;
3: 7770499 186491976 org.apache.hadoop.hbase.util.Counter$Container
4: 7770498 186491952 org.apache.hadoop.hbase.util.Counter
5: 7770498 186491952 org.apache.hadoop.hbase.util.Counter$1
6: 7814877 125038032 java.util.concurrent.atomic.AtomicBoolean
7: 7777721 124443536 java.util.concurrent.atomic.AtomicReference
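To put numbers on the listing above, here is a rough Python sketch that works out the per-instance size of each Counter-related class and the combined footprint of one Counter. It assumes the columns are rank, instance count, total bytes and class name (the layout `jmap -histo` prints), and that 1 MB means 10^6 bytes. The function names are made up for illustration, and the per-cycle figure is only a back-of-envelope estimate based on the 190MB per day trend and one create/delete cycle per minute.

```python
# Histogram rows copied from the post (Counter-related classes only).
HISTOGRAM = """\
1: 7770499 1056787864 org.apache.hadoop.hbase.util.Counter$Cell
2: 7770499 186491976 [Lorg.apache.hadoop.hbase.util.Counter$Cell;
3: 7770499 186491976 org.apache.hadoop.hbase.util.Counter$Container
4: 7770498 186491952 org.apache.hadoop.hbase.util.Counter
5: 7770498 186491952 org.apache.hadoop.hbase.util.Counter$1
"""


def parse_histogram(text):
    """Return (class_name, instances, total_bytes) for each well-formed row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip anything that is not rank/instances/bytes/class
        _rank, instances, total_bytes, class_name = parts
        rows.append((class_name, int(instances), int(total_bytes)))
    return rows


def bytes_per_counter(rows):
    """Combined bytes of all listed classes divided by the number of Counters."""
    total = sum(total_bytes for _name, _count, total_bytes in rows)
    counters = next(count for name, count, _bytes in rows
                    if name == "org.apache.hadoop.hbase.util.Counter")
    return total / counters


def counters_per_cycle(per_counter, growth_mb_per_day=190, cycles_per_day=24 * 60):
    """Estimate Counters retained per create/delete cycle (assumes 1 MB = 10**6 bytes)."""
    return growth_mb_per_day * 1_000_000 / cycles_per_day / per_counter


rows = parse_histogram(HISTOGRAM)
for name, count, total_bytes in rows:
    print(name, total_bytes // count, "bytes per instance")

per_counter = bytes_per_counter(rows)
print("bytes per Counter:", round(per_counter))
print("Counters retained per cycle (estimate):", round(counters_per_cycle(per_counter)))
```

By this reckoning each Counter$Cell takes 136 bytes and the other four classes 24 bytes each, so every retained Counter holds on to roughly 232 bytes, and the five classes together account for about 1.8GB of the heap.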
Any hbase experts that could check this one out and confirm?
07-06-2017 11:18 PM
07-07-2017 01:20 AM
Can you give some more detail on what your test is doing?
I have run the following ruby code in the hbase shell and didn't see a memory leak:
create 't0', 'f0'
This ended up creating and deleting a new table roughly every second for me. Also possible that I didn't run this long enough to see, but I expected to see some growth quickly given my churn rate.
07-07-2017 01:21 AM
Not sure what to say.
The graph does not show any issue on its own.
It is expected that memory usage increases over time and decreases only when the garbage collector runs (when needed, as per the configuration).
There would be an issue only if the memory increase leads to an OutOfMemory error or to too much CPU being consumed by the GC.
Are you experiencing that?
07-07-2017 01:35 AM
The eventual issue is that the regionserver shuts down because the GC takes too long and the zookeeper timeout (60 seconds) is triggered. This happens when the heap is almost 100% full at 4GB.
Is there a way to trigger the collection earlier? We're using the default CDH settings right now:
Java Configuration Options for HBase RegionServer:
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
07-07-2017 01:37 AM
I was using starbase and now the happybase python client:
It takes a good few hours to see the increase though.
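For anyone wanting to try something similar, one create/write/read/delete cycle with happybase might look roughly like the sketch below. This is not the actual test code from this thread, only an illustration: the function name `churn_once` and the host name in the usage comment are hypothetical, and the table, family, row and column names are borrowed from the shell examples in this thread. It relies on the `create_table`, `table`, `put`, `row` and `delete_table` calls from the happybase API.

```python
def churn_once(connection, table_name="t0", family="f0"):
    """Create a table, write and read back one cell, then disable and drop it.

    `connection` is expected to behave like a happybase.Connection.
    """
    # Column family with default options.
    connection.create_table(table_name, {family: dict()})
    table = connection.table(table_name)

    column = ("%s:c0" % family).encode()
    table.put(b"r0", {column: b"v0"})
    value = table.row(b"r0").get(column)

    # disable=True asks happybase to disable the table before deleting it.
    connection.delete_table(table_name, disable=True)
    return value


# Usage against a real cluster (needs the HBase Thrift server running):
#   import happybase
#   conn = happybase.Connection("thrift-host")  # hypothetical host name
#   churn_once(conn)
```

Calling this once a minute from a scheduler would give about the same churn rate as described above.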
07-07-2017 01:43 AM
OK, so you are experiencing a GC issue.
First I would investigate the "client" for any unclosed connections or the like that could explain the "leak".
Only then would I investigate the "server" side.
We have not run into that issue, but we are not using the same CDH version and our HBase configuration is more powerful (more memory).
Good luck!
07-10-2017 02:43 AM
I repeated the test with the hbase shell, running this every minute, and saw the same behaviour.
create 't0', 'f0'
put 't0', 'r0', 'f0:c0', 'v0'
get 't0', 'r0', 'f0:c0'