HBase region servers giving org.apache.hadoop.hbase.RegionTooBusyException when inserting data

Explorer

Hello,

I have a daily data loading process where we insert data into HBase. Recently it started throwing 'org.apache.hadoop.hbase.RegionTooBusyException' errors. The job still progresses and I can see data in the table, but the errors make the daily load very slow.

2017-09-01 07:38:10,285 INFO [hconnection-0x338b180b-shared--pool1-t72] org.apache.hadoop.hbase.client.AsyncProcess: #2, table=tweetTable-2017-08, attempt=10/35 failed=1455ops, last exception: org.apache.hadoop.hbase.RegionTooBusyException: org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, regionName=tweetTable-2017-08,,1504119689670.257e69e222e3577c8b96ec34572f4aa8., server=moe-cn05,60020,1504207500995, memstoreSize=2222782950, blockingMemStoreSize=2147483648
at org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:3657)
at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2867)
at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2818)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:751)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:713)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2142)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
on moe-cn05,60020,1504207500995, tracking started null, retrying after=10054ms, replay=1455ops

 

Any help is really appreciated.

Thanks,

Chathuri

5 REPLIES

Mentor
It appears from your error that your insert rate is much higher than your flush rate. When you do regular mutates (Puts/Deletes) via the HBase client API, the data lands in the WAL and the MemStore. The error indicates that the MemStore for the targeted region has exceeded its blocking limit.

Usually, when the MemStore for a region nears its configured flush size (such as 256 MB), it triggers a flush to HDFS. Flushing ~256 MB should be quick enough that the MemStore is trimmed back down promptly. In your case, however, the flush is likely blocked (waiting in a queue, or waiting on HDFS I/O) or is taking a very long time.
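
For context, the blockingMemStoreSize in your error (2147483648 bytes = 2 GiB) is the region's configured flush size multiplied by the block multiplier. A rough hbase-site.xml sketch of the two knobs involved; the values below are just one combination that would yield a 2 GiB blocking size, not a recommendation:

<!-- hbase-site.xml (RegionServer), illustrative values only -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>536870912</value> <!-- 512 MB: flush a region's MemStore once it reaches this size -->
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value> <!-- block new writes once the MemStore reaches flush.size x multiplier (here 2 GiB) -->
</property>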

Some ideas:

  • Look in the RegionServer logs (moe-cn05, in this case) for "[Ff]lush"-related messages around the time of the issue (2017-09-01, ~07:00). If you see small flushes taking a long time to complete, the issue may be HDFS I/O (investigate NameNode response times, DataNode connectivity, and network and disk I/O).

  • If flushes are completing at regular intervals, it may be the flush request queue backing up (Cloudera Manager has an alert for this). Check that RegionServer's metrics to see how many flush requests were waiting in the queue at that point. Increasing the number of parallel flusher threads can help drain the queue faster (see the config sketch after this list).

  • If no flushes complete at all, it could be a bug or a hang caused by custom logic (for example, coprocessors). Take a jstack of the RegionServer (or visit /stacks on its Web UI) to see where the flusher threads are stuck, or whether they are waiting on a lock held by another hung thread.
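
On the flusher-thread point above, the RegionServer setting I have in mind is hbase.hstore.flusher.count (the number of parallel MemStore flush threads, default 2). A rough hbase-site.xml sketch with an illustrative value:

<!-- hbase-site.xml (RegionServer), illustrative value only -->
<property>
  <name>hbase.hstore.flusher.count</name>
  <value>4</value> <!-- more flusher threads can drain a backed-up flush queue faster, at the cost of more HDFS write load -->
</property>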

Explorer

Hi Harsh,

Thank you so much for the suggestions. I looked at the region server logs around that time but could not find any errors; there were, however, some warnings around 07:37.

2017-09-01 07:37:19,347 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region tweetTable-2017-08,,1504119689670.257e69e222e3577c8b96ec34572f4aa8. has too many store files; delaying flush up to 90000ms

 

2017-09-01 07:38:49,358 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Waited 90011ms on a compaction to clean up 'too many store files'; waited long enough... proceeding with flush of tweetTable-2017-08,,1504119689670.257e69e222e3577c8b96ec34572f4aa8.
2017-09-01 07:38:49,358 INFO org.apache.hadoop.hbase.regionserver.HRegion: Flushing 1/1 column families, memstore=2.07 GB
2017-09-01 07:39:25,296 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed, sequenceid=55001, memsize=2.1 G, hasBloomFilter=true, into tmp file hdfs://nameservice1/hbase/data/default/tweetTable-2017-08/257e69e222e3577c8b96ec34572f4aa8/.tmp/39d43e930f454644a34f2899ba7ec49e
2017-09-01 07:39:25,320 INFO org.apache.hadoop.hbase.regionserver.HStore: Added hdfs://nameservice1/hbase/data/default/tweetTable-2017-08/257e69e222e3577c8b96ec34572f4aa8/d/39d43e930f454644a34f2899ba7ec49e, entries=7607196, sequenceid=55001, filesize=133.8 M
2017-09-01 07:39:25,322 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~2.07 GB/2222782950, currentsize=0 B/0 for region tweetTable-2017-08,,1504119689670.257e69e222e3577c8b96ec34572f4aa8. in 35964ms, sequenceid=55001, compaction requested=true

 

But it seems the region server was eventually able to flush the memstore. I also checked the NameNode logs for the same period but could not find any warnings or errors there.
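
If I read the defaults correctly, that 90000 ms delay is hbase.hstore.blockingWaitTime kicking in because the region had hit hbase.hstore.blockingStoreFiles, so the flush was held back waiting for a compaction. A rough hbase-site.xml sketch of the two settings involved (the values shown are, I believe, the usual defaults, just for reference):

<!-- hbase-site.xml (RegionServer), values shown are the common defaults -->
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>10</value> <!-- above this many store files in a store, flushes are delayed -->
</property>
<property>
  <name>hbase.hstore.blockingWaitTime</name>
  <value>90000</value> <!-- how long (in ms) to delay the flush while waiting for a compaction -->
</property>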

Explorer

I made some configuration changes and that fixed the issue.

Thanks,

Chathuri

Mentor
For posterity, would you be willing to share what those config changes were?

In the spirit of https://xkcd.com/979/ 🙂

Explorer

I followed this article (http://gbif.blogspot.com/2012/07/optimizing-writes-in-hbase.html) and changed the parameters below (a rough hbase-site.xml equivalent is sketched after the list).

  • Maximum Number of HStoreFiles Compaction : 20
  • HStore Blocking Store Files : 200
  • HBase Memstore Block Multiplier : 4
  • HBase Memstore Flush Size : 256 
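
For anyone managing HBase outside Cloudera Manager, I believe these map roughly to the following hbase-site.xml properties (assuming the 256 above is MB):

<!-- hbase-site.xml, rough equivalents of the Cloudera Manager settings above -->
<property>
  <name>hbase.hstore.compaction.max</name>
  <value>20</value> <!-- Maximum Number of HStoreFiles Compaction -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>200</value> <!-- HStore Blocking Store Files -->
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value> <!-- HBase Memstore Block Multiplier -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- HBase Memstore Flush Size (256 MB) -->
</property>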

Hope this helps.

Thanks,

Chathuri