Member since: 10-05-2020
Posts: 5
Kudos Received: 0
Solutions: 0
10-25-2021
12:24 AM
Thanks @PrathapKumar for pointing out what to check. So far I can confirm that the DataNode logs contain none of the following:
- Slow BlockReceiver write data to disk cost
- Slow BlockReceiver write packet to mirror took
- Slow flushOrSync took / Slow manageWriterOsCache took
- Any other WARN/ERROR
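A check like this can be scripted so it is easy to repeat on every DataNode. The sketch below counts the four "Slow ..." messages named above in one pass. The function name `count_dn_slow` and the log path in the usage comment are hypothetical; adjust them to your installation.

```shell
# Count the HDFS DataNode "Slow ..." messages listed in the post.
# Reads log text from stdin, so it works with one file or several via cat.
count_dn_slow() {
  awk '
    /Slow BlockReceiver write data to disk cost/     { disk++ }
    /Slow BlockReceiver write packet to mirror took/ { mirror++ }
    /Slow flushOrSync took/                          { flush++ }
    /Slow manageWriterOsCache took/                  { oscache++ }
    END {
      # Unset awk variables print as 0 with %d, so empty input is fine.
      printf "write_data_to_disk=%d\n", disk
      printf "write_packet_to_mirror=%d\n", mirror
      printf "flushOrSync=%d\n", flush
      printf "manageWriterOsCache=%d\n", oscache
    }'
}

# Usage (hypothetical path, not taken from the post):
#   cat /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | count_dn_slow
```

All four counters at zero would match what is reported here.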
10-25-2021
12:21 AM
I can confirm that the `Large batch operation detected` WARN is not the cause of the spikes. The client producing that traffic was identified and disabled, but that didn't resolve the issue.
WARN [RpcServer.default.FPBQ.Fifo.handler=10,queue=10,port=16020] regionserver.RSRpcServices: Large batch operation detected (greater than 5000) (HBASE-18023). Requested Number of Rows: 12596 Client: svc-stats//ip first region in multi=table_name,\x09,1541077881948.9bcc8cee00ab92b2402730813923c2f6.
10-18-2021
08:21 AM
Hi @willx, thanks a lot for your questions!
> 1. Is it CDH or HDP, what is the version.
HDP 3.1.4.0-315
> 2. In regionserver logs is there “responseTooSlow” or “operationTooSlow” or any other WARN/ERROR messages. please provide log snippets.
Yes, the logs contain “responseTooSlow”; have a look at the example below. But these entries don't correlate with the spike times, and there are very few of them per day.
WARN [RpcServer.default.FPBQ.Fifo.handler=22,queue=3,port=16020] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1634529195627,"responsesize":2846904,"method":"Multi","param":"region= table_name,%,1539591382521.35818b60a3e8dba8d3d1fe0f0d02b292., for 13378 action(s) and 1st row key=&C>\\x15\\x86\\xE7k\\xA6\\xFD5\\ <TRUNCATED>","processingtimems":11644,"client":"ip:port","queuetimems":0,"class":"HRegionServer"}
There are no ERRORs. Other WARNs:
WARN [RpcServer.default.FPBQ.Fifo.handler=10,queue=10,port=16020] regionserver.RSRpcServices: Large batch operation detected (greater than 5000) (HBASE-18023). Requested Number of Rows: 12596 Client: svc-stats//ip first region in multi=table_name,\x09,1541077881948.9bcc8cee00ab92b2402730813923c2f6.
WARN [RpcServer.default.FPBQ.Fifo.handler=55,queue=17,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=3971335621, writePoint=3971335632}
WARN [Close-WAL-Writer-3012] asyncfs.FanOutOneBlockAsyncDFSOutputHelper: complete file /foo/WALs/host,port,1633080603058/host%2C16020%2C1633080603058.1634479683029 not finished, retry = 0
For half a day, the count of each WARN is:
grep WARN hbase-hbase-regionserver.log | grep "2021-10-18" | grep "responseTooSlow" | wc -l
13
grep WARN hbase-hbase-regionserver.log | grep "2021-10-18" | grep "Large batch operation detected" | wc -l
4194
grep WARN hbase-hbase-regionserver.log | grep "2021-10-18" | grep "MultiVersionConcurrencyControl" | wc -l
33
grep WARN hbase-hbase-regionserver.log | grep "2021-10-18" | grep "FanOutOneBlockAsyncDFSOutputHelper" | wc -l
4
> 3. How is the locality of the regions (check locality on hbase webUI, click on table, on right side there is a column shows each region locality.)
Locality is 100% on all RS.
> 4. How many regions deployed on each RegionServer.
I have 5 RS with 79 regions each. Each RS is allocated 16 GB of heap and 65 GB of off-heap memory. The Hadoop cluster is backed by SSDs.
> 5. Any warning / errors in RS log around the spike?
No errors. Only the WARNs I mentioned above, and I would say only `Large batch operation detected (greater than 5000)` shows up a lot.
> 6. Is any job trying to scan every 10 min? Which table contribute most I/O? Is there any hotspot.
No cron jobs.
> 7. is HDFS healthy? check DN logs, is there any slow messages around the spike? Refer to https://my.cloudera.com/knowledge/Diagnosing-Errors-Error-Slow-ReadProcessor-Error-Slow?id=73443
Unfortunately I don't have access to the link. There are no WARN/ERROR messages on the DNs. HDFS looks healthy; the cluster serves plenty of requests with very low latency (< 10 ms).
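The four grep pipelines in this reply can be collapsed into a single pass over the log. The sketch below applies the same three filters (WARN, the date, then the message text). The function name `count_rs_warns` and the output labels are hypothetical.

```shell
# Count the four RegionServer WARN categories for one day in a single pass.
# $1 is the date string to match (e.g. 2021-10-18); log text comes from stdin.
count_rs_warns() {
  awk -v day="$1" '
    # Same filter chain as: grep WARN | grep "$day" | grep "<message>"
    index($0, "WARN") && index($0, day) {
      if (index($0, "responseTooSlow"))                    slow++
      if (index($0, "Large batch operation detected"))     batch++
      if (index($0, "MultiVersionConcurrencyControl"))     mvcc++
      if (index($0, "FanOutOneBlockAsyncDFSOutputHelper")) wal++
    }
    END {
      printf "responseTooSlow=%d\n", slow
      printf "LargeBatch=%d\n", batch
      printf "MVCC=%d\n", mvcc
      printf "FanOutWAL=%d\n", wal
    }'
}

# Usage, with the log file name used in the post:
#   count_rs_warns 2021-10-18 < hbase-hbase-regionserver.log
```

Reading the file once matters mostly for large logs; the counts should match the separate grep runs.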
10-12-2021
01:01 PM
Hi there, I have an issue on my HBase cluster.
HBase version: 2.0.2.3.1.4.0-315
There are latency spikes every 10 minutes on all HBase operations, most visible on reads. Please have a look at the first graph below. The metric for the graph is `hbase_table_latency_gettime_max`.
I also see spikes every 10 minutes on `hbase_regionserver_ipc_queuecalltime`; please have a look at the graph below:
What I've checked so far:
- It doesn't look like GC, as GC activity doesn't correlate with the spike times.
- It is not major compaction. I see spikes with and without it.
- It is not replication. I ran a test with and without replication.
- I see nothing suspicious in the logs, or at least nothing that caught my attention (DEBUG and TRACE levels were enabled).
- Memstore flushes happen every hour.
- The number of active handlers looks good to me; it is set according to recommendations.
- Scans of meta happen every 5 minutes (please have a look at the graph below).
- Scans of namespace happen every 10 minutes, slightly before the spikes (please have a look at the graph below).
Could you help me and maybe share some ideas about what else I could check? I would much appreciate it.
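One more way to test whether a given log message follows the 10-minute cycle is to histogram its timestamps by minute modulo 10: a message tied to the spikes should pile up in one or two buckets, while an unrelated one should spread evenly. This is a minimal sketch that assumes each log line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp (an assumption about the log layout, not confirmed in this thread). The function name `warn_phase_histogram` is hypothetical.

```shell
# Histogram of matching log lines by (minute of hour) modulo 10.
# $1 is a fixed string to look for; log text comes from stdin.
# Assumes the second whitespace-separated field is the time, HH:MM:SS,mmm.
warn_phase_histogram() {
  awk -v pat="$1" '
    index($0, pat) && $2 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
      minute = substr($2, 4, 2) + 0   # MM part of HH:MM:SS
      hist[minute % 10]++
    }
    END {
      # Print all ten buckets, including empty ones, as "phase count".
      for (p = 0; p < 10; p++) printf "%d %d\n", p, hist[p]
    }'
}

# Usage, with the log file name used elsewhere in this thread:
#   warn_phase_histogram "Large batch operation detected" < hbase-hbase-regionserver.log
```

Lines without a leading timestamp (for example stack trace continuations) are skipped rather than counted.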
Labels: Apache HBase
10-05-2020
06:54 AM
Hi there, I have an issue with disabling region replication for a table.
HBase version: 2.0.2.3.1.4.0-315
I had region replication enabled on a single table (REGION_REPLICATION => 2), which doubled the number of regions of the table. I kept it for a while and then changed region replication to 1 (via alter 'table', 'cf', {REGION_REPLICATION => '1'}). I expected the number of regions to go down, but that didn't happen.
What I tried:
- run a major compaction
- restart the region servers
- disable region_replica_replication
Does anyone know the procedure for disabling region replication and how to get rid of the secondary regions?
Labels: Apache HBase