Created on 01-08-2020 10:41 AM - last edited on 01-09-2020 01:23 AM by VidyaSargur
I am using HDP-2.6.0.3 with HBase 1.1.2.2.6.
Everything was working fine on the cluster, but one day, due to a connectivity issue within the cluster, some applications that communicate with the cluster started failing.
Application log:
05-01-20 16:18:28,497 W a.r.RemoteWatcher [flink-akka.actor.default-dispatcher-540] Detected unreachable: [akka.tcp://regionServer1.com:40970]
05-01-20 16:18:28,502 I o.a.f.r.c.JobSubmissionClientActor [flink-akka.actor.default-dispatcher-543] Lost connection to JobManager akka.tcp://flink@regionServer1.com:40970/user/jobmanager. Triggering connection timeout.
05-01-20 16:18:28,502 I o.a.f.r.c.JobSubmissionClientActor [flink-akka.actor.default-dispatcher-543] Disconnect from JobManager Actor[akka.tcp://flink@regionServer1.com:40970/user/jobmanager#491044547].
Region server log:
2020-01-05 16:22:56,764 DEBUG [regionserver/abcd.com/10.101.101.11:16020-SendThread(bcde.com:2181)] zookeeper.ClientCnxn: Got ping response for sessionid: 0x36c407bafefad06 after 0ms
2020-01-05 16:23:03,091 DEBUG [main-SendThread(bcde.com:2181)] zookeeper.ClientCnxn: Got ping response for sessionid: 0x36c407bafefad05 after 0ms
2020-01-05 16:23:03,973 ERROR [RS_OPEN_REGION-regionServer1:16020-38] handler.OpenRegionHandler: Failed open of region=table_699_20190801,201710_SP909_49157723046,1510286652884.9940c59ac9d10fce3c06070de8a56548., starting to roll back the global memstore size.
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for ringBufferSequence=15308059, WAL system stuck?
at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1581)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1575)
at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1715)
at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:97)
at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionOpenMarker(HRegion.java:1046)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6602)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6556)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6527)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6483)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6434)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-01-05 16:23:03,973 INFO [RS_OPEN_REGION-regionServer1:16020-38] coordination.ZkOpenRegionCoordination: Opening of region {ENCODED => 9940c59ac9d10fce3c06070de8a56548, NAME => 'table_699_20190801,201710_SP909_49157723046,1510286652884.9940c59ac9d10fce3c06070de8a56548.', STARTKEY => '201710_SP909_49157723046', ENDKEY => '201710_SP909_49177977765'} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version 1
And the HBase Master log:
2020-01-08 11:03:27,942 WARN [RegionOpenAndInitThread-table_2-10] ipc.Client: interrupted waiting to send rpc request to server
java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1094)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1398)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:818)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
2020-01-08 11:03:28,168 WARN [ProcedureExecutorThread-28] procedure.TruncateTableProcedure: Retriable error trying to truncate table=table_2 state=TRUNCATE_TABLE_CREATE_FS_LAYOUT
java.io.IOException: java.util.concurrent.ExecutionException: java.io.IOException: The specified region already exists on disk: hdfs://my-hdfs/apps/hbase/data/.tmp/data/default/table_2/ced821255f3f09a6f8e3d8d7335385e8
at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:186)
at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:141)
at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:118)
at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure$3.createHdfsRegions(CreateTableProcedure.java:361)
at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.createFsLayout(CreateTableProcedure.java:380)
at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.createFsLayout(CreateTableProcedure.java:354)
at org.apache.hadoop.hbase.master.procedure.TruncateTableProcedure.executeFromState(TruncateTableProcedure.java:113)
at org.apache.hadoop.hbase.master.procedure.TruncateTableProcedure.executeFromState(TruncateTableProcedure.java:47)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: The specified region already exists on disk: hdfs://my-hdfs/apps/hbase/data/.tmp/data/default/table_2/ced821255f3f09a6f8e3d8d7335385e8
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:180)
... 14 more
Caused by: java.io.IOException: The specified region already exists on disk: hdfs://my-hdfs/apps/hbase/data/.tmp/data/default/table_2/ced821255f3f09a6f8e3d8d7335385e8
at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createRegionOnFileSystem(HRegionFileSystem.java:900)
at org.apache.hadoop.hbase.regionserver.HRegion.createHRegion(HRegion.java:6364)
at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegion(ModifyRegionUtils.java:205)
at org.apache.hadoop.hbase.util.ModifyRegionUtils$1.call(ModifyRegionUtils.java:173)
at org.apache.hadoop.hbase.util.ModifyRegionUtils$1.call(ModifyRegionUtils.java:170)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Could anyone help us fix these issues on the HBase cluster so that the Master comes up and runs stably?
We are using a separate server for the HBase Master, 6 region servers, 3 ZooKeeper servers, and 2 Phoenix Query Servers.
Created 01-08-2020 10:18 PM
How many regions do you have per region server? Do you see regions in transition while you restart the service?
Check the /hbase/WALs directory (in HDFS); if you find WAL directories with a .splitting suffix, that is not a good sign. As a workaround you can increase the timeout values and the number of threads used for splitting WALs.
Alternatively, you can delete/move the splitting WAL files and restart HBase, but I don't recommend this on a production system.
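In case it helps, here is a rough command-line sketch of that check and of the sideline workaround. It assumes the HBase root dir is /apps/hbase/data (the hdfs://my-hdfs/apps/hbase/data paths in your Master log suggest that; check hbase.rootdir in hbase-site.xml if yours differs) and uses /tmp/wal-splitting-backup purely as an example destination.
# Run as the hbase user or an HDFS superuser.
# 1. Look for WAL directories left behind by an unfinished split; sorting on the
#    date/time columns (fields 6 and 7 of the -ls output) separates old leftovers
#    from directories touched during the current incident.
hdfs dfs -ls /apps/hbase/data/WALs | grep -- '-splitting' | sort -k6,7
# 2. Workaround only: sideline (do not delete) the splitting directories, then
#    restart HBase. Edits that were never flushed from these WALs can be lost,
#    so keep the copy until the cluster is stable again.
hdfs dfs -mkdir -p /tmp/wal-splitting-backup
hdfs dfs -mv '/apps/hbase/data/WALs/*-splitting' /tmp/wal-splitting-backup/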
Created on 01-09-2020 05:51 AM - edited 01-09-2020 05:52 AM
Hi Subhasis,
Each of the 6 region servers holds approximately 102-103 regions; the total count is 617.
There are .splitting files present at that location, but many old .splitting files from 2017 and 2018 are also present, from when HBase was running fine.
There are a lot of size- and timeout-related configuration properties in the Ambari GUI, and it is confusing to me which specific property to change.
The root cause of this has still not been found. Is there any workaround, if possible?
Created 01-10-2020 05:20 AM
hbase.splitlog.manager.timeout = 600000 (10 min)
hbase.splitlog.manager.unassigned.timeout = 600000 (10 min)
hbase.regionserver.wal.max.splitters = 5 to 10
hbase.regionserver.hlog.splitlog.writer.threads = 10
This is a workaround, not a setting recommended for a production system.
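If you prefer to script the change rather than click through the Ambari UI, here is a rough sketch. It assumes the configs.sh helper that ships with Ambari 2.x at /var/lib/ambari-server/resources/scripts/configs.sh (path and argument order may differ on your Ambari version, so verify first); AMBARI_HOST, CLUSTER_NAME and the admin credentials are placeholders for your own values. Setting the same keys under HBase > Configs (custom hbase-site) in the Ambari UI is equivalent. Restart HBase afterwards.
CFG=/var/lib/ambari-server/resources/scripts/configs.sh
# Values taken from the list above; tune them to your cluster.
$CFG -u admin -p admin set AMBARI_HOST CLUSTER_NAME hbase-site "hbase.splitlog.manager.timeout" "600000"
$CFG -u admin -p admin set AMBARI_HOST CLUSTER_NAME hbase-site "hbase.splitlog.manager.unassigned.timeout" "600000"
$CFG -u admin -p admin set AMBARI_HOST CLUSTER_NAME hbase-site "hbase.regionserver.wal.max.splitters" "10"
$CFG -u admin -p admin set AMBARI_HOST CLUSTER_NAME hbase-site "hbase.regionserver.hlog.splitlog.writer.threads" "10"
# After the HBase restart, confirm the values the services actually picked up:
grep -A1 -E 'splitlog|wal.max.splitters' /etc/hbase/conf/hbase-site.xml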