
HBase Master failed after startup

New Contributor

I am using HDP-2.6.0.3 with HBase 1.1.2.2.6.

Everything was working fine on the cluster, but one day, due to a connectivity issue within the cluster, some applications that communicate with the cluster started failing.

Application log:

05-01-20 16:18:28,497 W a.r.RemoteWatcher [flink-akka.actor.default-dispatcher-540] Detected unreachable: [akka.tcp://regionServer1.com:40970]
05-01-20 16:18:28,502 I o.a.f.r.c.JobSubmissionClientActor [flink-akka.actor.default-dispatcher-543] Lost connection to JobManager akka.tcp://flink@regionServer1.com:40970/user/jobmanager. Triggering connection timeout.
05-01-20 16:18:28,502 I o.a.f.r.c.JobSubmissionClientActor [flink-akka.actor.default-dispatcher-543] Disconnect from JobManager Actor[akka.tcp://flink@regionServer1.com:40970/user/jobmanager#491044547].

RegionServer log:
2020-01-05 16:22:56,764 DEBUG [regionserver/abcd.com/10.101.101.11:16020-SendThread(bcde.com:2181)] zookeeper.ClientCnxn: Got ping response for sessionid: 0x36c407bafefad06 after 0ms
2020-01-05 16:23:03,091 DEBUG [main-SendThread(bcde.com:2181)] zookeeper.ClientCnxn: Got ping response for sessionid: 0x36c407bafefad05 after 0ms
2020-01-05 16:23:03,973 ERROR [RS_OPEN_REGION-regionServer1:16020-38] handler.OpenRegionHandler: Failed open of region=table_699_20190801,201710_SP909_49157723046,1510286652884.9940c59ac9d10fce3c06070de8a56548., starting to roll back the global memstore size.
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for ringBufferSequence=15308059, WAL system stuck?
        at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1581)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1575)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1715)
        at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeRegionEventMarker(WALUtil.java:97)
        at org.apache.hadoop.hbase.regionserver.HRegion.writeRegionOpenMarker(HRegion.java:1046)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6602)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6556)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6527)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6483)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6434)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-01-05 16:23:03,973 INFO  [RS_OPEN_REGION-regionServer1:16020-38] coordination.ZkOpenRegionCoordination: Opening of region {ENCODED => 9940c59ac9d10fce3c06070de8a56548, NAME => 'table_699_20190801,201710_SP909_49157723046,1510286652884.9940c59ac9d10fce3c06070de8a56548.', STARTKEY => '201710_SP909_49157723046', ENDKEY => '201710_SP909_49177977765'} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version 1

 

 

And logs on the HBase Master:

2020-01-08 11:03:27,942 WARN  [RegionOpenAndInitThread-table_2-10] ipc.Client: interrupted waiting to send rpc request to server
java.lang.InterruptedException
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
        at java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1094)
        at org.apache.hadoop.ipc.Client.call(Client.java:1457)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:818)
        at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)


2020-01-08 11:03:28,168 WARN  [ProcedureExecutorThread-28] procedure.TruncateTableProcedure: Retriable error trying to truncate table=table_2 state=TRUNCATE_TABLE_CREATE_FS_LAYOUT
java.io.IOException: java.util.concurrent.ExecutionException: java.io.IOException: The specified region already exists on disk: hdfs://my-hdfs/apps/hbase/data/.tmp/data/default/table_2/ced821255f3f09a6f8e3d8d7335385e8
        at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:186)
        at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:141)
        at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:118)
        at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure$3.createHdfsRegions(CreateTableProcedure.java:361)
        at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.createFsLayout(CreateTableProcedure.java:380)
        at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.createFsLayout(CreateTableProcedure.java:354)
        at org.apache.hadoop.hbase.master.procedure.TruncateTableProcedure.executeFromState(TruncateTableProcedure.java:113)
        at org.apache.hadoop.hbase.master.procedure.TruncateTableProcedure.executeFromState(TruncateTableProcedure.java:47)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: The specified region already exists on disk: hdfs://my-hdfs/apps/hbase/data/.tmp/data/default/table_2/ced821255f3f09a6f8e3d8d7335385e8
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:180)
        ... 14 more
Caused by: java.io.IOException: The specified region already exists on disk: hdfs://my-hdfs/apps/hbase/data/.tmp/data/default/table_2/ced821255f3f09a6f8e3d8d7335385e8
        at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createRegionOnFileSystem(HRegionFileSystem.java:900)
        at org.apache.hadoop.hbase.regionserver.HRegion.createHRegion(HRegion.java:6364)
        at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegion(ModifyRegionUtils.java:205)
        at org.apache.hadoop.hbase.util.ModifyRegionUtils$1.call(ModifyRegionUtils.java:173)
        at org.apache.hadoop.hbase.util.ModifyRegionUtils$1.call(ModifyRegionUtils.java:170)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

 

 

Could anyone help us figure out how to fix these issues on the HBase cluster so that the Master starts up and runs stably?

We use a separate server for the HBase Master, plus 6 RegionServers, 3 ZooKeeper servers, and 2 Phoenix Query Servers.


Contributor

How many regions do you have per RegionServer? Do you see regions in transition while you restart the service?
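If it helps, both can be checked from the command line. This is only a rough sketch (run it as the HBase service user and adjust host names for your cluster):

# Regions in transition plus a per-RegionServer region listing.
# "status 'detailed'" should print an "N regionsInTransition" line
# followed by the regions hosted by each live server.
echo "status 'detailed'" | hbase shell

# hbck also reports inconsistencies and stuck regions (read-only when
# run without any -fix options).
hbase hbck

The Master web UI (normally http://<master-host>:16010/master-status) also has a Regions in Transition section.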

 

Check the /hbase/WALs directory (in HDFS); if you find WAL directories there with a .splitting suffix, that is not good. As a workaround you can increase the timeout value and the number of threads used for splitting WALs.

Alternatively, you can delete/move the .splitting WAL files and restart HBase, but I don't recommend this on a production system.
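For example (only a sketch: the WAL path depends on hbase.rootdir, which on your cluster looks like /apps/hbase/data judging by the Master log, and /tmp/hbase-wal-backup below is just a hypothetical holding directory):

# Look for leftover splitting WAL directories under the HBase root dir
hdfs dfs -ls /apps/hbase/data/WALs
hdfs dfs -ls /apps/hbase/data/WALs | grep -i splitting

# If you really need to move them aside (again, not recommended on a
# production system), park them outside the HBase root, then restart HBase:
hdfs dfs -mkdir -p /tmp/hbase-wal-backup
hdfs dfs -mv /apps/hbase/data/WALs/<splitting-dir> /tmp/hbase-wal-backup/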

New Contributor

Hi Subhasis,

 

There are approximately 102-103 regions on each of the 6 RegionServers; the total count is 617.

There are .splitting files present at that location, but many old .splitting files from 2017 and 2018, when HBase was running fine, are also present.

 

There are a lot of size- and timeout-related configuration properties in the Ambari GUI, and it is confusing to me which specific property to change.

 

The root cause of this has still not been found. Is there any workaround, if possible?

Contributor

Try these settings:

hbase.splitlog.manager.timeout: 600000 ms (10 min)
hbase.splitlog.manager.unassigned.timeout: 600000 ms (10 min)
hbase.regionserver.wal.max.splitters: 5 to 10
hbase.regionserver.hlog.splitlog.writer.threads: 10

This is a workaround, not a setting recommended for a production system.
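In Ambari these typically go under HBase > Configs (Custom hbase-site); the equivalent hbase-site.xml fragment would look roughly like this, with the values above as a starting point rather than tested recommendations:

<!-- Sketch only: add/adjust in hbase-site.xml (Custom hbase-site in Ambari) -->
<property>
  <name>hbase.splitlog.manager.timeout</name>
  <value>600000</value>
</property>
<property>
  <name>hbase.splitlog.manager.unassigned.timeout</name>
  <value>600000</value>
</property>
<property>
  <name>hbase.regionserver.wal.max.splitters</name>
  <value>10</value>
</property>
<property>
  <name>hbase.regionserver.hlog.splitlog.writer.threads</name>
  <value>10</value>
</property>

Restart HBase afterwards so the Master and RegionServers pick the new values up.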