Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hbase : All region servers down and hbase connectivity lost

Hbase : All region servers down and hbase connectivity lost

Explorer

Hi.

 

We have been facing this issue many times till now. All region servers go down and then Hbase master when try to do log splitting it throws out error and entire hbase conenctivity is lost until we restart the cluster.Did anyone face this kind of issue ,it would be great if we know the root cause of why servers are going down and why Hbase master is not able to make the region servers active through log splitting? 

 

2017-07-11 00:03:32,380 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://nameservice1/hbase/WALs/hostnamedn01.com,60020,1438807980302-splitting] installed = 1 but only 0 done
2017-07-11 00:03:32,381 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for hostnamedn01.com,60020,1438807980302, will retry
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:346)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:219)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://nameservice1/hbase/WALs/hostnamedn01.com,60020,1438807980302-splitting] Task = installed = 1 done = 0 error = 1
at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:291)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:212)
... 4 more
2017-07-11 00:03:32,385 INFO org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fhostnamedn05.com%2C60020%2C1438807980454-splitting%2Fhostnamedn05.com%252C60020%252C1438807980454.1438865586074 acquired by hostnamedn05.com,60020,1498483521347
2017-07-11 00:03:32,477 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for hostnamedn01.com,60020,1438807980302 before assignment; region count=0
2017-07-11 00:03:32,481 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [hostnamedn01.com,60020,1438807980302]
2017-07-11 00:03:32,482 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://nameservice1/hbase/WALs/hostnamedn01.com,60020,1438807980302-splitting] for [hostnamedn01.com,60020,1438807980302]
2017-07-11 00:03:37,505 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 4 unassigned = 2 tasks={/hbase/splitWAL/WALs%2Fhostnamedn03.com%2C60020%2C1438807980381-splitting%2Fhostnamedn03.com%252C60020%252C1438807980381.1438865589887=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn02.com%2C60020%2C1438807980385-splitting%2Fhostnamedn02.com%252C60020%252C1438807980385.1438865587122=last_update = 1499749403398 last_version = 30 cur_worker_name = hostnamedn05.com,60020,1498483521347 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn05.com%2C60020%2C1438807980454-splitting%2Fhostnamedn05.com%252C60020%252C1438807980454.1438865586074=last_update = 1499749412430 last_version = 2 cur_worker_name = hostnamedn05.com,60020,1498483521347 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn01.com%2C60020%2C1438807980302-splitting%2Fhostnamedn01.com%252C60020%252C1438807980302.1438865586081=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2017-07-11 00:03:43,505 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 4 unassigned = 2 tasks={/hbase/splitWAL/WALs%2Fhostnamedn03.com%2C60020%2C1438807980381-splitting%2Fhostnamedn03.com%252C60020%252C1438807980381.1438865589887=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn02.com%2C60020%2C1438807980385-splitting%2Fhostnamedn02.com%252C60020%252C1438807980385.1438865587122=last_update = 1499749403398 last_version = 30 cur_worker_name = hostnamedn05.com,60020,1498483521347 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn05.com%2C60020%2C1438807980454-splitting%2Fhostnamedn05.com%252C60020%252C1438807980454.1438865586074=last_update = 1499749412430 last_version = 2 cur_worker_name = hostnamedn05.com,60020,1498483521347 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn01.com%2C60020%2C1438807980302-splitting%2Fhostnamedn01.com%252C60020%252C1438807980302.1438865586081=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2017-07-11 00:03:49,411 INFO org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fhostnamedn02.com%2C60020%2C1438807980385-splitting%2Fhostnamedn02.com%252C60020%252C1438807980385.1438865587122 entered state: ERR hostnamedn05.com,60020,1498483521347
2017-07-11 00:03:49,412 WARN org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fhostnamedn02.com%2C60020%2C1438807980385-splitting%2Fhostnamedn02.com%252C60020%252C1438807980385.1438865587122
2017-07-11 00:03:49,412 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://nameservice1/hbase/WALs/hostnamedn02.com,60020,1438807980385-splitting] installed = 1 but only 0 done
2017-07-11 00:03:49,413 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for hostnamedn02.com,60020,1438807980385, will retry
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:346)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:219)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://nameservice1/hbase/WALs/hostnamedn02.com,60020,1438807980385-splitting] Task = installed = 1 done = 0 error = 1
at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:291)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:212)

2 REPLIES 2

Re: Hbase : All region servers down and hbase connectivity lost

Master Guru
HBase is required to perform log split if an RS goes down uncleanly.

On why your RSs went down uncleanly, you'd need to check for FATAL messages in your independent RS logs, as the reason is not in the Master log snippet posted above. The dead server appears to have been hostnamedn02.com.

On why the log splitting fails, since Master does a distributed log split, the reason of failure would also exist on the alive RS logs that tried to assist with the log splitting. In the snippet posted above, this host was hostnamedn01.com and hostnamedn05.com.

Re: Hbase : All region servers down and hbase connectivity lost

New Contributor

I have similar problem can you please help

 

logs of HMaster-

 

2019-05-21 15:16:39,745 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2F10.136.107.153%2C16020%2C1558423482915-splitting%2F10.136.107.153%252C16020%252C1558423482915.default.1558426329243
2019-05-21 15:16:39,745 WARN [MASTER_SERVER_OPERATIONS-DIGPUNPERHEL02:16000-0] master.SplitLogManager: error while splitting logs in [hdfs://10.136.107.59:9000/hbase/WALs/10.136.107.153,16020,1558423482915-splitting] installed = 3 but only 0 done
2019-05-21 15:16:39,747 ERROR [MASTER_SERVER_OPERATIONS-DIGPUNPERHEL02:16000-0] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for 10.136.107.153,16020,1558423482915, will retry
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://10.136.107.59:9000/hbase/WALs/10.136.107.153,16020,1558423482915-splitting] Task = installed = 3 done = 0 error = 3
at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:213)
... 4 more
2019-05-21 15:16:39,748 FATAL [MASTER_SERVER_OPERATIONS-DIGPUNPERHEL02:16000-0] master.HMaster: Master server abort: loaded coprocessors are: []
2019-05-21 15:16:39,748 FATAL [MASTER_SERVER_OPERATIONS-DIGPUNPERHEL02:16000-0] master.HMaster: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for 10.136.107.153,16020,1558423482915, will retry
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:357)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://10.136.107.59:9000/hbase/WALs/10.136.107.153,16020,1558423482915-splitting] Task = installed = 3 done = 0 error = 3

 

Logs of HRegionServer -

 

2019-05-21 15:16:38,669 INFO [RS_LOG_REPLAY_OPS-DIGPUNPERHEL02:16020-0-Writer-1] wal.WALSplitter: Creating writer path=hdfs://10.136.107.59:9000/hbase/data/dev2/observation/8baa93c0a9ddc9ab4ebfead1a50d85b2/recovered.edits/0000000000002354365.temp region=8baa93c0a9ddc9ab4ebfead1a50d85b2
2019-05-21 15:16:38,701 WARN [Thread-101] hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/data/dev2/observation/8baa93c0a9ddc9ab4ebfead1a50d85b2/recovered.edits/0000000000002354365.temp could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1547)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:724)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

at org.apache.hadoop.ipc.Client.call(Client.java:1411)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy16.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

Don't have an account?
Coming from Hortonworks? Activate your account here