Contributor
Posts: 31
Registered: ‎07-11-2017

Hbase : All region servers down and hbase connectivity lost

Hi.

 

We have hit this issue many times now. All region servers go down, and when the HBase Master then tries to split their logs it throws an error, and all HBase connectivity is lost until we restart the cluster. Has anyone else faced this kind of issue? It would be great to know the root cause: why the region servers are going down, and why the HBase Master cannot bring them back into service through log splitting.

 

2017-07-11 00:03:32,380 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://nameservice1/hbase/WALs/hostnamedn01.com,60020,1438807980302-splitting] installed = 1 but only 0 done
2017-07-11 00:03:32,381 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for hostnamedn01.com,60020,1438807980302, will retry
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:346)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:219)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://nameservice1/hbase/WALs/hostnamedn01.com,60020,1438807980302-splitting] Task = installed = 1 done = 0 error = 1
at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:291)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:212)
... 4 more
2017-07-11 00:03:32,385 INFO org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fhostnamedn05.com%2C60020%2C1438807980454-splitting%2Fhostnamedn05.com%252C60020%252C1438807980454.1438865586074 acquired by hostnamedn05.com,60020,1498483521347
2017-07-11 00:03:32,477 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for hostnamedn01.com,60020,1438807980302 before assignment; region count=0
2017-07-11 00:03:32,481 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [hostnamedn01.com,60020,1438807980302]
2017-07-11 00:03:32,482 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://nameservice1/hbase/WALs/hostnamedn01.com,60020,1438807980302-splitting] for [hostnamedn01.com,60020,1438807980302]
2017-07-11 00:03:37,505 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 4 unassigned = 2 tasks={/hbase/splitWAL/WALs%2Fhostnamedn03.com%2C60020%2C1438807980381-splitting%2Fhostnamedn03.com%252C60020%252C1438807980381.1438865589887=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn02.com%2C60020%2C1438807980385-splitting%2Fhostnamedn02.com%252C60020%252C1438807980385.1438865587122=last_update = 1499749403398 last_version = 30 cur_worker_name = hostnamedn05.com,60020,1498483521347 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn05.com%2C60020%2C1438807980454-splitting%2Fhostnamedn05.com%252C60020%252C1438807980454.1438865586074=last_update = 1499749412430 last_version = 2 cur_worker_name = hostnamedn05.com,60020,1498483521347 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn01.com%2C60020%2C1438807980302-splitting%2Fhostnamedn01.com%252C60020%252C1438807980302.1438865586081=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2017-07-11 00:03:43,505 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 4 unassigned = 2 tasks={/hbase/splitWAL/WALs%2Fhostnamedn03.com%2C60020%2C1438807980381-splitting%2Fhostnamedn03.com%252C60020%252C1438807980381.1438865589887=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn02.com%2C60020%2C1438807980385-splitting%2Fhostnamedn02.com%252C60020%252C1438807980385.1438865587122=last_update = 1499749403398 last_version = 30 cur_worker_name = hostnamedn05.com,60020,1498483521347 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn05.com%2C60020%2C1438807980454-splitting%2Fhostnamedn05.com%252C60020%252C1438807980454.1438865586074=last_update = 1499749412430 last_version = 2 cur_worker_name = hostnamedn05.com,60020,1498483521347 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0, /hbase/splitWAL/WALs%2Fhostnamedn01.com%2C60020%2C1438807980302-splitting%2Fhostnamedn01.com%252C60020%252C1438807980302.1438865586081=last_update = -1 last_version = -1 cur_worker_name = null status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2017-07-11 00:03:49,411 INFO org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fhostnamedn02.com%2C60020%2C1438807980385-splitting%2Fhostnamedn02.com%252C60020%252C1438807980385.1438865587122 entered state: ERR hostnamedn05.com,60020,1498483521347
2017-07-11 00:03:49,412 WARN org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fhostnamedn02.com%2C60020%2C1438807980385-splitting%2Fhostnamedn02.com%252C60020%252C1438807980385.1438865587122
2017-07-11 00:03:49,412 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://nameservice1/hbase/WALs/hostnamedn02.com,60020,1438807980385-splitting] installed = 1 but only 0 done
2017-07-11 00:03:49,413 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
java.io.IOException: failed log splitting for hostnamedn02.com,60020,1438807980385, will retry
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:346)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:219)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://nameservice1/hbase/WALs/hostnamedn02.com,60020,1438807980385-splitting] Task = installed = 1 done = 0 error = 1
at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:291)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:391)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:364)
at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:286)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:212)

Posts: 1,566
Kudos: 287
Solutions: 240
Registered: ‎07-31-2013

Re: Hbase : All region servers down and hbase connectivity lost

HBase must perform a log split whenever an RS (RegionServer) goes down uncleanly.

On why your RSs went down uncleanly: you'd need to check for FATAL messages in the individual RS logs, as the reason is not in the Master log snippet posted above. The dead server appears to have been hostnamedn02.com.
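
As a rough aid for that FATAL check, here is a minimal Python sketch that scans log files for FATAL-level lines. It assumes the RS logs use the same "date time LEVEL class: message" layout as the Master snippet above; the function name find_fatal_lines is made up for illustration, and the log file paths are whatever your installation uses (they are not given in this thread).

```python
import sys


def find_fatal_lines(lines):
    """Return (line_number, text) pairs for lines whose log level is FATAL.

    Assumes the layout seen in the Master log above:
    "<date> <time> <LEVEL> <class>: <message>".
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        fields = line.split(None, 3)  # date, time, level, remainder
        if len(fields) >= 3 and fields[2] == "FATAL":
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Pass one or more RS log files as arguments; paths depend on your install.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, text in find_fatal_lines(handle):
                print(f"{path}:{number}: {text}")
```

The lines just before each FATAL hit in the RS log usually matter as much as the FATAL line itself, so treat the reported line numbers as a starting point for reading the surrounding context.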

On why the log splitting fails: since the Master does a distributed log split, the reason for the failure would also appear in the logs of the live RSs that tried to assist with the splitting. In the snippet posted above, these hosts were hostnamedn01.com and hostnamedn05.com.
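
To work out which live RS logs to open, one option is to pull the "entered state: ERR" lines out of the Master log, since each of those names the worker that reported the failed split task. The sketch below is only an illustration: the regular expression is based solely on the line format visible in the snippet above, and the function name workers_reporting_err is hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# Matches Master log lines such as:
#   "... task <znode path> entered state: ERR <worker server name>"
ERR_PATTERN = re.compile(r"task (\S+) entered state: ERR (\S+)")


def workers_reporting_err(master_log_lines):
    """Map each worker server name to the split tasks it reported as ERR."""
    failures = defaultdict(list)
    for line in master_log_lines:
        match = ERR_PATTERN.search(line)
        if match:
            task, worker = match.groups()
            # The task path is URL-encoded; decode one level for readability.
            failures[worker].append(unquote(task))
    return dict(failures)
```

Run against the snippet above, this would report the worker hostnamedn05.com,60020,1498483521347 for the hostnamedn02.com WAL, so the RS log on that host around 00:03:49 is a reasonable first place to look for the underlying exception.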
Backline Customer Operations Engineer