
Region servers start throwing timeout exceptions one by one, and eventually the entire Hadoop cluster dies after working for 7 to 10 days


New Contributor

Hi,
The Hadoop installation I am working with has 4 machines in total, with a region server deployed on each of them. Hadoop keeps working fine for 7 to 10 days, then starts building up inconsistencies over time. After a few more days, the region servers start to go down one by one, continually throwing TimeoutException and NotServingRegionException.

All the logs are filled with the following exceptions. If we restart Hadoop, everything starts working just fine and the inconsistencies automatically go away, but we don't want to monitor the system constantly and manually restart it every 5 to 10 days.

2018-06-10 15:06:30,532 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure (Event_V1_nc_2018-05-15) in zk
2018-06-10 15:06:30,534 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 already exists
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Finished creating abort znode:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ProcedureMember: Request received to abort procedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@722f85b2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:619)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
2018-06-10 15:06:30,535 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
2018-06-11 02:23:33,801 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet
2018-06-11 02:23:33,801 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-06-11 02:23:36,802 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620
2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:240)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2295)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:911)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(AbstractRpcClient.java:292)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1273)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
	... 5 more
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:73)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1247)
	... 6 more
2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-06-11 02:23:12,624 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620
2018-06-11 02:23:12,784 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet
org.apache.hadoop.hbase.NotServingRegionException: Region EH_ReportProfiles_V1,,1509608701055.7d9965ccdeaab07193c53deffa6cd876. is not online on prod-ehdb2,60020,1526469743500
	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2921)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1053)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2384)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33648)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
/hbase/data/default/EH_EhafCache_V1/7c155487897f0b21b1e204594bd767a6/.tmp/e33a609842144f24aaac3fbd3634e93c retrying...
2018-06-11 02:37:01,824 WARN org.apache.hadoop.hbase.regionserver.HStore: Failed flushing store file, retrying num=4
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2678)
	at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2640)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2604)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
	at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:142)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:55)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:110)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:640)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)

Re: Region servers start throwing timeout exceptions one by one, and eventually the entire Hadoop cluster dies after working for 7 to 10 days

Expert Contributor

Hi,

Usually, timeouts happen because the cluster is undersized, there are no dedicated nodes for HBase, or ingestion is so fast that HBase has to perform a lot of region splits.
- Do you manage a lot of data with HBase? If yes, did you pre-split your tables?
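For example, a table can be pre-split at creation time from the HBase shell. The table name, column-family name, and split keys below are illustrative; the split keys should match the actual distribution of your row keys:

```shell
hbase> create 'Event_V1', 'cf', SPLITS => ['d', 'h', 'l', 'p', 't']
# Creates the table with 6 regions up front instead of 1, so heavy
# ingestion is spread across region servers from the start rather
# than triggering a cascade of region splits later.
```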

- If I were you, I would also have a look at CPU, memory, and disk I/O usage. If you don't have any dedicated nodes for HBase, other Hadoop components like Spark, Hive, etc. can have an impact.
As a general best practice, you should have dedicated nodes for HBase with enough CPU and several disks.
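A quick, non-exhaustive way to check for resource pressure on each region-server host (standard Linux tools, nothing HBase-specific) might look like this:

```shell
# Quick resource check on a RegionServer host: full disks, memory
# pressure, and high load are common causes of flush/RPC timeouts.
df -h /                      # disk usage on the root volume
free -m                      # memory and swap usage in MB (Linux)
uptime                       # load averages to compare against core count
```

If load or I/O wait spikes correlate with the timeout exceptions in the logs, that points to undersizing or co-located workloads rather than an HBase bug.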

Best regards,
Michel