Support Questions

Region servers start throwing timeout exceptions one by one, and eventually the entire cluster goes down after working fine for days

New Contributor

Hi,
The Hadoop installation I am working with has a total of 4 machines, with a region server deployed on each of them. The cluster works fine for 7 to 10 days and then starts building up inconsistencies over time. After a few more days, the region servers start to go down one by one, throwing TimeoutException and NotServingRegionException.

All the logs are filled with the exceptions below. If we restart the cluster, it works fine again and the inconsistencies go away on their own, but we don't want to monitor the system constantly and restart it manually every 5 to 10 days.

The region server and HBase master logs are full of the errors pasted below; they start appearing at random after the cluster has been working fine for many days.

2018-06-10 15:06:30,532 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure (Event_V1_nc_2018-05-15) in zk
2018-06-10 15:06:30,534 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 already exists
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Finished creating abort znode:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ProcedureMember: Request received to abort procedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@722f85b2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:619)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
2018-06-10 15:06:30,535 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
2018-06-11 02:23:33,801 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet
2018-06-11 02:23:33,801 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-06-11 02:23:36,802 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620
2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:240)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2295)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:911)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(AbstractRpcClient.java:292)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1273)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
	... 5 more
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:73)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1247)
	... 6 more
2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-06-11 02:23:12,624 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620
2018-06-11 02:23:12,784 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet
org.apache.hadoop.hbase.NotServingRegionException: Region EH_ReportProfiles_V1,,1509608701055.7d9965ccdeaab07193c53deffa6cd876. is not online on prod-ehdb2,60020,1526469743500
	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2921)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1053)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2384)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33648)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)

/hbase/data/default/EH_EhafCache_V1/7c155487897f0b21b1e204594bd767a6/.tmp/e33a609842144f24aaac3fbd3634e93c retrying...
2018-06-11 02:37:01,824 WARN org.apache.hadoop.hbase.regionserver.HStore: Failed flushing store file, retrying num=4
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2678)
	at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2640)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2604)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
	at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:142)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:55)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:110)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:640)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)

Expert Contributor

The error snippets posted do not indicate a FATAL event that would interrupt the main thread of the HMaster or the RegionServers. Do you see any FATAL events in the logs of those roles before the crash? If not, check the standard output (.out) logs of these roles and see if they record any OOM situation.
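
A quick way to scan for those markers is sketched below (not HBase tooling, just a minimal Python script; the /var/log/hbase directory and the hbase-*-master-* / hbase-*-regionserver-* file name patterns are assumptions, so adjust them to wherever your installation writes its logs):

# Minimal sketch: scan role logs for FATAL events and the .out files for
# OutOfMemoryError. The paths and file name patterns are assumptions.
import glob

LOG_DIR = "/var/log/hbase"  # assumed log directory

def scan(pattern, needle):
    """Print every line in the files matching `pattern` that contains `needle`."""
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if needle in line:
                    print(f"{path}:{lineno}: {line.rstrip()}")

# FATAL events in the HMaster / RegionServer logs
scan(f"{LOG_DIR}/hbase-*-master-*.log", "FATAL")
scan(f"{LOG_DIR}/hbase-*-regionserver-*.log", "FATAL")

# OOM situations recorded in the standard output logs
scan(f"{LOG_DIR}/hbase-*-master-*.out", "OutOfMemoryError")
scan(f"{LOG_DIR}/hbase-*-regionserver-*.out", "OutOfMemoryError")

A FATAL entry or an OutOfMemoryError found this way would narrow down which role's JVM is actually failing first, as opposed to the timeouts you see afterwards.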