Created on 06-13-2018 12:11 PM - edited 09-16-2022 06:20 AM
Hi,
The hadoop installation I am working with have total 4 machines, and it has 4 region servers deployed on all of them. Hadoop keeps working fine for 7 to 10 days, and then start building inconsistencies over time. After some days, region servers start to go down 1 by 1, and keep throwing back Timeout exceptions and NotServingRegionException.
All the logs are filled with the following exceptions, and if we restart hadoop, it starts working just fine, and inconsistencies automatically go away.
But we don't want to monitor the system all the time and then manually restart it after every 5 to 10 days.
The region server and hbase master logs are full of the errors I am pasting below, and they start happening just anytime after working fine for many days.
2018-06-10 15:06:30,532 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure (Event_V1_nc_2018-05-15) in zk 2018-06-10 15:06:30,534 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 already exists 2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Finished creating abort znode:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ProcedureMember: Request received to abort procedure Event_V1_nc_2018-05-15 org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@722f85b2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:619) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) 2018-06-10 15:06:30,535 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
2018-06-11 02:23:33,801 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet 2018-06-11 02:23:33,801 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying. 2018-06-11 02:23:36,802 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620
2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we are up com.google.protobuf.ServiceException: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired. at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:240) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982) at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2295) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:911) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired. at org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(AbstractRpcClient.java:292) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1273) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227) ... 5 more Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired. at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:73) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1247) ... 6 more 2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying. 2018-06-11 02:23:12,624 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620 2018-06-11 02:23:12,784 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet
org.apache.hadoop.hbase.NotServingRegionException: Region EH_ReportProfiles_V1,,1509608701055.7d9965ccdeaab07193c53deffa6cd876. is not online on prod-ehdb2,60020,1526469743500 at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2921) at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1053) at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2384) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33648) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
/hbase/data/default/EH_EhafCache_V1/7c155487897f0b21b1e204594bd767a6/.tmp/e33a609842144f24aaac3fbd3634e93c retrying... 2018-06-11 02:37:01,824 WARN org.apache.hadoop.hbase.regionserver.HStore: Failed flushing store file, retrying num=4 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2678) at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2640) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2604) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15 org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:142) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:55) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:110) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:640) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) 2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15 org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
Created 03-04-2019 03:18 AM
The error snippets posted do not indicate a FATAL event that could interrupt with the main thread of HMaster or RS. Do you see any FATAL events in their respective roles before the crash? If not, check the standard output logs of these roles and see if they record any OOM situation.