Created 06-13-2018 01:01 PM
Hi,
The hadoop installation I am working with have total 4 machines, and it has 4 region servers deployed on all of them. Hadoop keeps working fine for 7 to 10 days, and then start building inconsistencies over time. After some days, region servers start to go down 1 by 1, and keep throwing back Timeout exceptions and NotServingRegionException.
All the logs are filled with the following exceptions, and if we restart hadoop, it starts working just fine, and inconsistencies automatically go away.
But we don't want to monitor the system all the time and then manually restart it after every 5 to 10 days.
2018-06-10 15:06:30,532 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure (Event_V1_nc_2018-05-15) in zk 2018-06-10 15:06:30,534 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 already exists 2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Finished creating abort znode:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ProcedureMember: Request received to abort procedure Event_V1_nc_2018-05-15 org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@722f85b2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:619) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) 2018-06-10 15:06:30,535 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
2018-06-11 02:23:33,801 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet 2018-06-11 02:23:33,801 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying. 2018-06-11 02:23:36,802 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620
2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we are up com.google.protobuf.ServiceException: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired. at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:240) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336) at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982) at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2295) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:911) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired. at org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(AbstractRpcClient.java:292) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1273) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227) ... 5 more Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired. at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:73) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1247) ... 6 more 2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying. 2018-06-11 02:23:12,624 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620 2018-06-11 02:23:12,784 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet
org.apache.hadoop.hbase.NotServingRegionException: Region EH_ReportProfiles_V1,,1509608701055.7d9965ccdeaab07193c53deffa6cd876. is not online on prod-ehdb2,60020,1526469743500 at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2921) at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1053) at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2384) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33648) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
/hbase/data/default/EH_EhafCache_V1/7c155487897f0b21b1e204594bd767a6/.tmp/e33a609842144f24aaac3fbd3634e93c retrying... 2018-06-11 02:37:01,824 WARN org.apache.hadoop.hbase.regionserver.HStore: Failed flushing store file, retrying num=4 java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2678) at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2640) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2604) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15 org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:142) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:55) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:110) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:640) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69) at java.util.TimerThread.mainLoop(Timer.java:555) at java.util.TimerThread.run(Timer.java:505) 2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15 org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
Created 06-13-2018 02:01 PM
HI,
Usually timeout happen because the cluster is undersized or no dedicated node for hbase or the ingestion is so quick that hbase need to do a lot of split of region.
- Do you manage a lot of data with hbase? if yes, idd you pre-split your table?
- If i was you I would also have a look to the cpu, memory and io disk usage. If you dont have anydedicated nodes for hbase other hadoop component like spark, hive, etc can have an impact.
As a general best practice, you should have dedicated node for hbase with enough cpu and several disk
Mest regards,
Michel