<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Region Servers start throwing Timeout exceptions 1 by 1 and eventually entire hadoop dies after working 7 to 10 days in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Region-Servers-start-throwing-Timeout-exceptions-1-by-1-and/m-p/221415#M183289</link>
    <description>&lt;P&gt;Hi,&lt;BR /&gt;The hadoop installation I am working with have total 4 machines, and it has 4 region servers deployed on all of them. Hadoop keeps working fine for 7 to 10 days, and then start building inconsistencies over time. After some days, region servers start to go down 1 by 1, and keep throwing back Timeout exceptions and NotServingRegionException.&lt;/P&gt;&lt;P&gt;All the logs are filled with the following exceptions, and if we restart hadoop, it starts working just fine, and inconsistencies automatically go away.&lt;BR /&gt;But we don't want to monitor the system all the time and then manually restart it after every 5 to 10 days.&lt;/P&gt;&lt;PRE&gt;2018-06-10 15:06:30,532 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure (Event_V1_nc_2018-05-15) in zk
2018-06-10 15:06:30,534 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15 already exists
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Finished creating abort znode:/hbase/flush-table-proc/abort/Event_V1_nc_2018-05-15
2018-06-10 15:06:30,534 DEBUG org.apache.hadoop.hbase.procedure.ProcedureMember: Request received to abort procedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@722f85b2:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:619)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528635930532, End:1528635990532, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
2018-06-10 15:06:30,535 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
&lt;/PRE&gt;&lt;PRE&gt;2018-06-11 02:23:33,801 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet
2018-06-11 02:23:33,801 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-06-11 02:23:36,802 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620
&lt;/PRE&gt;
&lt;PRE&gt;2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:240)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2295)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:911)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Call to prod-ehdb1/172.16.2.36:60000 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(AbstractRpcClient.java:292)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1273)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
	... 5 more
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=3, waitTime=10001, operationTimeout=10000 expired.
	at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:73)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1247)
	... 6 more
2018-06-11 02:23:09,623 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2018-06-11 02:23:12,624 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=prod-ehdb1,60000,1526469728437 with port=60020, startcode=1528676538620
2018-06-11 02:23:12,784 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Master is not running yet
&lt;/PRE&gt;
&lt;PRE&gt;org.apache.hadoop.hbase.NotServingRegionException: Region EH_ReportProfiles_V1,,1509608701055.7d9965ccdeaab07193c53deffa6cd876. is not online on prod-ehdb2,60020,1526469743500
	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2921)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1053)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2384)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33648)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
&lt;/PRE&gt;&lt;PRE&gt;/hbase/data/default/EH_EhafCache_V1/7c155487897f0b21b1e204594bd767a6/.tmp/e33a609842144f24aaac3fbd3634e93c retrying...
2018-06-11 02:37:01,824 WARN org.apache.hadoop.hbase.regionserver.HStore: Failed flushing store file, retrying num=4
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
	at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2678)
	at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2640)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2604)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
	at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
&lt;/PRE&gt;
&lt;PRE&gt;2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:332)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:142)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:55)
	at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:110)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:640)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:69)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
2018-06-10 11:25:17,693 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Propagating foreign exception to subprocedure Event_V1_nc_2018-05-15
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2ea555d4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1528622657688, End:1528622717688, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
&lt;/PRE&gt;</description>
    <pubDate>Wed, 13 Jun 2018 20:01:59 GMT</pubDate>
    <dc:creator>Farhan</dc:creator>
    <dc:date>2018-06-13T20:01:59Z</dc:date>
  </channel>
</rss>

