
Over 60k CLOSE_WAIT sockets that only a cluster restart clears

I am seeing some very strange behavior that requires me to restart the cluster each time to clear it.

For context, I am on:

Cloudera version 5.8.5

Hadoop 2.6.0-cdh5.8.5
Subversion http://github.com/cloudera/hadoop -r 47218bf3433a3c3e52036f79d99a597fed09f261
Compiled by jenkins on 2017-05-11T21:06Z
Compiled with protoc 2.5.0
From source with checksum 197c50392cb0362b8f23f945ae5aca42
This command was run using /opt/cloudera/parcels/CDH-5.8.5-1.cdh5.8.5.p0.5/jars/hadoop-common-2.6.0-cdh5.8.5.jar
HBase 1.2.0-cdh5.8.5
Source code repository file:///data/jenkins/workspace/generic-package-ubuntu64-14-04/CDH5.8.5-Packaging-HBase-2017-05-11_13-49-18/hbase-1.2.0+cdh5.8.5+263-1.cdh5.8.5.p0.10~trusty revision=Unknown
Compiled by jenkins on Thu May 11 14:14:09 PDT 2017
From source with checksum 888872fd1ae945e40fea73d87f264b23


java -version
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)


After a while, I start seeing the following warnings.

The master logs report:

2017-07-29 19:33:51,775 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=351, exceptions:
Sat Jul 29 19:33:51 UTC 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=78531: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hadoop1-private.sjc03.infra.ripple.com,60020,1501335437100, seqNum=0

	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:286)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:231)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:61)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
	at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
	at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
	at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:867)
	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
	at org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:183)
	at org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:135)
	at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:236)
	at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:117)
	at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:110)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=78531: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hadoop1-private.sjc03.infra.ripple.com,60020,1501335437100, seqNum=0
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
	at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
	... 3 more
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, too many items queued ?
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1268)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094)
	at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:400)
	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:204)
	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:65)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:381)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:355)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
	... 4 more
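
The CallQueueTooBigException above points at the region server RPC queue rather than the network itself, so checking for call-queue overrides seemed worthwhile. A sketch of that check (the path assumes a CM/parcel-managed config dir, and the defaults in the comments are my reading of the HBase 1.2 code, so treat them as assumptions):

# look for any call-queue overrides in the effective config
grep -B1 -A2 callqueue /etc/hbase/conf/hbase-site.xml

# properties of interest in HBase 1.2 (defaults as I understand them):
#   hbase.ipc.server.max.callqueue.length - queued calls per queue, default handler count * 10
#   hbase.ipc.server.max.callqueue.size   - total queued bytes, default 1 GB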

Every once in a while, the region server logs show:

2017-07-29 19:35:44,515 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 3 actions: RemoteWithExtrasException: 3 times,
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:258)
	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$2000(AsyncProcess.java:238)
	at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1682)
	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:256)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:163)
	at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:198)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.replicateWALEntry(RSRpcServices.java:1820)
	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22253)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
	at java.lang.Thread.run(Thread.java:748)
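
Since the sink is rejecting edits, replication presumably backs up behind it. A quick way to confirm from the HBase shell (a standard shell command, nothing version-specific as far as I know):

echo "status 'replication'" | hbase shell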

I also see the following errors in the DataNode logs:

2017-07-29 14:45:39,705 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop5-private.sjc03.infra.ripple.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.160.22.70:54803 dst: /10.160.22.113:50010
java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:808)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
	at java.lang.Thread.run(Thread.java:748)
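
The DataXceiver / premature-EOF errors can also appear when the DataNode runs out of transfer threads, so it seems worth ruling that out. A sketch of the check (dfs.datanode.max.transfer.threads is the Hadoop 2.x name for the old dfs.datanode.max.xcievers):

# print the effective DataNode transfer-thread limit on a DataNode host
hdfs getconf -confKey dfs.datanode.max.transfer.threads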

Counting the CLOSE_WAIT sockets shows the scale of the leak:

netstat -nap | grep CLOSE_WAIT -c
68699
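
Breaking that count down by owning process and by remote port makes the leak easier to attribute (run as root so netstat can resolve PIDs). If the leak is between the region server and the DataNodes, most of these sockets should belong to the region server JVM and point at the DataNode data-transfer port, 50010 here:

# CLOSE_WAIT sockets per owning process
netstat -nap | awk '/CLOSE_WAIT/ {print $7}' | sort | uniq -c | sort -rn | head

# CLOSE_WAIT sockets per remote port
netstat -nap | awk '/CLOSE_WAIT/ {print $5}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn | head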

Digging around, I believe I am hitting https://issues.apache.org/jira/browse/HBASE-9393.

Checking the release notes, it doesn't look like this patch was backported.

Will a backport be scheduled?