Created on 11-27-2020 11:06 PM - edited 11-27-2020 11:15 PM
We keep flapping between HDFS Canary Good and HDFS Canary Bad, along with DataNode Health Bad/Concerning alerts, as shown in this health history (newest first):
HDFS Canary Good (2 Still Concerning) - Nov 27 12:15:53 PM
HDFS Canary Bad - Nov 27 12:15:08 PM
DataNode Health Concerning - Nov 27 11:58:47 AM
DataNode Health Bad - Nov 27 11:58:12 AM
DataNode Health Concerning - Nov 27 10:07:15 AM
DataNode Health Bad - Nov 27 10:07:00 AM
DataNode Health Concerning - Nov 27 9:29:35 AM
DataNode Health Bad - Nov 27 9:29:20 AM
DataNode Health Concerning - Nov 27 8:45:31 AM
DataNode Health Bad - Nov 27 8:45:06 AM
DataNode Health Concerning - Nov 26 10:03 PM
HDFS Canary Good (2 Still Bad) - Nov 26 10:02:23 PM
DataNode Health Bad - Nov 26 10:02:18 PM
HDFS Canary Bad - Nov 26 10:01:42 PM
HDFS Canary Good (2 Still Concerning) - Nov 26 8:01:53 PM
HDFS Canary Bad - Nov 26 8:01:03 PM
HDFS Canary Good (2 Still Concerning) - Nov 26 6:16:18 PM
HDFS Canary Bad - Nov 26 6:15:38 PM
DataNode Health Concerning - Nov 26 4:45:01 PM
DataNode Health Bad
We are finding these logs in the Service Monitor:
12:06:35.706 PM INFO LDBPartitionManager
Expiring partition LDBPartitionMetadataWrapper{tableName=stream, partitionName=stream_2020-11-24T10:05:30.100Z, startTime=2020-11-24T10:05:30.100Z, endTime=2020-11-24T10:55:30.100Z, version=2, state=CLOSED}
12:06:35.706 PM INFO LDBPartitionMetadataStore
Setting partition state=DELETING for partition LDBPartitionMetadataWrapper{tableName=stream, partitionName=stream_2020-11-24T10:05:30.100Z, startTime=2020-11-24T10:05:30.100Z, endTime=2020-11-24T10:55:30.100Z, version=2, state=CLOSED}
12:06:35.717 PM INFO LDBPartitionManager
Couldn't close partition because it was already closed by another thread
12:06:35.718 PM INFO LDBPartitionMetadataStore
Deleting partition LDBPartitionMetadataWrapper{tableName=stream, partitionName=stream_2020-11-24T10:05:30.100Z, startTime=2020-11-24T10:05:30.100Z, endTime=2020-11-24T10:55:30.100Z, version=2, state=CLOSED}
12:06:39.374 PM INFO LDBTimeSeriesRollupManager
Running the LDBTimeSeriesRollupManager at 2020-11-27T10:06:39.374Z, forMigratedData=false
12:11:39.374 PM INFO LDBTimeSeriesRollupManager
Running the LDBTimeSeriesRollupManager at 2020-11-27T10:11:39.374Z, forMigratedData=false
12:11:39.375 PM INFO LDBTimeSeriesRollupManager
Starting rollup from raw to rollup=TEN_MINUTELY for rollupTimestamp=2020-11-27T10:10:00.000Z
12:11:41.505 PM INFO LDBTimeSeriesRollupManager
Finished rollup: duration=PT2.130S, numStreamsChecked=54046, numStreamsRolledUp=18786
12:13:40.962 PM INFO LDBResourceManager
Closed: 0 partitions
12:14:57.535 PM INFO DataStreamer
Exception in createBlockOutputStream blk_1086073148_12332434
java.net.SocketTimeoutException: 13000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.27:47442 remote=/172.27.12:9866]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:537)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1762)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1679)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
12:14:57.536 PM WARN DataStreamer
Abandoning BP-1768670017-172.-1592847899660:blk_1086073148_12332434
12:14:57.543 PM WARN DataStreamer
Excluding datanode DatanodeInfoWithStorage[172.27.129.28:9866,DS-211016d1-2920-4748-ba83-46a493759fe3,DISK]
12:15:05.558 PM INFO DataStreamer
Exception in createBlockOutputStream blk_1086073149_12332435
java.net.SocketTimeoutException: 8000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.27.129.30:56202 remote=/172.27.129.29:9866]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:537)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1762)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1679)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
12:15:05.559 PM WARN DataStreamer
Abandoning BP-1768670017-172.27.0-1592847899660:blk_1086073149_12332435
12:15:05.568 PM WARN DataStreamer
Excluding datanode DatanodeInfoWithStorage[172.27.:9866,DS-5696ff0f-56d5-4dab-b0c3-5fbdde410da4,DISK]
12:15:05.573 PM WARN DataStreamer
These are our current cluster values; we think these values are the issue:
dfs.socket.timeout : 3000
dfs.datanode.socket.write.timeout : 3000
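If we understand the client behavior correctly (our assumption, based on the Hadoop DFS client adding a READ_TIMEOUT_EXTENSION of 5000 ms per datanode in the write pipeline on top of dfs.socket.timeout), this setting would explain the exact timeouts in the stack traces above:
3000 ms + 5000 ms x 2 pipeline datanodes = 13000 millis (first exception)
3000 ms + 5000 ms x 1 remaining datanode = 8000 millis (after the first datanode was excluded)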
On the internet we found that these values are usually set like this. Is this the issue, or is it something else?
dfs.socket.timeout : 60000
dfs.datanode.socket.write.timeout : 480000
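As a sketch, assuming a recent Hadoop release where dfs.client.socket-timeout is the current name of the legacy dfs.socket.timeout key, the same values in hdfs-site.xml would look like this:

<property>
  <name>dfs.client.socket-timeout</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>480000</value>
</property>

On a Cloudera Manager cluster these would go into the HDFS safety valve for hdfs-site.xml rather than into the file directly.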
Created 11-29-2020 09:35 AM
@Raj77 I agree with your analysis; you can give it a try. In some cases I have even seen these values set much higher, like below:
HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml:
dfs.client.socket-timeout : 3000000
dfs.datanode.socket.write.timeout : 3000000
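After saving the safety valve, redeploy the client configuration and restart the affected roles. Since the HDFS canary runs from the Cloudera Manager Service Monitor, that role may also need a restart to pick up new client settings. You can check what value clients actually pick up with hdfs getconf -confKey dfs.client.socket-timeout.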
Created 12-11-2020 06:08 AM
Thanks for your response. We configured 60000, and at present it is OK.