Member since: 02-18-2016
Posts: 141
Kudos Received: 19
Solutions: 18
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2882 | 12-18-2019 07:44 PM |
| | 2912 | 12-15-2019 07:40 PM |
| | 1165 | 12-03-2019 06:29 AM |
| | 1184 | 12-02-2019 06:47 AM |
| | 3516 | 11-28-2019 02:06 AM |
09-28-2023
09:32 PM
Hi @willx I was referring to [HDFS-14758] Decrease lease hard limit - ASF JIRA (apache.org). I am not sure reducing hbase.lease.recovery.timeout will help, since even at 10 minutes it would not solve the underlying problem. My question remains: since there is already a good replica on an UP datanode, why does recovery keep connecting to the dead datanode? I came up with the findings below; please check whether you agree (a configuration sketch follows the list).
1. IPC client retries: reduce the value from 50 to 10.
-- The active datanode keeps trying to connect to the dead datanode for about 15 minutes. Reducing the retries lets it give up on the dead datanode sooner.
2. hbase.lease.recovery.timeout: reduce from 15 minutes to 10 minutes or less.
-- HBase enforces a hard lease-recovery timeout after an abrupt shutdown of the Master. Reducing this timeout can recover/release the lease sooner.
3. Replication factor: increase from 3 to 5 only for the "MasterProcWALs" directory.
-- A higher replication factor increases the chance that a good replica is available for recovery.
4. Rack topology: modify the topology to distribute replicas across 3 logical zones instead of 2.
-- As per the current rack topology, data is distributed across only 2 racks (the existing state is shown below), so when block recovery takes place it consults the rack-topology file and finds that both datanodes of the affected zone are dead. Adding a logical rack3 will spread the block replicas across 3 different datanodes, so the chance of successful block recovery is higher.
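Here is a rough sketch of the changes I am considering (the property names are the standard Hadoop/HBase ones; the values are only what I plan to test, not verified recommendations, and the rack entries are illustrative):
```
# 1. Fewer IPC connect retries so a dead datanode is abandoned faster (core-site.xml via Ambari);
#    which of the two properties matters depends on whether the connect fails or times out
#      ipc.client.connect.max.retries = 10
#      ipc.client.connect.max.retries.on.timeouts = 10
# 2. Shorter HBase lease-recovery hard timeout (hbase-site.xml), 10 minutes in milliseconds
#      hbase.lease.recovery.timeout = 600000
# 3. Higher replication only for the MasterProcWALs directory (affects existing files;
#    new WAL files are still created with the client-side dfs.replication)
hdfs dfs -setrep -w 5 /apps/hbase/data/MasterProcWALs
# 4. Three logical racks instead of two in the topology mapping (illustrative entries)
#      172.29.27.95=/rack2   172.29.226.185=/rack1   172.29.27.91=/rack3
```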
09-27-2023
05:57 AM
Datanode Logs: ics029027095.ics-eu-2.asml.com/172.29.27.95
2023-08-11T02:57:52.973+02:00 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: recoverBlocks FAILED: RecoveringBlock{BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444; getBlockSize()=286; corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[172.29.27.95:50010,null,null], DatanodeInfoWithStorage[172.29.226.185:50010,null,null], DatanodeInfoWithStorage[172.29.27.91:50010,null,null]]}
java.io.IOException: Cannot recover BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, the following datanodes failed: [DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:314)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
2023-08-11T02:57:52.973+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: updateReplica: BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW], recoveryId=7619002, length=2851866, replica=FinalizedReplica, blk_1081354547_7619028, FINALIZED
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/finalized/subdir20/subdir9/blk_1081354547
2023-08-11T02:57:52.972+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 (length=286), bestState=RBW, newBlock=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7619002 (length=2851866), participatingList=[block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW] node:DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
2023-08-11T02:57:52.972+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 (length=286), isTruncateRecovery=false, syncList=[block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW] node:DatanodeInfoWithStorage[172.29.27.95:50010,null,null], block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RWR] node:DatanodeInfoWithStorage[172.29.27.91:50010,null,null]]
2023-08-11T02:57:52.967+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, datanode=DatanodeInfoWithStorage[172.29.226.185:50010,null,null])
org.apache.hadoop.net.ConnectTimeoutException: Call From ics029027095.ics-eu-2.asml.com/172.29.27.95 to ics029226185.ics-eu-2.asml.com:8010 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics029226185.ics-eu-2.asml.com/172.29.226.185:8010]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:775)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:565)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:134)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics029226185.ics-eu-2.asml.com/172.29.226.185:8010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:688)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:791)
at org.apache.hadoop.ipc.Client$Connection.access$3600(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1559)
at org.apache.hadoop.ipc.Client.call(Client.java:1390)
... 10 more
2023-08-11T02:45:05.399+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: updateReplica: BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW], recoveryId=7619028, length=2851866, replica=ReplicaUnderRecovery, blk_1081354547_7617444, RUR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
recoveryId=7619028
original=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
2023-08-11T02:45:05.399+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 (length=286), bestState=RBW, newBlock=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7619028 (length=2851866), participatingList=[block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW] node:DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
2023-08-11T02:45:05.399+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 (length=286), isTruncateRecovery=false, syncList=[block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW] node:DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
2023-08-11T02:45:05.398+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, datanode=DatanodeInfoWithStorage[172.29.27.91:50010,null,null])
java.net.NoRouteToHostException: No Route to Host from ics029027095.ics-eu-2.asml.com/172.29.27.95 to ics029027091.ics-eu-2.asml.com:8010 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:782)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:565)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:134)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:688)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:791)
at org.apache.hadoop.ipc.Client$Connection.access$3600(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1559)
at org.apache.hadoop.ipc.Client.call(Client.java:1390)
... 10 more
2023-08-11T02:42:32.088+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, datanode=DatanodeInfoWithStorage[172.29.226.185:50010,null,null])
org.apache.hadoop.net.ConnectTimeoutException: Call From ics029027095.ics-eu-2.asml.com/172.29.27.95 to ics029226185.ics-eu-2.asml.com:8010 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics029226185.ics-eu-2.asml.com/172.29.226.185:8010]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:775)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:565)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:134)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics029226185.ics-eu-2.asml.com/172.29.226.185:8010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:688)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:791)
at org.apache.hadoop.ipc.Client$Connection.access$3600(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1559)
at org.apache.hadoop.ipc.Client.call(Client.java:1390)
... 10 more
2023-08-11T02:29:44.570+02:00 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: recoverBlocks FAILED: RecoveringBlock{BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444; getBlockSize()=286; corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[172.29.27.95:50010,null,null], DatanodeInfoWithStorage[172.29.226.185:50010,null,null], DatanodeInfoWithStorage[172.29.27.91:50010,null,null]]}
java.io.IOException: Cannot recover BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, the following datanodes failed: [DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:314)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
2023-08-11T02:29:44.570+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to updateBlock (newblock=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7619019, datanode=DatanodeInfoWithStorage[172.29.27.95:50010,null,null])
java.io.IOException: rur.getRecoveryID() != recoveryId = 7619019, rur=ReplicaUnderRecovery, blk_1081354547_7617444, RUR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
recoveryId=7619028
original=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.updateReplicaUnderRecovery(FsDatasetImpl.java:2727)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.updateReplicaUnderRecovery(FsDatasetImpl.java:2691)
at org.apache.hadoop.hdfs.server.datanode.DataNode.updateReplicaUnderRecovery(DataNode.java:2917)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:302)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
2023-08-11T02:29:44.570+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: updateReplica: BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW], recoveryId=7619019, length=2851866, replica=ReplicaUnderRecovery, blk_1081354547_7617444, RUR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
recoveryId=7619028
original=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
2023-08-11T02:29:44.570+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 (length=286), bestState=RBW, newBlock=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7619019 (length=2851866), participatingList=[block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW] node:DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
2023-08-11T02:29:44.570+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 (length=286), isTruncateRecovery=false, syncList=[block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW] node:DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
2023-08-11T02:29:44.569+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, datanode=DatanodeInfoWithStorage[172.29.27.91:50010,null,null])
java.net.NoRouteToHostException: No Route to Host from ics029027095.ics-eu-2.asml.com/172.29.27.95 to ics029027091.ics-eu-2.asml.com:8010 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:782)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:565)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:134)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:688)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:791)
at org.apache.hadoop.ipc.Client$Connection.access$3600(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1559)
at org.apache.hadoop.ipc.Client.call(Client.java:1390)
... 10 more
2023-08-11T02:27:11.259+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, datanode=DatanodeInfoWithStorage[172.29.226.185:50010,null,null])
org.apache.hadoop.net.ConnectTimeoutException: Call From ics029027095.ics-eu-2.asml.com/172.29.27.95 to ics029226185.ics-eu-2.asml.com:8010 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics029226185.ics-eu-2.asml.com/172.29.226.185:8010]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:775)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:565)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:134)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics029226185.ics-eu-2.asml.com/172.29.226.185:8010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:688)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:791)
at org.apache.hadoop.ipc.Client$Connection.access$3600(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1559)
at org.apache.hadoop.ipc.Client.call(Client.java:1390)
... 10 more
2023-08-11T02:14:23.692+02:00 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: recoverBlocks FAILED: RecoveringBlock{BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444; getBlockSize()=286; corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[172.29.27.95:50010,null,null], DatanodeInfoWithStorage[172.29.226.185:50010,null,null], DatanodeInfoWithStorage[172.29.27.91:50010,null,null]]}
java.io.IOException: Cannot recover BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, the following datanodes failed: [DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:314)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
2023-08-11T02:14:23.691+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to updateBlock (newblock=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7618941, datanode=DatanodeInfoWithStorage[172.29.27.95:50010,null,null])
java.io.IOException: rur.getRecoveryID() != recoveryId = 7618941, rur=ReplicaUnderRecovery, blk_1081354547_7617444, RUR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
recoveryId=7619028
original=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.updateReplicaUnderRecovery(FsDatasetImpl.java:2727)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.updateReplicaUnderRecovery(FsDatasetImpl.java:2691)
at org.apache.hadoop.hdfs.server.datanode.DataNode.updateReplicaUnderRecovery(DataNode.java:2917)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:302)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
2023-08-11T02:14:23.691+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: updateReplica: BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW], recoveryId=7618941, length=2851866, replica=ReplicaUnderRecovery, blk_1081354547_7617444, RUR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
recoveryId=7619028
original=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
2023-08-11T02:14:23.691+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 (length=286), bestState=RBW, newBlock=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7618941 (length=2851866), participatingList=[block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW] node:DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
2023-08-11T02:14:23.691+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 (length=286), isTruncateRecovery=false, syncList=[block:blk_1081354547_7617444[numBytes=2851866,originalReplicaState=RBW] node:DatanodeInfoWithStorage[172.29.27.95:50010,null,null]]
2023-08-11T02:14:23.691+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, datanode=DatanodeInfoWithStorage[172.29.27.91:50010,null,null])
java.net.NoRouteToHostException: No Route to Host from ics029027095.ics-eu-2.asml.com/172.29.27.95 to ics029027091.ics-eu-2.asml.com:8010 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:782)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:565)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:134)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:688)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:791)
at org.apache.hadoop.ipc.Client$Connection.access$3600(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1559)
at org.apache.hadoop.ipc.Client.call(Client.java:1390)
... 10 more
2023-08-11T02:12:35.608+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: update recovery id for blk_1081354547_7617444 from 7619019 to 7619028
2023-08-11T02:12:35.608+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1081354547_7617444, recoveryId=7619028, replica=ReplicaUnderRecovery, blk_1081354547_7617444, RUR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
recoveryId=7619019
original=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
2023-08-11T02:12:35.608+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: NameNode at ics029027083.ics-eu-2.asml.com/172.29.27.83:8020 calls recoverBlock(BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, targets=[DatanodeInfoWithStorage[172.29.27.95:50010,null,null], DatanodeInfoWithStorage[172.29.226.185:50010,null,null], DatanodeInfoWithStorage[172.29.27.91:50010,null,null]], newGenerationStamp=7619028, newBlock=null, isStriped=false)
2023-08-11T02:11:50.383+02:00 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, datanode=DatanodeInfoWithStorage[172.29.226.185:50010,null,null])
org.apache.hadoop.net.ConnectTimeoutException: Call From ics029027095.ics-eu-2.asml.com/172.29.27.95 to ics029226185.ics-eu-2.asml.com:8010 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics029226185.ics-eu-2.asml.com/172.29.226.185:8010]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:775)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:565)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:134)
at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:604)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics029226185.ics-eu-2.asml.com/172.29.226.185:8010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:688)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:791)
at org.apache.hadoop.ipc.Client$Connection.access$3600(Client.java:411)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1559)
at org.apache.hadoop.ipc.Client.call(Client.java:1390)
... 10 more
2023-08-11T02:07:14.599+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: update recovery id for blk_1081354547_7617444 from 7619002 to 7619019
2023-08-11T02:07:14.599+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1081354547_7617444, recoveryId=7619019, replica=ReplicaUnderRecovery, blk_1081354547_7617444, RUR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
recoveryId=7619002
original=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
2023-08-11T02:07:14.599+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: NameNode at ics029027083.ics-eu-2.asml.com/172.29.27.83:8020 calls recoverBlock(BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, targets=[DatanodeInfoWithStorage[172.29.27.95:50010,null,null], DatanodeInfoWithStorage[172.29.226.185:50010,null,null], DatanodeInfoWithStorage[172.29.27.91:50010,null,null]], newGenerationStamp=7619019, newBlock=null, isStriped=false)
2023-08-11T02:02:59.595+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: update recovery id for blk_1081354547_7617444 from 7618941 to 7619002
2023-08-11T02:02:59.595+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1081354547_7617444, recoveryId=7619002, replica=ReplicaUnderRecovery, blk_1081354547_7617444, RUR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
recoveryId=7618941
original=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
2023-08-11T02:02:59.594+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: NameNode at ics029027083.ics-eu-2.asml.com/172.29.27.83:8020 calls recoverBlock(BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, targets=[DatanodeInfoWithStorage[172.29.27.95:50010,null,null], DatanodeInfoWithStorage[172.29.226.185:50010,null,null], DatanodeInfoWithStorage[172.29.27.91:50010,null,null]], newGenerationStamp=7619002, newBlock=null, isStriped=false)
2023-08-11T01:56:29.586+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: changing replica state for blk_1081354547_7617444 from RBW to RUR
2023-08-11T01:56:29.586+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1081354547_7617444, recoveryId=7618941, replica=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
2023-08-11T01:56:29.586+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1081354547_7617444, recoveryId=7618941, replica=ReplicaBeingWritten, blk_1081354547_7617444, RBW
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= 2851866
getVolume() = /hadoop4/hdfs/data
getBlockURI() = file:/hadoop4/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
bytesAcked=2851866
bytesOnDisk=2851866
2023-08-11T01:56:29.585+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockRecoveryWorker: NameNode at ics029027083.ics-eu-2.asml.com/172.29.27.83:8020 calls recoverBlock(BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, targets=[DatanodeInfoWithStorage[172.29.27.95:50010,null,null], DatanodeInfoWithStorage[172.29.226.185:50010,null,null], DatanodeInfoWithStorage[172.29.27.91:50010,null,null]], newGenerationStamp=7618941, newBlock=null, isStriped=false)
2023-08-11T01:55:59.759+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.29.27.95:50010 remote=/172.29.27.69:33920]
2023-08-11T01:55:59.759+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[172.29.226.185:50010, 172.29.27.91:50010] terminating
2023-08-11T01:55:59.759+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[172.29.226.185:50010, 172.29.27.91:50010]: Thread is interrupted.
2023-08-11T01:55:59.758+02:00 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-153760920-172.29.27.83-1654853018310:blk_1081354547_7617444
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.29.27.95:50010 remote=/172.29.27.69:33920]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:211)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:528)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:971)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:891)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
at java.lang.Thread.run(Thread.java:748)
Datanode Logs: ics029027091.ics-eu-2.asml.com/172.29.27.91
2023-08-11T02:57:52.972+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: changing replica state for blk_1081354547_7617444 from RWR to RUR
2023-08-11T02:57:52.971+02:00 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1081354547_7617444, recoveryId=7619002, replica=ReplicaWaitingToBeRecovered, blk_1081354547_7617444, RWR
getNumBytes() = 2851866
getBytesOnDisk() = 2851866
getVisibleLength()= -1
getVolume() = /hadoop1/hdfs/data
getBlockURI() = file:/hadoop1/hdfs/data/current/BP-153760920-172.29.27.83-1654853018310/current/rbw/blk_1081354547
172.29.27.95 -> Zone2 -> Up
172.29.226.185 -> Zone1 -> Down
172.29.27.91 -> Zone1 -> Down
Topology mapping data script as below -
09-27-2023
05:49 AM
Hi @willx, please find below the HBase Splunk query for the MasterProcWALs and the NameNode logs for the corresponding pv-****.log.
09-26-2023
07:11 AM
Hi @willx Thank you for the reply. 8010 is the datanode IPC address in my cluster; please check the snapshot below. But the question still stands: why does it keep connecting to the down datanode (ics168226185.ics-eu-2.example.com) for 16 minutes?
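For completeness, the port can be confirmed from the client configuration (a quick check, not a fix):
```
# dfs.datanode.ipc.address is the datanode IPC endpoint (8010 in this cluster)
hdfs getconf -confKey dfs.datanode.ipc.address
```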
09-13-2023
09:40 PM
Hello all, I am facing the issue below; can you please help me address it? I have a cluster of 16 datanodes divided into 2 zones, Zone1 and Zone2, running HDP 3.1 and HBase 2.0.2. Zone1 has datanodes 192.168.226.185 and 192.168.27.91, and Zone2 has datanode 192.168.27.95. The active HBase master runs in Zone1 and the standby in Zone2. Our rack topology file is as below:
[network_topology]
ics168226185.ics-eu-2.example.com=/1
192.168.226.185=/1
ics168027091.ics-eu-2.example.com=/1
192.168.27.91=/1
ics168027095.ics-eu-2.example.com=/2
192.168.27.95=/2
We performed a disaster-recovery test in which we intentionally took the network down for Zone1. Since the active HBase master in Zone1 was down, the Zone2 HBase master tried to take over, but it took more than 16 minutes for HBase to come up. Below is the error seen in the HBase master logs:
INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Failed to recover lease, attempt=6 on file=hdfs://example-test-cluster/apps/hbase/data/MasterProcWALs/pv2-00000000000000010950.log after 965307ms
We tried to check why the lease was being held and saw the following in the NameNode logs:
WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /apps/hbase/data/MasterProcWALs/pv2-00000000000000010950.log has not been closed. Lease recovery is in progress. RecoveryId=7619028 for block blk_1081354547_7617444
We then checked the datanode logs for the respective block id blk_1081354547_7617444:
WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to recover block (block=BP-153760920-192.168.27.83-1654853018310:blk_1081354547_7617444, datanode=DatanodeInfoWithStorage[192.168.226.185:50010,null,null])
org.apache.hadoop.net.ConnectTimeoutException: Call From ics168027095.ics-eu-2.example.com/192.168.27.95 to ics168226185.ics-eu-2.example.com:8010 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ics168226185.ics-eu-2.example.com/192.168.226.185:8010]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
It seems that blk_1081354547_7617444 has 3 replicas, stored on 192.168.27.95, 192.168.226.185, and 192.168.27.91. Why does the datanode in the active Zone2 (192.168.27.95) try to reach the down Zone1 datanode (192.168.226.185) to recover the block? Also, the datanode logs show the block recovery reaching FINALIZED only after about an hour, so why does the HBase lease recovery itself take 18-20 minutes? I have noticed a pattern: the issue only appears when Zone1 goes down and the 2 replica copies on the Zone1 datanodes become unavailable; HBase lease recovery then takes more than 16 minutes. Can you guide me on how to debug/resolve this issue?
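For reference, here is a minimal sketch of the commands I can use to inspect the block and explicitly trigger lease recovery on the WAL file (the path is the one from the logs above; it needs to be run as the file owner or an HDFS superuser):
```
# Show replica locations and state of the still-open WAL file
hdfs fsck /apps/hbase/data/MasterProcWALs/pv2-00000000000000010950.log -files -blocks -locations -openforwrite

# Ask the NameNode to start lease recovery immediately instead of waiting for the hard limit
hdfs debug recoverLease -path /apps/hbase/data/MasterProcWALs/pv2-00000000000000010950.log -retries 5
```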
Labels:
- Apache HBase
08-27-2022
03:01 PM
Hi Team, we are using JMeter to submit jobs (1300/hr) to HBase/Phoenix on HDP 3.1.4 with Phoenix 5.0. Jobs start failing with the error below:
2022-08-25 16:21:44,785 INFO org.apache.phoenix.iterate.BaseResultIterators: Failed to execute task during cancel
java.util.concurrent.ExecutionException: org.apache.phoenix.exception.PhoenixIOException: org.apache.hadoop.hbase.exceptions.ScannerResetException: Scanner is closed on the server-side
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3468)
  at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42002)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row out of range for Get on HRegion OBST:DOCUMENT_METADATA,\x0C\x00\x00\x00,1659594973530.146ed04497483dae508d10d1e2676a12., startKey='\x0C\x00\x00\x00', getEndKey()='\x0CADELMWSQRP\x004bcdbe31987c05d9e88cba377df31f3bbaae274d7df670ed26690fb021c90f5b\x00PERSISTENT', row='\x0CADELSRD\x009bb7104f2f156cec8ecb0e53f95b72affa43969125732ab898c96282356999f7\x00PERSISTENT'
  at org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:5713)
  at org.apache.hadoop.hbase.regionserver.HRegion.prepareGet(HRegion.java:7297)
  at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7290)
  at org.apache.phoenix.util.IndexUtil.wrapResultUsingOffset(IndexUtil.java:514)
  at org.apache.phoenix.iterate.RegionScannerFactory$1.nextRaw(RegionScannerFactory.java:197)
  at org.apache.phoenix.coprocessor.DelegateRegionScanner.nextRaw(DelegateRegionScanner.java:77)
  at org.apache.phoenix.coprocessor.DelegateRegionScanner.nextRaw(DelegateRegionScanner.java:77)
  at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:274)
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3136)
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3385)
  ... 5 more
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:192)
  at org.apache.phoenix.iterate.BaseResultIterators.close(BaseResultIterators.java:1439)
  at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:1352)
  at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:1239)
  at org.apache.phoenix.iterate.MergeSortResultIterator.getMinHeap(MergeSortResultIterator.java:72)
  at org.apache.phoenix.iterate.MergeSortResultIterator.minIterator(MergeSortResultIterator.java:93)
  at org.apache.phoenix.iterate.MergeSortResultIterator.next(MergeSortResultIterator.java:58)
  at org.apache.phoenix.iterate.DelegateResultIterator.next(DelegateResultIterator.java:44)
  at org.apache.phoenix.iterate.LimitingResultIterator.next(LimitingResultIterator.java:47)
  at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:805)
  at org.apache.calcite.avatica.jdbc.JdbcResultSet.frame(JdbcResultSet.java:148)
  at org.apache.calcite.avatica.jdbc.JdbcResultSet.create(JdbcResultSet.java:101)
  at org.apache.calcite.avatica.jdbc.JdbcMeta.execute(JdbcMeta.java:887)
  at org.apache.calcite.avatica.remote.LocalService.apply(LocalService.java:254)
  at org.apache.calcite.avatica.remote.Service$ExecuteRequest.accept(Service.java:1032)
  at org.apache.calcite.avatica.remote.Service$ExecuteRequest.accept(Service.java:1002)
  at org.apache.calcite.avatica.remote.AbstractHandler.apply(AbstractHandler.java:94)
  at org.apache.calcite.avatica.remote.ProtobufHandler.apply(ProtobufHandler.java:46)
  at org.apache.calcite.avatica.server.AvaticaProtobufHandler.handle(AvaticaProtobufHandler.java:127)
  at org.apache.phoenix.shaded.org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
  at org.apache.phoenix.shaded.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
  at org.apache.phoenix.shaded.org.eclipse.jetty.server.Server.handle(Server.java:539)
  at org.apache.phoenix.shaded.org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
  at org.apache.phoenix.shaded.org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
  at org.apache.phoenix.shaded.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
  at org.apache.phoenix.shaded.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
  at org.apache.phoenix.shaded.org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
  at org.apache.phoenix.shaded.org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
  at org.apache.phoenix.shaded.org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
  at org.apache.phoenix.shaded.org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
  at org.apache.phoenix.shaded.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
  at org.apache.phoenix.shaded.org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.phoenix.exception.PhoenixIOException: org.apache.hadoop.hbase.exceptions.ScannerResetException: Scanner is closed on the server-side
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3468)
  at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42002)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row out of range for Get on HRegion OBST:DOCUMENT_METADATA,\x0C\x00\x00\x00,1659594973530.146ed04497483dae508d10d1e2676a12., startKey='\x0C\x00\x00\x00', getEndKey()='\x0CADELMWSQRP\x004bcdbe31987c05d9e88cba377df31f3bbaae274d7df670ed26690fb021c90f5b\x00PERSISTENT', row='\x0CADELSRD\x009bb7104f2f156cec8ecb0e53f95b72affa43969125732ab898c96282356999f7\x00PERSISTENT'
At the same time we compared "select count(*)" with and without the index, and the counts differ as shown below. NOTE: the output below is from a test cluster where we were able to reproduce the issue, so the view name in the screenshot may differ. For the "WrongRegionException: Requested row out of range for Get on HRegion" we suspect the Apache bug https://issues.apache.org/jira/browse/PHOENIX-3828, and for the "select count(*)" mismatch we suspect we are hitting [PHOENIX-6090] Local indexes get out of sync after changes for global consistent indexes - ASF JIRA (apache.org). Can someone help with debugging steps?
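In case it helps, here is a rough sketch of how I compare counts through the index versus the data table and rebuild the local index from sqlline (the schema, table, index, and ZooKeeper quorum names are placeholders, not our real ones; adjust the client path for your install):
```
# Connect with the Phoenix client (typical HDP 3.x location) and run the checks
/usr/hdp/current/phoenix-client/bin/sqlline.py zk-host:2181:/hbase-unsecure <<'SQL'
-- Count through the local index vs. forcing a full scan of the data table
SELECT /*+ INDEX(MY_SCHEMA.MY_TABLE MY_LOCAL_IDX) */ COUNT(*) FROM MY_SCHEMA.MY_TABLE;
SELECT /*+ NO_INDEX */ COUNT(*) FROM MY_SCHEMA.MY_TABLE;
-- If the counts differ, rebuilding the local index is one way to bring it back in sync
ALTER INDEX MY_LOCAL_IDX ON MY_SCHEMA.MY_TABLE REBUILD;
SQL
```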
Labels:
- Apache Phoenix
12-18-2019
10:37 PM
Hi @Daggers Please feel free to select the best answer to close the thread if your questions have been answered. Thanks
12-18-2019
07:44 PM
Hi @Daggers You can write a simple script that uses the YARN REST API to fetch only completed applications (per month or day) and then copy only those applications' logs from HDFS to local storage; a sketch is below. Please check this link: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
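A minimal sketch of such a script, assuming the default ResourceManager port 8088 and the /app-logs layout from my other reply; the RM host name, time window, and the jq dependency are placeholders/assumptions:
```
# List applications that finished in a given window (epoch millis) via the RM REST API
RM="http://rm-host:8088"
curl -s "${RM}/ws/v1/cluster/apps?states=FINISHED&finishedTimeBegin=1576368000000&finishedTimeEnd=1576454400000" \
  | jq -r '.apps.app[].id' \
  | while read app; do
      # Copy only that application's aggregated logs from HDFS to local disk
      hdfs dfs -copyToLocal "/app-logs/*/logs-ifile/${app}" "/tmp/yarn-logs/${app}"
    done
```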
12-15-2019
10:52 PM
@Daggers You can also check the HDFS NFS Gateway, which allows the HDFS filesystem to be mounted on the local OS and exposed via NFS: https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
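The documented mount looks roughly like this (the gateway host name and mount point are placeholders; the options are the ones from the HdfsNfsGateway page):
```
# Mount HDFS through the NFS gateway node and browse it like a local filesystem
sudo mkdir -p /hdfs_mount
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync nfs-gateway-host:/ /hdfs_mount
ls /hdfs_mount
```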
12-15-2019
07:40 PM
1 Kudo
Hi @Daggers I think you can try this:
1. The properties below decide the path for storing YARN logs in HDFS. Below is a sample from my cluster:
yarn.nodemanager.remote-app-log-dir = /app-logs
yarn.nodemanager.remote-app-log-dir-suffix = logs-ifile
2. You can run "hadoop dfs -copyToLocal" on the above path, which copies the application logs to local storage, and then pass them to Splunk (see the sketch below). Do you think that can work for you? Let me know if you have more questions.
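For example (the user, application id, and local target directory below are placeholders):
```
# List aggregated logs for one user, then copy a single application's logs to local disk
hdfs dfs -ls /app-logs/<user>/logs-ifile/
hdfs dfs -copyToLocal /app-logs/<user>/logs-ifile/application_1576000000000_0001 /tmp/yarn-logs/
# Alternatively, render one application's logs to a text file that Splunk can ingest
yarn logs -applicationId application_1576000000000_0001 > /tmp/yarn-logs/application_1576000000000_0001.txt
```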