Created 01-31-2018 09:07 AM
I need advice on why we get the error "Failed to replace a bad datanode on the existing pipeline due to no more good datanodes".
I also saw another question that discusses my problem: https://community.hortonworks.com/questions/27153/getting-ioexception-failed-to-replace-a-bad-datano.html
Log excerpt:
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[34.2.31.31:50010,DS-8234bb39-0fd4-49be-98ba-32080bc24fa9,DISK], DatanodeInfoWithStorage[34.2.31.33:50010,DS-b4758979-52a2-4238-99f0-1b5ec45a7e25,DISK]], original=[DatanodeInfoWithStorage[34.2.31.31:50010,DS-8234bb39-0fd4-49be-98ba-32080bc24fa9,DISK], DatanodeInfoWithStorage[34.2.31.33:50010,DS-b4758979-52a2-4238-99f0-1b5ec45a7e25,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1036)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1110)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1268)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:993)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:500)
---2018-01-30T15:15:15.015 INFO [][][] [dal.locations.LocationsDataFramesHandler]
Created 01-31-2018 11:52 AM
The two properties 'dfs.client.block.write.replace-datanode-on-failure.policy' and 'dfs.client.block.write.replace-datanode-on-failure.enable' influence the client-side behavior for pipeline recovery, and they can be added as custom properties in the "hdfs-site" configuration.
Continuous network issues or repeated packet drops can lead to such errors. This especially happens when data is being written to one of the DataNodes in the pipeline and that node, while forwarding the data to the next DataNode, hits a communication problem that causes the pipeline to fail. It can also happen when the HDFS client hangs or observes connection timeouts due to memory contention, a small heap size, or ulimits.
.
So please check that your DataNodes are healthy and that there are no network packet drops or communication issues.
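If you decide to relax the replacement policy for a specific client write, one option (a sketch, assuming a hypothetical source file and target path) is to pass the two properties as -D options to the HDFS client command, which overrides the values from hdfs-site.xml for that invocation only:
# hdfs dfs -Ddfs.client.block.write.replace-datanode-on-failure.enable=true \
           -Ddfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
           -put /tmp/largefile.dat /user/hdfs/
Keep in mind that NEVER disables adding a replacement DataNode to the pipeline entirely, which is generally only reasonable on very small clusters (roughly three DataNodes or fewer).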
Created 01-31-2018 12:02 PM
The following article can help in understanding more about these properties:
Created 01-31-2018 12:05 PM
@Jay please advise, what is the best way to check whether the DataNodes are healthy?
Created 01-31-2018 12:11 PM
Maybe we can try running the HDFS service checks from the Ambari UI.
Checking the DataNode logs can also give us some idea, for example whether they are suffering from memory limitations or whether there are repeated errors. We can check the DataNode memory utilization to see whether they have enough heap and how much is currently being used:
# $JAVA_HOME/bin/jmap -heap $DATANODE_PID
- We can also check whether the DataNode ports are accessible from other nodes and whether there is any communication issue. From one DataNode host, please check whether you can connect to another DataNode's port:
# telnet $DATANODE_HOSTNAME $DATANODE_PORT
.
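For a quick scan of the DataNode logs for repeated errors, something like the following can help (the log path is an assumption based on a typical HDP layout; adjust it to your installation):
# grep -iE "exception|error|timed out" /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | tail -50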
Created 01-31-2018 12:17 PM
From hdfs dfsadmin -report, is there anything we should do about the "Missing blocks" count? We got:
hdfs dfsadmin -report
Configured Capacity: 8226130288640 (7.48 TB)
Present Capacity: 8225508617182 (7.48 TB)
DFS Remaining: 8205858544606 (7.46 TB)
DFS Used: 19650072576 (18.30 GB)
DFS Used%: 0.24%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 4
Missing blocks (with replication factor 1): 0
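(For reference, one way to see which files the missing blocks belong to is to run fsck; the directory in the second command is just a placeholder:)
# hdfs fsck / -list-corruptfileblocks
# hdfs fsck /path/to/suspect/dir -files -blocks -locations
The first command lists files that have missing or corrupt blocks; the second shows the block details and locations for a specific path.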
Created 01-31-2018 12:25 PM
The dfsadmin report might not be very helpful here.
Regarding the DataNode PID, we can do any of the following to find it:
# ps -ef | grep DataNode
(OR)
# cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid
.
Also, if you want to list the ports used by the DataNode, you can run the following command:
# netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid`
(OR)
# netstat -tnlpa | grep $DATANODE_PID
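To confirm which ports your DataNodes are actually configured to use (50010 is the default data-transfer port on HDP 2.x), you can also query the effective configuration, shown here as an additional check:
# hdfs getconf -confKey dfs.datanode.address
# hdfs getconf -confKey dfs.datanode.http.address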
Created 01-31-2018 12:56 PM
@Jay I do not see anything negative in the output, but maybe you want to add your opinion.
/usr/jdk64/jdk1.8.0_112/bin/jmap -heap 26765
Attaching to process ID 26765, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.112-b15

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 1073741824 (1024.0MB)
   NewSize                  = 209715200 (200.0MB)
   MaxNewSize               = 209715200 (200.0MB)
   OldSize                  = 864026624 (824.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 188743680 (180.0MB)
   used     = 13146000 (12.537002563476562MB)
   free     = 175597680 (167.46299743652344MB)
   6.9650014241536455% used
Eden Space:
   capacity = 167772160 (160.0MB)
   used     = 7374968 (7.033317565917969MB)
   free     = 160397192 (152.96668243408203MB)
   4.3958234786987305% used
From Space:
   capacity = 20971520 (20.0MB)
   used     = 5771032 (5.503684997558594MB)
   free     = 15200488 (14.496315002441406MB)
   27.51842498779297% used
To Space:
   capacity = 20971520 (20.0MB)
   used     = 0 (0.0MB)
   free     = 20971520 (20.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 864026624 (824.0MB)
   used     = 25506528 (24.324920654296875MB)
   free     = 838520096 (799.6750793457031MB)
   2.952053477463213% used
Created 01-24-2019 01:47 PM
Once the file is corrupted, you cannot recover it even after setting dfs.client.block.write.replace-datanode-on-failure.policy=NEVER and restarting HDFS. As a workaround, I created a copy of the file and removed the old one.
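A minimal sketch of that workaround (the path below is just a placeholder) would be:
# hdfs dfs -cp /data/affected_file /data/affected_file.copy
# hdfs dfs -rm /data/affected_file
# hdfs dfs -mv /data/affected_file.copy /data/affected_file
The copy only succeeds if the surviving replicas of the file are still readable.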