Failed to replace a bad datanode on the existing pipeline due to no more good datanodes
Labels: Apache Ambari, Apache Hadoop, Apache Spark
Created ‎01-31-2018 09:07 AM
We need advice on why we get the error "Failed to replace a bad datanode on the existing pipeline due to no more good datanodes".
I also saw another question that discusses my problem: https://community.hortonworks.com/questions/27153/getting-ioexception-failed-to-replace-a-bad-datano.html
Log description:
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
(Nodes: current=[DatanodeInfoWithStorage[34.2.31.31:50010,DS-8234bb39-0fd4-49be-98ba-32080bc24fa9,DISK], DatanodeInfoWithStorage[34.2.31.33:50010,DS-b4758979-52a2-4238-99f0-1b5ec45a7e25,DISK]],
 original=[DatanodeInfoWithStorage[34.2.31.31:50010,DS-8234bb39-0fd4-49be-98ba-32080bc24fa9,DISK], DatanodeInfoWithStorage[34.2.31.33:50010,DS-b4758979-52a2-4238-99f0-1b5ec45a7e25,DISK]]).
The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1036)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1110)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1268)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:993)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:500)
---2018-01-30T15:15:15.015 INFO [][][] [dal.locations.LocationsDataFramesHandler]
Created ‎01-31-2018 11:52 AM
The two properties 'dfs.client.block.write.replace-datanode-on-failure.policy' and 'dfs.client.block.write.replace-datanode-on-failure.enable' influence the client-side behavior for pipeline recovery, and they can be added as custom properties in the "hdfs-site" configuration.
Continuous network issues or repeated packet drops can lead to such errors. This especially happens when data is being written to one of the DataNodes that is in the process of pipelining the data to the next DataNode, and a communication issue causes the pipeline to fail. It can also happen when the HDFS client hangs or observes connection timeouts due to memory contention, a heap size that is too small, or ulimits.
So please check that your DataNodes are healthy and that there are no network packet drops or communication issues.
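If it helps, a quick way to confirm what the client currently resolves for these two settings is hdfs getconf; this is only a sketch, and the values returned will depend on your own configuration:
# hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.policy
# hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.enable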
Created ‎01-31-2018 12:02 PM
The following article can help in understanding more about these properties:
Created ‎01-31-2018 12:05 PM
@Jay please advise, what is the best way to check whether the DataNodes are healthy?
Created ‎01-31-2018 12:11 PM
Maybe we can try running some HDFS service checks from the Ambari UI.
Checking the DataNode logs can also give us some idea, for example whether they are suffering from memory limitations or whether there are repeated errors. We can check the DataNode memory utilization to see if they have enough memory and how much is currently being used:
# $JAVA_HOME/bin/jmap -heap $DATANODE_PID
- We can also check whether the DataNode ports are accessible from the other nodes and whether there is any communication issue. From one DataNode host, please check if we can connect to another DataNode's port:
# telnet $DATANODE_HOSTNAME $DATANODE_PORT
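If you have several DataNodes, a small loop can run the same connectivity check against all of them. This is only a sketch: the hostnames are placeholders for your own DataNodes, and port 50010 is taken from the log excerpt above.
# for h in dn1.example.com dn2.example.com dn3.example.com; do echo -n "$h: "; nc -z -w 5 $h 50010 && echo OK || echo FAILED; done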
Created ‎01-31-2018 12:17 PM
From hdfs dfsadmin -report, what can we do about the "Missing blocks" (is there anything to do regarding that?). We got:
# hdfs dfsadmin -report
Configured Capacity: 8226130288640 (7.48 TB)
Present Capacity: 8225508617182 (7.48 TB)
DFS Remaining: 8205858544606 (7.46 TB)
DFS Used: 19650072576 (18.30 GB)
DFS Used%: 0.24%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 4
Missing blocks (with replication factor 1): 0
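One way to find out which files those missing blocks belong to (a sketch only; the paths below are placeholders, and / simply scans the whole namespace) is hdfs fsck:
# hdfs fsck / -list-corruptfileblocks
# hdfs fsck /path/of/interest -files -blocks -locations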
Created ‎01-31-2018 12:25 PM
The dfsadmin report might not be very helpful here.
Regarding the DataNode PID, we can do either of the following to find it:
# ps -ef | grep DataNode
(OR)
# cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid
Also, if you want to list the ports used by the DataNode, you can run the following command:
# netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid`
(OR)
# netstat -tnlpa | grep $DATANODE_PID
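Putting the PID lookup together with the earlier heap check, for example (a sketch; it assumes the PID file path shown above exists on your installation):
# DATANODE_PID=`cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid`
# $JAVA_HOME/bin/jmap -heap $DATANODE_PID
# netstat -tnlpa | grep $DATANODE_PID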
Created ‎01-31-2018 12:56 PM
@Jay I do not see anything negative in the output, but maybe you want to add your opinion:
# /usr/jdk64/jdk1.8.0_112/bin/jmap -heap 26765
Attaching to process ID 26765, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.112-b15
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 1073741824 (1024.0MB)
   NewSize                  = 209715200 (200.0MB)
   MaxNewSize               = 209715200 (200.0MB)
   OldSize                  = 864026624 (824.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 188743680 (180.0MB)
   used     = 13146000 (12.537002563476562MB)
   free     = 175597680 (167.46299743652344MB)
   6.9650014241536455% used
Eden Space:
   capacity = 167772160 (160.0MB)
   used     = 7374968 (7.033317565917969MB)
   free     = 160397192 (152.96668243408203MB)
   4.3958234786987305% used
From Space:
   capacity = 20971520 (20.0MB)
   used     = 5771032 (5.503684997558594MB)
   free     = 15200488 (14.496315002441406MB)
   27.51842498779297% used
To Space:
   capacity = 20971520 (20.0MB)
   used     = 0 (0.0MB)
   free     = 20971520 (20.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 864026624 (824.0MB)
   used     = 25506528 (24.324920654296875MB)
   free     = 838520096 (799.6750793457031MB)
   2.952053477463213% used
Created ‎01-24-2019 01:47 PM
Once the file is corrupted, you cannot recover from this even after setting dfs.client.block.write.replace-datanode-on-failure.policy=NEVER and restarting HDFS. As a workaround, I created a copy of the file and removed the old one.
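For reference, a rough sketch of that copy-and-replace workaround using the standard hdfs dfs commands (the path below is only a placeholder for the affected file):
# hdfs dfs -cp /path/to/affected-file /path/to/affected-file.copy
# hdfs dfs -rm /path/to/affected-file
# hdfs dfs -mv /path/to/affected-file.copy /path/to/affected-file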
