
Failed to replace a bad datanode on the existing pipeline due to no more good datanodes


I need advice: why do we get the error "Failed to replace a bad datanode on the existing pipeline due to no more good datanodes"?

I also saw another question that talks about my problem: https://community.hortonworks.com/questions/27153/getting-ioexception-failed-to-replace-a-bad-datano.html

Log description:

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[34.2.31.31:50010,DS-8234bb39-0fd4-49be-98ba-32080bc24fa9,DISK], DatanodeInfoWithStorage[34.2.31.33:50010,DS-b4758979-52a2-4238-99f0-1b5ec45a7e25,DISK]], original=[DatanodeInfoWithStorage[34.2.31.31:50010,DS-8234bb39-0fd4-49be-98ba-32080bc24fa9,DISK], DatanodeInfoWithStorage[34.2.31.33:50010,DS-b4758979-52a2-4238-99f0-1b5ec45a7e25,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1036)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1110)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1268)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:993)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:500)
---2018-01-30T15:15:15.015 INFO  [][][] [dal.locations.LocationsDataFramesHandler] 
Michael-Bronson
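
As a quick check, hdfs getconf (a standard client command) prints the value that the client-side configuration currently carries for the property named in the exception above:

# hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.policy
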
1 ACCEPTED SOLUTION

Master Mentor

@Michael Bronson


The two properties 'dfs.client.block.write.replace-datanode-on-failure.policy' and 'dfs.client.block.write.replace-datanode-on-failure.enable' influence the client-side behavior for pipeline recovery, and these properties can be added as custom properties in the "hdfs-site" configuration.

Continuous network issues or repeated packet drops can lead to such errors. This especially happens when data is being written to one of the DataNodes that is in the process of pipelining the data to the next DataNode, and a communication issue causes the pipeline to fail. It can also happen when the HDFS client hangs or observes connection timeouts due to memory contention, a small heap size, or ulimits.

So please check that your DataNodes are healthy and that there are no network packet drops or communication issues.
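
A minimal sketch of such a client-side override, assuming the hdfs client is on the PATH; the local file and target path below are placeholders, and NEVER is used only to illustrate one of the allowed policy values (ALWAYS, DEFAULT, NEVER):

# hdfs dfs -D dfs.client.block.write.replace-datanode-on-failure.enable=true \
          -D dfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
          -put localfile.txt /tmp/testfile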


9 REPLIES


Master Mentor

@Michael Bronson

The following article can help in understanding more about these properties:

https://community.hortonworks.com/articles/16144/write-or-append-failures-in-very-small-clusters-un....


@Jay please advise, what is the best way to check whether the DataNodes are healthy?

Michael-Bronson

Master Mentor

@Michael Bronson

Maybe we can try running the HDFS service check from the Ambari UI.

Checking the DataNode logs can also give us some idea, for example whether they are suffering from memory limitations or whether there are repeated errors. We can check the DataNode memory utilization to see if they have enough memory and how much is currently being used:

# $JAVA_HOME/bin/jmap -heap $DATANODE_PID

We can also check whether the DataNode ports are accessible from other nodes and whether there is any communication issue. From one DataNode host, please check if we can connect to another DataNode's port:

# telnet $DATANODE_HOSTNAME   $DATANODE_PORT
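
A quick reachability sketch along the same lines; the hostnames are placeholders, 50010 is the default data-transfer port seen in the exception above, and nc is assumed to be installed:

# for h in dn01.example.com dn02.example.com dn03.example.com; do nc -z -w 5 "$h" 50010 && echo "$h: 50010 reachable" || echo "$h: 50010 NOT reachable"; done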


From hdfs dfsadmin -report, what can we do about the missing blocks (anything to do regarding that?)

We got:

hdfs dfsadmin -report
Configured Capacity: 8226130288640 (7.48 TB)
Present Capacity: 8225508617182 (7.48 TB)
DFS Remaining: 8205858544606 (7.46 TB)
DFS Used: 19650072576 (18.30 GB)
DFS Used%: 0.24%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 4
Missing blocks (with replication factor 1): 0
Michael-Bronson
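
For what it's worth, the files behind the missing and corrupt block counters above can be listed with fsck; the paths below are placeholders:

# hdfs fsck / -list-corruptfileblocks
# hdfs fsck /path/to/one/reported/file -files -blocks -locations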

Regarding the DATANODE_PID, how do I find it? (I guess from the worker machine?)
Michael-Bronson

Master Mentor

@Michael Bronson

The dfsadmin report might not be very helpful here.

Regarding the DataNode PID, we can do any of the following to find it:

# ps -ef | grep DataNode
(OR)
# cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid 

Also, if you want to list the ports used by the DataNode, you can run one of the following commands:

# netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid`
(OR)
# netstat -tnlpa | grep $DATANODE_PID

@Jay I do not see anything negative in the output, or maybe you want to add your opinion:


/usr/jdk64/jdk1.8.0_112/bin/jmap -heap 26765
Attaching to process ID 26765, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.112-b15

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 1073741824 (1024.0MB)
   NewSize                  = 209715200 (200.0MB)
   MaxNewSize               = 209715200 (200.0MB)
   OldSize                  = 864026624 (824.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 188743680 (180.0MB)
   used     = 13146000 (12.537002563476562MB)
   free     = 175597680 (167.46299743652344MB)
   6.9650014241536455% used
Eden Space:
   capacity = 167772160 (160.0MB)
   used     = 7374968 (7.033317565917969MB)
   free     = 160397192 (152.96668243408203MB)
   4.3958234786987305% used
From Space:
   capacity = 20971520 (20.0MB)
   used     = 5771032 (5.503684997558594MB)
   free     = 15200488 (14.496315002441406MB)
   27.51842498779297% used
To Space:
   capacity = 20971520 (20.0MB)
   used     = 0 (0.0MB)
   free     = 20971520 (20.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 864026624 (824.0MB)
   used     = 25506528 (24.324920654296875MB)
   free     = 838520096 (799.6750793457031MB)
   2.952053477463213% used
Michael-Bronson

Contributor

Once the file is corrupted, you cannot recover from this even after setting dfs.client.block.write.replace-datanode-on-failure.policy=NEVER and restarting HDFS. As a workaround, I created a copy of the file and removed the old one.
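
A sketch of that workaround with standard HDFS shell commands; the path is hypothetical, and it is assumed that the remaining replica data can still be read:

# hdfs dfs -cp /data/app/events.log /data/app/events.log.copy
# hdfs dfs -rm /data/app/events.log
# hdfs dfs -mv /data/app/events.log.copy /data/app/events.log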