When should block replication be set to 1?


We get the following error in the Spark logs:

Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage DatanodeInfoWithStorage\
The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode( 

My Ambari cluster has only 3 worker machines, and each worker has only one data disk.

I searched on Google and found that a possible solution is:

Block replication needs to be set to 1 instead of 3 (HDFS).

Is that true?

Second, could the fact that each worker machine has only one data disk be part of the problem?

Block replication is the number of copies of each block (and therefore of each file's data) that HDFS keeps, as specified by the dfs.replication setting. dfs.replication=1 means there will be only one copy of the file in the file system.
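To make the trade-off concrete, here is a rough Python sketch of the arithmetic behind dfs.replication. The function name and the 100 GB figure are made up for illustration, and it assumes each replica of a block is placed on a different DataNode, so the number of copies cannot exceed the number of DataNodes.

```python
def replication_summary(logical_gb, replication, datanodes):
    """Rough storage and fault-tolerance figures for a given dfs.replication.

    Hypothetical helper for illustration only; it is not part of Hadoop.
    """
    if replication < 1 or datanodes < 1:
        raise ValueError("replication and datanodes must be at least 1")
    # Assumption: one replica per DataNode at most, so a small cluster
    # caps the number of copies that can actually be stored.
    copies = min(replication, datanodes)
    return {
        "copies_per_block": copies,
        "raw_gb_used": logical_gb * copies,
        # With N copies on N different nodes, N - 1 nodes can be lost
        # before the last copy of a block disappears.
        "node_failures_survivable": copies - 1,
    }


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(rf, replication_summary(100, rf, datanodes=3))
```

With 100 GB of data on 3 workers, RF=1 uses the least disk but survives no node loss, while RF=2 doubles the disk usage and survives the loss of one worker.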



1. Block replication is for redundancy of data; it ensures data is not lost when a disk goes bad or a node goes down.
2. A replication factor of 1 is used when the data can be recreated at any point in time, so losing it is not critical. An example is a job chain, where the output of one job is consumed by the next and all intermediate data is eventually deleted. The intermediate data can be marked for a replication of 1 (though it is still good to have 2).
3. A replication factor of 1 does not make the cluster fault tolerant: every block exists on exactly one node.

In your case you have 3 worker nodes. An RF of 1 means that if a worker goes bad, you lose the data on it and it can't be processed.
I suggest you use at least RF=2 if you are concerned about space utilization.
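As for why the error in the question shows up on a 3-worker cluster with RF=3, and why a lower RF makes it go away: the sketch below models the DEFAULT value of 'dfs.client.block.write.replace-datanode-on-failure.policy' as it is described in hdfs-default.xml (add a new datanode only if the replication r is at least 3 and either floor(r/2) >= n, or r > n and the block is hflushed/appended, where n is the number of datanodes still in the pipeline). The function names are hypothetical and this is a simplified model, not the actual client code.

```python
def should_replace_datanode(replication, remaining, hflushed_or_appended,
                            policy="DEFAULT"):
    """Would the client try to add a new datanode to the write pipeline?

    Simplified model of the documented policy values; not Hadoop code.
    """
    if policy == "NEVER":
        return False
    if policy == "ALWAYS":
        return True
    if policy != "DEFAULT":
        raise ValueError("unknown policy: " + policy)
    if replication < 3:
        # With RF 1 or 2 the DEFAULT policy never asks for a replacement.
        return False
    return (replication // 2 >= remaining
            or (replication > remaining and hflushed_or_appended))


def write_outcome(cluster_datanodes, replication, hflushed_or_appended):
    """Outcome when one datanode in a full pipeline fails during a write."""
    pipeline = min(replication, cluster_datanodes)
    remaining = pipeline - 1
    # Nodes outside the original pipeline are the only possible replacements.
    spare = cluster_datanodes - pipeline
    if remaining == 0:
        return "write fails: no datanode left in the pipeline"
    if should_replace_datanode(replication, remaining, hflushed_or_appended):
        if spare == 0:
            return "error: no more good datanodes being available to try"
        return "failed datanode replaced"
    return "write continues on the remaining datanodes"


if __name__ == "__main__":
    print(write_outcome(3, 3, hflushed_or_appended=True))
    print(write_outcome(3, 2, hflushed_or_appended=True))
```

Under these assumptions, a 3-node cluster with RF=3 has no spare datanode to swap in, so a single pipeline failure on an hflushed or appended block ends in the error from the logs, while RF=2 simply keeps writing to the surviving node. The single data disk per worker does not enter into this condition.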
