When should block replication be set to 1?

We get the following in the Spark logs:

java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage DatanodeInfoWithStorage\
The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1036) 

My Ambari cluster includes only 3 worker machines, and each worker has only one data disk.

I searched on Google and found that the solution may be:

Block replication needs to be set to 1 instead of 3 (HDFS)

Is that true?

Second, since each worker machine has only one data disk, could that be part of the problem?

Block replication = the total number of copies of each file kept in the file system, as specified by the dfs.replication setting. dfs.replication=1 means there will be only one copy of each file in the file system.

Michael-Bronson
1 ACCEPTED SOLUTION

1. Block replication is for redundancy of data, which ensures data is not lost when a disk goes bad or a node goes down.
2. A replication factor of 1 is used when the data can be recreated at any point in time, so losing it is not critical. Take a job chain: the output of one job is consumed by the others, and eventually all intermediate data needs to be deleted. That intermediate data can be marked for a replication factor of 1 (though it is still good to have 2).
3. A replication factor of 1 means the cluster is not fault tolerant.
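
To put rough numbers on the trade-off in the points above, here is a minimal sketch. The function name and the 10 TB figure are made up for illustration, and it only assumes that each block is stored `replication` times on distinct datanodes.

```python
def replication_tradeoff(data_tb: float, replication: int) -> dict:
    """Rough cost/benefit of an HDFS replication factor (illustrative only)."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return {
        # Every block is stored `replication` times, so raw usage scales linearly.
        "raw_storage_tb": data_tb * replication,
        # Data stays readable as long as one copy survives, so up to
        # replication - 1 datanodes holding a block can be lost.
        "node_failures_tolerated": replication - 1,
    }


# Hypothetical 10 TB of data at each replication factor.
for rf in (1, 2, 3):
    print(rf, replication_tradeoff(10, rf))
```

With RF=1 no node failure is tolerated, while RF=2 doubles the raw storage in exchange for surviving the loss of one node.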

In your case you have 3 worker nodes. An RF of 1 means that if a worker goes bad, you lose data and it can't be processed.
I suggest using RF=2 if you are concerned about space utilization.
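
On the error in the question itself: as I understand the description of the DEFAULT value of dfs.client.block.write.replace-datanode-on-failure.policy in hdfs-default.xml, the client asks for a replacement datanode only when the replication factor r is 3 or more and either floor(r/2) >= n, or r > n and the block was hflushed/appended, where n is the number of datanodes still in the write pipeline. The sketch below just encodes that rule. The function name is hypothetical and this is not the actual Hadoop code, so please check it against your Hadoop version.

```python
def default_policy_wants_replacement(r: int, n: int, hflushed_or_appended: bool) -> bool:
    """Sketch of the DEFAULT replace-datanode-on-failure rule (assumed from hdfs-default.xml).

    r: replication factor of the file being written
    n: datanodes still alive in the write pipeline
    """
    if r < 3:
        # Below RF=3 the DEFAULT policy never asks for a replacement datanode.
        return False
    return (r // 2 >= n) or (r > n and hflushed_or_appended)


# 3-node cluster, RF=3, one datanode drops out of the pipeline (n=2) while
# the writer uses hflush: a replacement is requested, but no spare node exists.
print(default_policy_wants_replacement(3, 2, True))   # True

# Same failure with RF=2: no replacement is requested, the write continues.
print(default_policy_wants_replacement(2, 1, True))   # False
```

If this reading is right, a cluster with only 3 datanodes and RF=3 has no spare node to swap in when one fails during a write, which would explain the exception. The single data disk per worker matters only in that a failed disk takes the whole datanode out.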
