Created on 10-22-2013 09:24 AM - edited 09-16-2022 01:49 AM
I have a small 3-node cluster and am experiencing total failure when running Reduce jobs. I searched through the syslog and found errors pointing to this variable:
dfs.client.block.write.replace-datanode-on-failure.enable
If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy
How can I set this variable from the CDH4 Cloudera Manager interface? Or do I need to manually edit an XML file?
Created 10-24-2013 05:16 AM
Answer:
For my cluster, I directly modified the /etc/hadoop/conf/hdfs-site.xml file on all 4 nodes, including the namenode and the datanodes.
I am able to locate other dfs.client variables in Cloudera Manager:
host:7180/cmf/services/19/config
But the variable that I added manually does not show up in Cloudera Manager as far as I can see.
Another variable to set in conjunction with this is:
dfs.client.block.write.replace-datanode-on-failure.policy
See http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Created 10-24-2013 10:56 AM
Hey Ben,
In a CM-managed cluster, CM will take care of managing and deploying configurations for you (including setting custom options like this). Manually editing the config files is brittle, since CM might push new configs on top of them. To add custom options, search for "safety valve" in the configuration editor, and you can paste in XML directly.
Note also that these two properties are specific to the client, not the datanode or NN, so you probably want to drop this into the "HDFS Client Configuration Safety Valve for hdfs-site.xml" box.
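For example, a snippet along these lines could be pasted into that box (just a sketch; the false/NEVER values shown here are one way to switch the replacement feature off entirely, per the hdfs-default.xml descriptions quoted in this thread):

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>false</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>

After saving, run CM's "Deploy Client Configuration" action so the change actually lands in /etc/hadoop/conf/hdfs-site.xml on the hosts.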
Best,
Andrew
Created 10-24-2013 11:33 AM
Andrew, thanks, I am currently working on purchasing Cloudera Enterprise licenses so we can get official support. I will look for the area you are indicating.
For anyone else reading the thread, the reason we needed to set these variables is that MapReduce jobs (run via Pig scripts) were failing at the Reduce phase. Based on the system logs, I was able to trace the failures to these variables. In the Hadoop documentation, you can see that a 3-node cluster is considered "extremely" small, and that is exactly what we are running: 1 namenode and 3 datanodes. Also, replication is set to 3, meaning all datanodes must be operational for HDFS to be healthy. In retrospect, with such a small cluster we would decrease replication to 2.
To be clear, we had to set both variables, even though the documentation suggests that changing only the first ("enable") property should be enough. In fact, we found that our problem was not fixed until we also set "policy" to "NEVER".
dfs.client.block.write.replace-datanode-on-failure.enable (default: true)
If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.

dfs.client.block.write.replace-datanode-on-failure.policy (default: DEFAULT)
This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: Let r be the replication number. Let n be the number of existing datanodes. Add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n; or (2) r is greater than n and the block is hflushed/appended.
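As a worked example of the DEFAULT policy using our numbers (our own arithmetic, not from the docs): with replication r = 3, losing one of our 3 datanodes leaves n = 2 in the pipeline. Condition (1) fails, since floor(3/2) = 1 is less than 2, but condition (2) holds for an hflushed/appended block, since r = 3 is greater than n = 2. So the client tries to add a replacement datanode, and on a 3-node cluster with one node down there is no spare node to add, which is exactly the pipeline failure we were seeing.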
Created 12-09-2013 10:41 AM
Hi, which property did you modify? Should it be cluster-wide, or related only to the client?
Created 03-24-2014 12:49 AM
Hi, I inserted some XML into the "HDFS Service Configuration Safety Valve for hdfs-site.xml" and deployed the client configuration from the CM UI, but it did not seem to work: when I opened /etc/hadoop/conf.cloudera.hdfs1/hdfs-site.xml, the file had been updated, but what I inserted was not there. Should I just add these parameters manually, or should whatever I put in the "HDFS Service Configuration Safety Valve for hdfs-site.xml" appear in that file?
Created 03-24-2014 01:02 AM
To add configuration snippets for a client config, the right field to use is the "HDFS Client Configuration Safety Valve for hdfs-site.xml", not the "Service" one, which only applies to daemons.
Created 06-03-2019 07:21 PM
Hi all,
Can anyone help me find where to set the dfs.client.block.write.replace-datanode-on-failure.enable parameter in Cloudera Manager?
I have searched in hdfs-site.xml, but I could not find these values.
Created 06-03-2019 07:33 PM