Created on 02-10-2016 08:50 PM - edited 09-16-2022 01:34 AM
We recently had a customer who was doing stress testing on a very small cluster with just four Datanodes. Their tests included simulated crashes and sudden power-offs. They were getting recurring client errors of hdfs.DFSClient: Exception in createBlockOutputStream with various log messages including:
The reason for these errors has to do with:
As you know if you’ve read about the design of HDFS, when a block is opened for writing, a pipeline is created of r Datanodes (where r is the replication factor) to receive the replicas of the block. The Client sends the block data to the first Datanode, which sends it to the second, and so on. A block is only considered complete when all Datanodes in the pipeline have reported finishing writes of their replicas.
Replication through the pipeline of Datanodes happens in the background, and in parallel with Client writes, so completion of the block normally occurs almost immediately after the Client finishes writing the first replica.
However, if any of the Datanodes in the pipeline fail to successfully finish their writes, the Client attempts to recover the replication pipeline by finding a new available Datanode and replacing the failing Datanode. In Hadoop-2.6 and later (HDP-2.2 and later) this behavior is controlled by the following three configuration parameters in hdfs-site.xml. Please read how they work (from hdfs-default.xml):
<property> <name>dfs.client.block.write.replace-datanode-on-failure.enable</name> <value>true</value> <description> If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy </description> </property> <property> <name>dfs.client.block.write.replace-datanode-on-failure.policy</name> <value>DEFAULT</value> <description> This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: Let r be the replication number. Let n be the number of existing datanodes. Add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n; or (2) r is greater than n and the block is hflushed/appended. </description> </property> <property> <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name> <value>false</value> <description> This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. Best effort means that the client will try to replace a failed datanode in write pipeline (provided that the policy is satisfied), however, it continues the write operation in case that the datanode replacement also fails. Suppose the datanode replacement fails. false: An exception should be thrown so that the write will fail. true : The write should be resumed with the remaining datandoes. Note that setting this property to true allows writing to a pipeline with a smaller number of datanodes. As a result, it increases the probability of data loss. </description> </property>
(The third “best-effort” parameter was not available in older versions of HDFS prior to Hadoop-2.6/HDP-2.2.)
We see then that with the “DEFAULT” policy, if the number of available Datanodes falls below 3 (the default replication factor), ordinary writing can continue, but Appends or writes with Hflush will fail, with exceptions like those mentioned above. The reason for Appends being held to a stricter standard of replication is that appends require manipulating generation numbers of the replicas, and maintaining consistency of generation number across replicas requires a great deal of care.
Unfortunately, in cases where there are still 3 or 4 working Datanodes, failures can still occur on writes or appends, although it should be infrequent. Basically, a very busy cluster (due to concurrency, specifically both the number of blocks being operated on concurrently and the quantity of large packets being moved over IP channels) may experience write pipeline failures due to socket timeout or other naturally occurring exceptions. When the cluster is large enough, it is easy to find a datanode for replacement. But when a cluster has only 3 or 4 datanodes, it may be hard or impossible to pick a different node, and so the exception is raised.
In a small cluster, that is being used for HBase, or otherwise experiencing a lot of Appends and explicit Hflush calls, it may make sense to set 'dfs.client.block.write.replace-datanode-on-failure.best-effort' to true. This enables Appends to continue without experiencing the exceptions noted above, yet allows the system to repair on the fly as best it can. Notice that this is trading the cast-iron reliability of guaranteed 3-way replication for a lower level of surety, to avoid these errors.
Hortonworks also suggests that small clusters of less than 10 Datanodes are inappropriate for stress testing, because HDFS reliability depends in part on having plenty of Datanodes to swap between when fatal or intermittent write problems occur.
It should also be noted that crash testing can of course cause replicas to become corrupted when a write is interrupted. Also, if writing is allowed to continue without completely recovering the replication pipeline, then the number of replicas may be less than required. Both of these problems are self-healing over time, but they do take some time, as follows.
Blocks with incorrect number of replicas will be detected every few seconds (under control of the ‘dfs.namenode.replication.interval’ parameter), by a scanning process inside the Namenode, and fixed shortly thereafter. This is different than corrupt replicas.
Corrupt replicas will have incorrect checksums, and this protects the Client from using any corrupt replicas they may encounter. When corrupt replicas are detected by attempted client access, they are immediately put in a fix queue. There is also a background process that very slowly sweeps the entire storage system for corrupt replicas; this requires reading the replicas themselves, and typically takes days or longer per cycle on a production cluster. Once in the fix queue, replicas are typically fixed within minutes, unless the system is extraordinarily busy.