For Namenode HA environment, what is the recommended value for dfs.client.retry.policy.enabled and why?

For a cluster with NameNode HA, what is the recommended value for:

dfs.client.retry.policy.enabled

The default is false.

I would also like to understand the reasoning behind the recommended value when NameNode HA is enabled.

2 REPLIES

Re: For Namenode HA environment, what is the recommended value for dfs.client.retry.policy.enabled and why?

@Dinesh Chitlangia

The property dfs.client.retry.policy.enabled is important when HA is enabled, as it enables HDFS client retry in case of NameNode failure. So, after enabling HDFS HA, the property should be set to true in hdfs-site.xml.

If dfs.client.retry.policy.enabled=false in an HA environment, the NameNode connection attempt is made only once and fails without attempting to connect to the failover node. The relevant comment from the code is below:

  /**
   * Return the default retry policy used in RPC.
   *
   * If dfs.client.retry.policy.enabled == false, use TRY_ONCE_THEN_FAIL.
   *
   * Otherwise, first unwrap ServiceException if possible, and then
   * (1) use multipleLinearRandomRetry for
   *     - SafeModeException, or
   *     - IOException other than RemoteException, or
   *     - ServiceException; and
   * (2) use TRY_ONCE_THEN_FAIL for
   *     - non-SafeMode RemoteException, or
   *     - non-IOException.
   */
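
To make the rules in that comment easier to follow, here is a minimal, self-contained sketch of the decision it describes. This is not the Hadoop source: the class `RetryPolicySketch`, the `Policy` enum, and the stand-in exception types (including the `safeMode` flag on `RemoteException`) are all hypothetical names invented for illustration, and the real code may differ in detail.

```java
import java.io.IOException;

public class RetryPolicySketch {
    // Hypothetical labels for the two outcomes named in the comment.
    enum Policy { TRY_ONCE_THEN_FAIL, MULTIPLE_LINEAR_RANDOM_RETRY }

    // Hypothetical stand-in for a server-side error reported to the client.
    static class RemoteException extends IOException {
        final boolean safeMode; // true if the remote error is a SafeModeException
        RemoteException(boolean safeMode) { this.safeMode = safeMode; }
    }

    // Hypothetical stand-in for the RPC-layer wrapper exception.
    static class ServiceException extends Exception {
        ServiceException(Throwable cause) { super(cause); }
    }

    static Policy choose(boolean retryPolicyEnabled, Exception e) {
        // dfs.client.retry.policy.enabled == false: never retry.
        if (!retryPolicyEnabled) {
            return Policy.TRY_ONCE_THEN_FAIL;
        }
        // Unwrap ServiceException if it carries an underlying exception.
        if (e instanceof ServiceException && e.getCause() instanceof Exception) {
            e = (Exception) e.getCause();
        }
        // RemoteException: retry only when it represents safe mode.
        if (e instanceof RemoteException) {
            return ((RemoteException) e).safeMode
                    ? Policy.MULTIPLE_LINEAR_RANDOM_RETRY
                    : Policy.TRY_ONCE_THEN_FAIL;
        }
        // Other IOExceptions, or a ServiceException that could not be unwrapped: retry.
        if (e instanceof IOException || e instanceof ServiceException) {
            return Policy.MULTIPLE_LINEAR_RANDOM_RETRY;
        }
        // Anything that is not an IOException: fail immediately.
        return Policy.TRY_ONCE_THEN_FAIL;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, new IOException("connection refused")));
        System.out.println(choose(true, new IOException("connection refused")));
        System.out.println(choose(true, new RemoteException(true)));
        System.out.println(choose(true, new RemoteException(false)));
    }
}
```

Note that in this sketch a plain connection failure (an `IOException` that is not a `RemoteException`) is retried when the property is true.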

Re: For Namenode HA environment, what is the recommended value for dfs.client.retry.policy.enabled and why?

The property dfs.client.retry.policy.enabled should be set to false in a cluster with HA enabled.

The reason is that in a NameNode High Availability (NN HA) system, when one of the NameNodes goes down (the NN process is stopped), attempts to use HDFS can result in repeated errors and apparent hangs. Running or new jobs that depend on HDFS access will also fail, because the client keeps talking to the failed NN.
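
To give a rough sense of how long such an apparent hang could last, here is a small sketch that adds up the nominal waiting time for a retry schedule. It assumes the schedule is expressed as comma-separated pairs of (sleep in milliseconds, number of retries), as in dfs.client.retry.policy.spec, and it uses 10000,6,60000,10 as the example value; I believe that is the shipped default, but please check hdfs-default.xml for your version. The class and method names are hypothetical, and because the actual sleeps are randomized, this is only a nominal estimate.

```java
public class RetrySpecEstimate {
    // Sum sleepMillis * retries over each (sleepMillis, retries) pair in the spec.
    static long nominalTotalMillis(String spec) {
        String[] parts = spec.trim().split("\\s*,\\s*");
        if (parts.length % 2 != 0) {
            throw new IllegalArgumentException("expected pairs of sleepMillis,retries: " + spec);
        }
        long total = 0;
        for (int i = 0; i < parts.length; i += 2) {
            long sleepMillis = Long.parseLong(parts[i]);
            long retries = Long.parseLong(parts[i + 1]);
            total += sleepMillis * retries;
        }
        return total;
    }

    public static void main(String[] args) {
        String spec = "10000,6,60000,10"; // assumed example value, see note above
        long total = nominalTotalMillis(spec);
        // prints: 10000,6,60000,10 -> about 660 s of nominal waiting
        System.out.println(spec + " -> about " + (total / 1000) + " s of nominal waiting");
    }
}
```

Under that assumed schedule, a client that keeps retrying the same stopped NameNode could wait on the order of eleven minutes before giving up, which is consistent with the hangs described above.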