
General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

Mentor

I'm looking for general guidelines and best practices from the field on the following two properties in hdfs-site.xml. I am looking for more than the descriptions given in hdfs-default.xml. What are people seeing, and what values are being used in production for these two configuration properties?

dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction

Accepted Solution

Re: General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

Hi Artem, we do not recommend using AvailableSpaceVolumeChoosingPolicy. It can cause a subset of disk drives to become a bottleneck for writes. See HDFS-8538 for some more discussion on this.

A new HDFS tool called the DiskBalancer is under active development (HDFS-1312). It will allow administrators to recover from skewed distribution caused by replacing failed disks or just adding new disks.

Re: General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

New Contributor

Hi @Arpit Agarwal, I don't know the intricacies of this, but I'm trying to understand which is the better option: running the balancer at regular intervals as a recovery mechanism, or using a better placement policy when the blocks are written. I presume the default block placement policy is round-robin (RR). If so, the smaller disks fill up faster. If the placement policy could instead take into account both the available space and the IO throughput of each disk, wouldn't that be a better choice?

Also, as documented, these two properties are only applicable when dfs.datanode.fsdataset.volume.choosing.policy is set to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml). But I couldn't find any property named dfs.datanode.fsdataset.volume.choosing.policy. Please let me know where this is set.

Please correct me if I am wrong in my understanding.

Re: General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

Hi @Greenhorn Techie, yes, I agree the ideal placement policy would factor in available space and IO load. However, there is currently no implementation that does that.

The property "dfs.datanode.fsdataset.volume.choosing.policy" is defined in hdfs-default.xml:

<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value></value>
  <description>
    The class name of the policy for choosing volumes in the list of
    directories.  Defaults to
    org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy.
    If you would like to take into account available disk space, set the
    value to
    "org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy".
  </description>
</property>
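
For readers who still want to try the policy, here is a minimal sketch of how the three properties might sit together in hdfs-site.xml. This is an illustration, not a recommendation. The two tuning values shown are what I understand to be the stock defaults in hdfs-default.xml (a 10 GB threshold and a 0.75 preference fraction); treat them as assumptions and check the hdfs-default.xml that ships with your Hadoop version.

```xml
<!-- Sketch only: switch the DataNode from the default round-robin policy
     to the available-space policy. -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>

<!-- Assumed default: 10737418240 bytes (10 GB). Volumes whose free space
     differs by no more than this amount are treated as balanced, and
     writes are then spread across them round-robin. -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>

<!-- Assumed default: 0.75. When volumes are not balanced, this is the
     fraction of new block allocations directed to the volumes with
     more free space. -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75f</value>
</property>
```

A higher preference fraction sends more writes to the emptier disks, which is exactly the write-bottleneck concern raised earlier in this thread.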


Re: General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

New Contributor

Thanks @Arpit Agarwal for your response. So it finally boils down to choosing between the RR and AvailableSpace policies: Hortonworks recommends the RR policy with DiskBalancer, whereas Cloudera recommends the AvailableSpace policy. Am I correct in saying that? :)

Re: General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

Hortonworks recommends using the default RoundRobin policy.

Re: General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

New Contributor

I have the exact same question. @Artem Ervits have you come to any conclusion since this thread died last July?

Re: General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

Mentor

@Anant Rathi I have some verified answers in this thread from engineering, and also another answer from @Chris Nauroth. There's a reference blog post: http://gbif.blogspot.com/2015/05/dont-fill-your-hdfs-disks-upgrading-to.html. We don't have field agreement on one policy or the other.

AvailableSpaceVolumeChoosingPolicy is not something that we have ever formally tested or certified. It was developed at Cloudera. We do not certify it under our support.
