
General guidelines and best practices for tuning dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold property

Master Mentor

I'm looking for general guidelines and best practices from the field on the following two properties in hdfs-site.xml. I'm after more than the descriptions in hdfs-default.xml: what are people seeing in practice, and what values are being used in production for these two properties?

dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction
1 ACCEPTED SOLUTION

7 REPLIES


Rising Star

Hi @Arpit Agarwal, I don't know the intricacies of this, but I'm trying to understand which is the better option: running the balancer as a recovery mechanism at regular intervals, or using a better placement policy when the blocks are written in the first place. I presume the default block placement policy is round-robin (RR). If the placement is round-robin, then the smaller disks fill up faster. If the placement policy could instead take into account both the available space and the IO throughput of each disk, wouldn't that be a better choice?

Also, as documented, these two properties apply only when dfs.datanode.fsdataset.volume.choosing.policy is set to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml). However, I couldn't find any property named dfs.datanode.fsdataset.volume.choosing.policy in my configuration. Please let me know where this is set.

Please correct me if I am wrong in my understanding.


Hi @Greenhorn Techie, yes, I agree the ideal placement policy would factor in both available space and IO load. However, there is currently no implementation that does that.

The property "dfs.datanode.fsdataset.volume.choosing.policy" is defined in hdfs-default.xml:

<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value></value>
  <description>
    The class name of the policy for choosing volumes in the list of
    directories.  Defaults to
    org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy.
    If you would like to take into account available disk space, set the
    value to
    "org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy".
  </description>
</property>
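For reference, a minimal hdfs-site.xml fragment that switches a DataNode to the available-space policy and sets the two properties from the original question might look like the sketch below. The threshold and fraction values shown are the upstream defaults from hdfs-default.xml (10 GB and 0.75), not field-tested recommendations; tune them for your own disk sizes.

```xml
<!-- Sketch only: values are the upstream defaults, not production recommendations. -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
  <!-- Volumes whose free space differs by less than this many bytes
       are considered balanced (default: 10737418240 bytes = 10 GB). -->
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<property>
  <!-- Fraction of new block allocations directed to the volumes with more
       free space; must be between 0.0 and 1.0 (default: 0.75). -->
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>
```

The DataNodes need to be restarted for the change to take effect.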


Rising Star

Thanks @Arpit Agarwal for your response. So it finally boils down to choosing between the RR and AvailableSpace policies, with Hortonworks recommending the RR policy plus the DiskBalancer versus Cloudera's recommendation of the AvailableSpace policy? Am I correct in saying that? 🙂
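As a rough sketch of the RR-plus-DiskBalancer approach mentioned above: newer Hadoop releases ship an intra-DataNode disk balancer (HDFS-1312, in Hadoop 3.0 and some vendor 2.x lines) that rebalances data across a single DataNode's volumes. The hostname below is hypothetical, and the feature assumes dfs.disk.balancer.enabled is set to true in hdfs-site.xml.

```
# Sketch only; assumes a running cluster with the disk balancer feature enabled.

# 1. Generate a plan describing data moves between one DataNode's volumes.
hdfs diskbalancer -plan datanode1.example.com

# 2. Execute the plan; the plan file path is printed by the previous step
#    (the <timestamp> part is a placeholder, not a literal path).
hdfs diskbalancer -execute /system/diskbalancer/<timestamp>/datanode1.example.com.plan.json

# 3. Check progress on that DataNode.
hdfs diskbalancer -query datanode1.example.com
```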


Hortonworks recommends using the default RoundRobin policy.

New Contributor

I have the exact same question. @Artem Ervits have you come to any conclusion since this thread died last July?

Master Mentor

@Anant Rathi, I have some verified answers in this thread from engineering, and also another answer from @Chris Nauroth. There's a reference blog: http://gbif.blogspot.com/2015/05/dont-fill-your-hdfs-disks-upgrading-to.html. We don't have field agreement on one policy or the other.

AvailableSpaceVolumeChoosingPolicy is not something we have ever formally tested or certified. It was developed at Cloudera, and we do not cover it under our support.