
HDFS replication factor for a directory.


Is there a way to configure a non-default replication factor for an HDFS directory so that all future files and sub-directories in that directory use that replication factor? Currently we are using a workaround: a daemon process that repeatedly sets the replication factor for all files in the required directory.

Is there a better way to do this?

# Workaround daemon: re-apply replication factor 2 to everything under /tmp,
# wait for re-replication to complete, then repeat every 30 seconds.
while true; do
    hdfs dfs -setrep -w 2 /tmp/
    sleep 30
done

I see that JIRA HDFS-199 (https://issues.apache.org/jira/browse/HDFS-199) was opened for this at one point, but it is blocked by HADOOP-4771 (https://issues.apache.org/jira/browse/HADOOP-4771).

1 ACCEPTED SOLUTION


As far as I know, this is currently not possible. I'm not sure why this feature wasn't pushed in the last couple of years; maybe multi-tenancy wasn't really an issue. I don't think anyone is working on HDFS-199 at the moment. I have seen a couple of requests for this in our internal Jira, so if you open a new feature enhancement request with our support team, we might be able to get the ball rolling again.

Your workaround looks good, I'd keep it for now.


4 REPLIES


Thanks for confirming @Jonas Straub


There is currently no way to define a replication factor on a directory and have it cascade down automatically to all child files.

Instead of running a daemon process to change the replication factor after the fact, do you have the option of setting the replication factor explicitly when you create the file? For example, here is how you can override it while writing a file through the CLI.

> hdfs dfs -D dfs.replication=2 -put hello /hello

> hdfs dfs -stat 'name=%n repl=%r' /hello

name=hello repl=2

If your use case is something like a MapReduce job, then you can override dfs.replication at job submission time too. Creating the file with the desired replication factor in the first place has an advantage over creating it with the default factor of 3 and then retroactively changing it to 2: the initial copy at replication 3 temporarily wastes disk space, and lowering it to 2 afterwards creates extra work for the cluster, which has to detect the over-replicated blocks and delete the surplus replicas.
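For instance, assuming the job's driver class runs through ToolRunner (so that GenericOptionsParser picks up -D options), you could pass the property at submission time like this; the jar name, class name, and paths below are placeholders:

> hadoop jar my-job.jar com.example.MyJob -D dfs.replication=2 /input /output

Every file the job writes to HDFS should then be created with replication factor 2 from the start.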


It is easy to override default HDFS properties such as the replication factor or the HDFS block size at the time a file is created.
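As a quick CLI sketch (the 64 MB block size, local file name, and target path are arbitrary examples), both properties can be overridden on the write and then verified:

> hdfs dfs -D dfs.blocksize=67108864 -D dfs.replication=2 -put data.txt /data/data.txt

> hdfs dfs -stat 'name=%n repl=%r blocksize=%o' /data/data.txt

This should print name=data.txt repl=2 blocksize=67108864. Note that dfs.blocksize must be a multiple of the checksum chunk size (512 bytes by default).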