Support Questions
Find answers, ask questions, and share your expertise

HDFS replication factor for a directory.



Is there a way to configure a non-default replication factor for an HDFS directory such that all future files and sub-directories in that directory use that specific replication factor? Currently, we are using a workaround of running a daemon process that sets the replication factor for all files in the required directory.

Is there a better way to do this?

# Workaround: periodically re-apply the desired replication factor
while true; do
    hdfs dfs -setrep -w 2 /tmp/
    sleep 30
done

I see that at one point the JIRA https://issues.apache.org/jira/browse/HDFS-199 was opened for this, but it is blocked by https://issues.apache.org/jira/browse/HADOOP-4771.

1 ACCEPTED SOLUTION


Re: HDFS replication factor for a directory.

As far as I know, this is currently not possible; I'm not sure why this feature wasn't pushed forward in the last couple of years. Maybe multi-tenancy wasn't really an issue. I don't think anyone is working on HDFS-199 at the moment. I have seen a couple of requests in our internal Jira regarding this; if you open a new feature-enhancement request with our support team, we might be able to get the ball rolling again.

Your workaround looks good, I'd keep it for now.

4 REPLIES 4


Re: HDFS replication factor for a directory.

Thanks for confirming, @Jonas Straub.

Re: HDFS replication factor for a directory.

There is currently no way to define a replication factor on a directory and have it cascade down automatically to all child files.

Instead of running a daemon process to change the replication factor after the fact, do you have the option of setting it explicitly when you create the file? For example, here is how you can override it while saving a file through the CLI.

> hdfs dfs -D dfs.replication=2 -put hello /hello

> hdfs dfs -stat 'name=%n repl=%r' /hello

name=hello repl=2

If your use case is something like a MapReduce job, then you can override dfs.replication at job submission time as well. Creating the file with the desired replication factor in the first place has an advantage over creating it with the default factor of 3 and then retroactively lowering it to 2: the initial 3 replicas temporarily waste disk space, and lowering the factor afterwards creates extra work for the cluster, which must detect that some blocks are over-replicated and delete the surplus replicas.
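The job-submission override mentioned above can be sketched like this (a hedged example: the jar name, driver class, and paths are placeholders, and it assumes the driver implements Tool so that GenericOptionsParser picks up -D options):

> hadoop jar my-job.jar MyDriver -D dfs.replication=2 /input /output

Any file the job writes then carries replication factor 2 from the moment it is created, so no retroactive -setrep pass is needed.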


Re: HDFS replication factor for a directory.


It is easy to override default HDFS properties like the replication factor or the HDFS block size, but only per file at file creation time; these properties cannot be attached to a directory so that its children inherit them.
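For example (a sketch; the file name, target path, and block size value are illustrative), both properties can be overridden on the put command that writes the file:

> hdfs dfs -D dfs.replication=2 -D dfs.blocksize=268435456 -put data.txt /user/example/data.txt

These settings apply only to files created with them; existing files keep the replication factor and block size they were written with.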
