Created 11-17-2015 06:19 PM
Is there a way to configure a non-default replication factor for an HDFS directory such that all future files and sub-directories in that directory use that specific replication factor? Currently, we are using a workaround of running a daemon process to set the replication factor for all files in the required directory.
Is there a better way to do this?
while true; do hdfs dfs -setrep -w 2 /tmp/; sleep 30; done
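For reference, a rough programmatic equivalent of that polling workaround is sketched below, assuming the HDFS Java client is available; the directory path, target replication, and poll interval are placeholders matching the shell loop above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ReplicationDaemon {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/tmp");   // directory to enforce; placeholder
        short targetRepl = 2;
        while (true) {
            // Walk all files under the directory recursively
            RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
            while (it.hasNext()) {
                LocatedFileStatus status = it.next();
                if (status.getReplication() != targetRepl) {
                    // Adjust the replication factor for this file only
                    fs.setReplication(status.getPath(), targetRepl);
                }
            }
            Thread.sleep(30_000);      // poll interval, matches the 30s sleep in the shell loop
        }
    }
}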
I see that https://issues.apache.org/jira/browse/HDFS-199 was opened for this at one point, but it is blocked by https://issues.apache.org/jira/browse/HADOOP-4771.
Created 11-17-2015 06:35 PM
As far as I know, this is currently not possible. I'm not sure why this feature wasn't pushed forward in the last couple of years; maybe multi-tenancy wasn't really an issue. I don't think anyone is working on HDFS-199 at the moment. I have seen a couple of requests in our internal Jira regarding this, so if you open a new feature enhancement request with our support team, we might be able to get the ball rolling again.
Your workaround looks good; I'd keep it for now.
Created 11-17-2015 10:19 PM
Thanks for confirming, @Jonas Straub.
Created 11-18-2015 06:20 PM
There is currently no way to define a replication factor on a directory and have it cascade down automatically to all child files.
Instead of running the daemon process to change the replication factor, do you have the option of setting the replication factor explicitly when you create the file? For example, here is how you can override it while saving a file through the CLI.
> hdfs dfs -D dfs.replication=2 -put hello /hello
> hdfs dfs -stat 'name=%n repl=%r' /hello
name=hello repl=2
If your use case is something like a MapReduce job, then you can override dfs.replication at job submission time too. Creating the file with the desired replication in the first place has an advantage over creating it with the default replication factor of 3 and retroactively changing it to 2: the initial copy temporarily wastes disk space, and the later change creates extra work for the cluster, which must detect that some blocks are over-replicated and delete the surplus replicas.
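As a hedged sketch of that job-submission override (the driver class and paths below are hypothetical): dfs.replication is a client-side property read when files are created, so setting it in the job configuration makes the job write its output files with that replication directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitWithReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication is honored by the HDFS client at file creation,
        // so the job's output files are written with replication 2 directly.
        conf.set("dfs.replication", "2");
        Job job = Job.getInstance(conf, "example");
        job.setJarByClass(SubmitWithReplication.class);
        // Mapper/reducer setup omitted; input/output paths are placeholders.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If the driver goes through ToolRunner, the same override can also be passed on the command line as -D dfs.replication=2 without any code change.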
Created 12-05-2015 08:16 PM
It is easy to override default HDFS properties like the replication factor or the HDFS block size at file creation time (these are per-file properties; directories themselves do not carry them).
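For example, here is a minimal sketch of that per-file override using the HDFS Java API; the path and values are illustrative, and the create() overload shown sets replication and block size for just that one file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithOverrides {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/hello");      // illustrative path
        short replication = 2;
        long blockSize = 64L * 1024 * 1024;      // 64 MB instead of the default
        int bufferSize = 4096;
        // This create() overload sets replication and block size per file,
        // so the file never goes through the default replication of 3.
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello");
        }
    }
}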