hdfs trash compaction

Expert Contributor

The defaults are fs.trash.interval=0 and fs.trash.checkpoint.interval=0, which means the trash feature is disabled. What values are recommended for production-like clusters? If these values are 0, what command empties the HDFS trash directories on a periodic basis?

Accepted Solutions
Re: hdfs trash compaction

Guru

Adding all the details from the other answer here to consolidate.

Try to keep fs.trash.interval long (I prefer one week). fs.trash.checkpoint.interval is the interval of the thread that runs to clean up all trash older than fs.trash.interval. Keep this shorter, such as twice a day or more often. If you leave it at 0 with a one-week fs.trash.interval, cleanup happens every 7 days, so some files can stay in trash for up to 14 days.

@Arul Ramachandran @Sean Creedon @Saumil Mayani

If fs.trash.checkpoint.interval > fs.trash.interval or == 0, fs.trash.interval is used as the checkpoint interval. So you can leave it at the default of 0, as long as you are OK with some data staying in trash longer.

You can take a look at the TrashPolicyDefault.java code, which has the details.

Emptier(Configuration conf, long emptierInterval) throws IOException {
  this.conf = conf;
  this.emptierInterval = emptierInterval;
  // Fall back to the deletion interval when the checkpoint interval
  // is 0 or larger than the deletion interval.
  if (emptierInterval > deletionInterval || emptierInterval == 0) {
    LOG.info("The configured checkpoint interval is " +
             (emptierInterval / MSECS_PER_MINUTE) + " minutes." +
             " Using an interval of " +
             (deletionInterval / MSECS_PER_MINUTE) +
             " minutes that is used for deletion instead");
    this.emptierInterval = deletionInterval;
  }
}
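
To make the fallback and the "up to 14 days" figure concrete, here is a minimal standalone sketch. It is not Hadoop code: the class and method names are made up for illustration, all values are in minutes, and the worst-case number is only a rough estimate (retention interval plus one effective checkpoint interval), matching the reasoning above.

```java
// Hypothetical helper, not part of Hadoop; it only mirrors the condition in the constructor above.
class TrashIntervals {
    /** Checkpoint interval actually used, given the two configured values in minutes. */
    static long effectiveCheckpointMinutes(long trashInterval, long checkpointInterval) {
        if (checkpointInterval > trashInterval || checkpointInterval == 0) {
            return trashInterval; // same fallback as the Emptier constructor
        }
        return checkpointInterval;
    }

    /** Rough upper bound (an assumption, not an exact guarantee) on time spent in trash. */
    static long worstCaseRetentionMinutes(long trashInterval, long checkpointInterval) {
        return trashInterval + effectiveCheckpointMinutes(trashInterval, checkpointInterval);
    }

    public static void main(String[] args) {
        long week = 7L * 24 * 60; // 10080 minutes
        System.out.println(effectiveCheckpointMinutes(week, 0));   // 10080
        System.out.println(worstCaseRetentionMinutes(week, 0));    // 20160, i.e. 14 days
        System.out.println(worstCaseRetentionMinutes(week, 720));  // 10800, i.e. 7.5 days
    }
}
```

With a one-week fs.trash.interval, a 12-hour checkpoint interval cuts the estimated worst case from 14 days to about 7.5 days.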

9 REPLIES

Re: hdfs trash compaction

@Saumil Mayani

The recommended default value for "fs.trash.interval" in HDP is 360 minutes, which is 6 hours.

Whether to change this value depends on the priority of the deleted data. From past experience, I usually suggest keeping the value at 1 day, i.e. 1440 minutes.

fs.trash.checkpoint.interval should always be set no larger than "fs.trash.interval".

Re: hdfs trash compaction

Cloudera Employee

Hi Saumil,

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Space_Reclamat...

As the documentation says, to enable trash collection for a certain period, you can set it to a value greater than zero.

The fs.trash.interval can be set to 360 minutes (6 hours) or 1440 minutes (24 hours), depending on how long you want to keep your trash. The downside of storing more trash is that the NameNode cannot reclaim the blocks of those files while they remain in trash.

The fs.trash.checkpoint.interval can be set to something smaller than the fs.trash.interval (1 hour or 3 hours). The process that runs on this interval creates new checkpoints and deletes any older checkpoints that have expired based on fs.trash.interval.
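
Both properties are expressed in minutes, so it is easy to slip on the conversion. A small sketch using java.time.Duration to derive the values; the class and method names are hypothetical, and the name=value output is just for reading, not a Hadoop config format.

```java
import java.time.Duration;

// Hypothetical helper for converting a retention period into the minutes the properties expect.
class TrashIntervalMinutes {
    /** Formats a property as name=minutes for the given duration. */
    static String property(String name, Duration value) {
        return name + "=" + value.toMinutes();
    }

    public static void main(String[] args) {
        System.out.println(property("fs.trash.interval", Duration.ofHours(6)));            // fs.trash.interval=360
        System.out.println(property("fs.trash.interval", Duration.ofDays(1)));             // fs.trash.interval=1440
        System.out.println(property("fs.trash.checkpoint.interval", Duration.ofHours(3))); // fs.trash.checkpoint.interval=180
    }
}
```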

Hope this helps..

Re: hdfs trash compaction

Guru

From past experience, set fs.trash.interval to a high number, like at least a week. While some accidental deletes are identified immediately, there are cases where we only learn about an accidental delete while debugging another issue downstream. If your cluster has good free space right now, leave it at one or two weeks so you have enough time to revert deletes.

Re: hdfs trash compaction

New Contributor

@Saumil Mayani @Ravi Mutyala Trying to understand fs.trash.checkpoint.interval=0, the default setting. Say we set fs.trash.interval=<X minutes> and leave fs.trash.checkpoint.interval=0 (or do not set it at all): how does the trash feature work? Does the trash checkpoint interval default to the trash interval?

Re: hdfs trash compaction

Expert Contributor

Just in case you do want to clean the trash manually:

expunge

Usage: hadoop fs -expunge

Empty the Trash. Refer to the HDFS Architecture Guide for more information on the Trash feature.
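
For the "periodic basis" part of the question, one option is to schedule that command. Below is a rough sketch that shells out to `hadoop fs -expunge` every 12 hours; a cron entry would do the same job. The class name and the 12-hour period are arbitrary choices for illustration, and it assumes the `hadoop` binary is on the PATH of the user whose trash should be cleaned.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper; only the "hadoop fs -expunge" command itself comes from the thread.
class PeriodicExpunge {
    /** The shell command quoted above, split into arguments. */
    static List<String> expungeCommand() {
        return List.of("hadoop", "fs", "-expunge");
    }

    /** Runs the command once and returns its exit code. */
    static int runExpunge() throws IOException, InterruptedException {
        Process process = new ProcessBuilder(expungeCommand()).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                System.out.println("expunge exited with " + runExpunge());
            } catch (IOException e) {
                // Catch here: an exception escaping the task would cancel all later runs.
                System.err.println("expunge failed: " + e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the executor
            }
        }, 0, 12, TimeUnit.HOURS);
    }
}
```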

Re: hdfs trash compaction

New Contributor

What happens if fs.trash.interval=1440 and fs.trash.checkpoint.interval=0? Does this mean the trash feature is disabled?

Re: hdfs trash compaction

Rising Star

Yes, when fs.trash.checkpoint.interval=0 or fs.trash.checkpoint.interval is not set, fs.trash.interval is used as the checkpoint interval.

Also, fs.trash.checkpoint.interval should always be set no larger than fs.trash.interval. If it is larger, fs.trash.interval is used as the checkpoint interval, just as in the case above.

Re: hdfs trash compaction

Rising Star

For a misconfiguration like the cases above, you will find an INFO-level log message like the one below:

"The configured checkpoint interval is 0 minutes. Using an interval of XX (e.g., 60) minutes that is used for deletion instead"