hdfs trash compaction

Super Collaborator

The defaults are fs.trash.interval=0 and fs.trash.checkpoint.interval=0, i.e. the trash feature is disabled. What values are recommended for production-like clusters? And if these values are 0, what is the command to empty all HDFS trash directories on a periodic basis?

1 ACCEPTED SOLUTION

Guru

Adding all details from other answer here to consolidate.

Try to keep fs.trash.interval long (I prefer one week). fs.trash.checkpoint.interval is the interval at which the emptier thread runs to clean up trash that is older than fs.trash.interval. Keep this shorter, for example twice a day or more often. If you leave it at 0 with a one-week fs.trash.interval, cleanup happens only every 7 days, so some files can stay in trash for up to 14 days.

@Arul Ramachandran @Sean Creedon @Saumil Mayani

If fs.trash.checkpoint.interval > fs.trash.interval or == 0, fs.trash.interval is used as the checkpoint interval. So you can leave it at the default of 0, as long as you are OK with some data staying in trash longer.

You can take a look at the TrashPolicyDefault.java code, which has the details.

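Emptier(Configuration conf, long emptierInterval) throws IOException {
  this.conf = conf;
  this.emptierInterval = emptierInterval;
  // deletionInterval comes from fs.trash.interval, emptierInterval from
  // fs.trash.checkpoint.interval. Fall back to the deletion interval when
  // the checkpoint interval is larger than it or is left at 0.
  if (emptierInterval > deletionInterval || emptierInterval == 0) {
    LOG.info("The configured checkpoint interval is " +
             (emptierInterval / MSECS_PER_MINUTE) + " minutes." +
             " Using an interval of " +
             (deletionInterval / MSECS_PER_MINUTE) +
             " minutes that is used for deletion instead");
    this.emptierInterval = deletionInterval;
  }
  // ... rest of the constructor not shown in this excerpt
}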
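
To make the fallback concrete, here is a small standalone sketch of the same rule, working in minutes the way the two properties are configured. The class and method names (TrashIntervals, effectiveCheckpointMinutes, approxWorstCaseRetentionMinutes) are made up for illustration and are not part of Hadoop. The worst-case figure is only a rough estimate, assuming a file deleted right after an emptier run waits about one checkpoint interval before it is checkpointed, and that checkpoint is then kept for fs.trash.interval.

```java
class TrashIntervals {
    /**
     * Restates the fallback from the Emptier constructor: a checkpoint
     * interval of 0, or one larger than the trash interval, is replaced
     * by the trash interval. (A trash interval of 0 means trash is
     * disabled, so this value is only meaningful when it is > 0.)
     */
    static long effectiveCheckpointMinutes(long trashIntervalMinutes,
                                           long checkpointIntervalMinutes) {
        if (checkpointIntervalMinutes > trashIntervalMinutes
                || checkpointIntervalMinutes == 0) {
            return trashIntervalMinutes;
        }
        return checkpointIntervalMinutes;
    }

    /**
     * Rough estimate (an assumption, not taken from Hadoop): time spent
     * waiting for the next checkpoint plus the retention of that checkpoint.
     */
    static long approxWorstCaseRetentionMinutes(long trashIntervalMinutes,
                                                long checkpointIntervalMinutes) {
        return trashIntervalMinutes
                + effectiveCheckpointMinutes(trashIntervalMinutes,
                                             checkpointIntervalMinutes);
    }

    public static void main(String[] args) {
        long oneWeek = 7L * 24 * 60;   // 10080 minutes
        long twiceADay = 12L * 60;     // 720 minutes
        System.out.println("checkpoint=0   -> worst case (minutes): "
                + approxWorstCaseRetentionMinutes(oneWeek, 0));
        System.out.println("checkpoint=720 -> worst case (minutes): "
                + approxWorstCaseRetentionMinutes(oneWeek, twiceADay));
    }
}
```

With a one-week trash interval and the checkpoint interval left at 0, the estimate comes out at 14 days, which matches the figure mentioned above. A twice-a-day checkpoint brings it down to roughly 7.5 days.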

REPLIES

Super Guru
@Saumil Mayani

The recommended default for fs.trash.interval in HDP is 360 minutes, i.e. 6 hours.

Whether to change this value depends on how important the deleted data is. From past experience, I usually suggest setting it to 1 day, i.e. 1440 minutes.

fs.trash.checkpoint.interval should always be smaller than or equal to fs.trash.interval.

Contributor

Hi Saumil,

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Space_Reclamat...

As the documentation says, to enable the trash feature with a given retention period, set fs.trash.interval to a value greater than zero.

fs.trash.interval can be set to 360 minutes (6 hours) or 1440 minutes (24 hours), depending on how long you want to keep your trash. The downside of keeping more trash is that the NameNode cannot reclaim the blocks of those files until they are purged.

fs.trash.checkpoint.interval can be set to something smaller than fs.trash.interval (1 hour or 3 hours, for example). The process that runs on this interval creates a new checkpoint and deletes any older checkpoints that have expired based on fs.trash.interval.
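
Since both properties are expressed in minutes, it is easy to slip when converting from hours or days. A minimal sketch of the conversions for the retention periods mentioned in this thread, using only java.time; the class name TrashIntervalMinutes and its helper methods are hypothetical:

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

class TrashIntervalMinutes {
    /** Converts a retention period to the minute value the properties expect. */
    static long minutes(Duration period) {
        return period.toMinutes();
    }

    /** Retention periods suggested in this thread, keyed by a readable label. */
    static Map<String, Long> candidates() {
        Map<String, Long> values = new LinkedHashMap<>();
        values.put("6 hours", minutes(Duration.ofHours(6)));
        values.put("1 day", minutes(Duration.ofDays(1)));
        values.put("1 week", minutes(Duration.ofDays(7)));
        return values;
    }

    public static void main(String[] args) {
        // Prints one line per candidate, e.g. "6 hours = 360 minutes".
        candidates().forEach((label, mins) ->
                System.out.println(label + " = " + mins + " minutes"));
    }
}
```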

Hope this helps.

Guru

From past experience, set fs.trash.interval to a high number, at least a week. While some accidental deletes are noticed immediately, in other cases we only learn about an accidental delete while debugging another issue downstream. If your cluster has plenty of free space right now, leave it at one or two weeks so you have enough time to restore deleted data.

New Contributor

@Saumil Mayani @Ravi Mutyala I am trying to understand fs.trash.checkpoint.interval=0, the default setting. Say we set fs.trash.interval=<X minutes> and leave fs.trash.checkpoint.interval=0 (or do not set it at all): how does the trash feature work? Does the trash checkpoint interval default to the trash interval?

Expert Contributor

Just in case you do want to clean the trash manually:

expunge

Usage: hadoop fs -expunge

Empty the Trash. Refer to the HDFS Architecture Guide for more information on the Trash feature.
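
For the "periodic basis" part of the original question, a cron entry calling the command above is probably the simplest route. If it has to be driven from a JVM process instead, the following is a minimal sketch that shells out to the same command on a fixed schedule. It assumes the hadoop CLI is on the PATH of the user running it and that the command acts on that user's trash. The class and method names (PeriodicExpunge, expungeCommand, runOnce, schedule) are hypothetical.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PeriodicExpunge {
    /** The shell command quoted above, split into arguments for ProcessBuilder. */
    static List<String> expungeCommand() {
        return List.of("hadoop", "fs", "-expunge");
    }

    /** Runs the command once and returns its exit code. */
    static int runOnce(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()   // forward the CLI's stdout/stderr to this JVM's console
                .start();
        return process.waitFor();
    }

    /**
     * Starts a single-threaded scheduler that runs the command immediately
     * and then every periodHours hours. The caller is responsible for
     * shutting the returned scheduler down.
     */
    static ScheduledExecutorService schedule(List<String> command, long periodHours) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            // Catch checked exceptions here: an exception escaping the task
            // would cancel all later runs.
            try {
                int exitCode = runOnce(command);
                if (exitCode != 0) {
                    System.err.println("expunge exited with code " + exitCode);
                }
            } catch (IOException e) {
                System.err.println("could not run expunge: " + e.getMessage());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0, periodHours, TimeUnit.HOURS);
        return scheduler;
    }
}
```

A caller would use something like PeriodicExpunge.schedule(PeriodicExpunge.expungeCommand(), 12) and keep the returned scheduler around until shutdown.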

New Contributor

What happens if fs.trash.interval=1440 and fs.trash.checkpoint.interval=0? Does this mean the trash feature is disabled?

Expert Contributor

Yes, when fs.trash.checkpoint.interval=0 or is not set, fs.trash.interval is used as the checkpoint interval.

Also, fs.trash.checkpoint.interval should always be set smaller than or equal to fs.trash.interval. If it is larger, fs.trash.interval is used as the checkpoint interval, as in the case above.

Expert Contributor

For misconfigurations like the cases above, you will find an INFO-level log message like the one below:

"The configured checkpoint interval is 0 minutes. Using an interval of XX (e.g., 60) minutes that is used for deletion instead"