
Deleting old data from HDFS once storage approaches full capacity

New Contributor

I have a system where I'm ingesting data into an HDFS cluster which, if left unchecked, would fill up all my storage. At some point the cluster will start to reach capacity, and when that happens I'd like to delete data in certain directories in order to prevent reaching 100% capacity and not being able to add any more data.

I'm envisioning the process would be something like:

  • The data storage on the filesystem goes above 90% of capacity
  • Files are deleted from the oldest to the newest until storage drops back below 90%

I'd expect this to run once per day rather than constantly in the background. It seems more complex than the usual 'delete anything more than x days old' use case.

I've considered writing my own scripts to do this (a rough sketch of the trigger check is below), but given that this seems like the sort of task that would be fairly common, I wanted to see if there is a recommended way of doing it in HDFS. I'm running Ambari, NiFi, Oozie and the rest of the suite.
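
For the trigger condition, I'm picturing something like the check below, a rough sketch in Python that shells out to the hdfs CLI (the 90% figure is just my placeholder):

    # Rough sketch of the trigger check only; assumes the 'hdfs' CLI is on the PATH.
    import subprocess

    THRESHOLD = 90.0  # percent - placeholder

    def hdfs_usage_percent():
        # 'hdfs dfs -df /' prints a header row, then:
        #   Filesystem  Size  Used  Available  Use%
        out = subprocess.check_output(["hdfs", "dfs", "-df", "/"], text=True)
        return float(out.strip().splitlines()[-1].split()[-1].rstrip("%"))

    if hdfs_usage_percent() > THRESHOLD:
        print("Over threshold - start pruning oldest files")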

Have people encountered this sort of requirement before? If so what's an effective way of handling it?

Thanks in advance.

1 ACCEPTED SOLUTION

Guru

Hi @scottgr

Writing your own script is usually the most popular way. The script can be automated in a number of ways, including Oozie or the Workflow Manager UI in Ambari. While it is a common task, each use case is different: we can't make a general assumption about any cluster, since for some people older data may be more important than newer data, depending on the data set. It gets even more complicated in multi-tenant data lakes, as you can imagine.
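
For what it's worth, here is a minimal sketch of the kind of cleanup script people write for this, driving the hdfs CLI from Python. It assumes 'hdfs' is on the PATH, that paths contain no spaces, and that everything under the directory is safe to delete oldest-first; the /data/ingest path and 90% target are placeholders taken from your question:

    # A minimal sketch, not production code. Assumes the 'hdfs' CLI is on the
    # PATH, no spaces in paths, and that everything under DATA_DIR is
    # safe to delete oldest-first.
    import subprocess

    DATA_DIR = "/data/ingest"   # placeholder ingest directory
    TARGET = 0.90               # prune until usage is back under 90%

    def hdfs(*args):
        return subprocess.check_output(["hdfs", "dfs", *args], text=True)

    # 'hdfs dfs -df /' prints a header row, then: Filesystem Size Used Available Use%
    size_b, used_b = map(int, hdfs("-df", "/").strip().splitlines()[-1].split()[1:3])

    # Collect (mtime, raw size, path) for every file under DATA_DIR.
    # 'hdfs dfs -ls' sorts per directory, so sort the whole set ourselves.
    files = []
    for line in hdfs("-ls", "-R", DATA_DIR).splitlines():
        parts = line.split()
        if len(parts) >= 8 and parts[0].startswith("-"):    # file rows, not dirs
            mtime = parts[5] + " " + parts[6]               # 'YYYY-MM-DD HH:MM'
            raw = int(parts[4]) * int(parts[1])             # length * replication
            files.append((mtime, raw, parts[-1]))

    # Walk oldest-first until projected usage is back under the target.
    to_delete, reclaimed = [], 0
    for _, raw, path in sorted(files):
        if (used_b - reclaimed) / size_b <= TARGET:
            break
        reclaimed += raw
        to_delete.append(path)

    for path in to_delete:
        # -skipTrash frees space immediately; a plain -rm only moves files to
        # .Trash, which still counts against HDFS capacity. This is permanent,
        # so consider printing the paths first as a dry run.
        subprocess.run(["hdfs", "dfs", "-rm", "-skipTrash", path], check=True)

Once a script like this works for your layout, it can be put on the daily schedule you described with an Oozie coordinator or the Workflow Manager UI.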


2 REPLIES

New Contributor

That's great. Thanks for the response - much appreciated. I'm happy to create a script to do this for our particular use case; I just wanted to make sure I wasn't reinventing the wheel.