Created 08-18-2017 09:43 AM
I have a system that ingests data into an HDFS cluster and, if left unchecked, would fill up all my storage. At some point the cluster will start to reach capacity, and when that happens I'd like to delete data in certain directories to prevent hitting 100% capacity and being unable to add any more data.
I'm envisioning the process would be something like:
1. Check the cluster's current HDFS usage.
2. If usage is above some threshold, delete the oldest data in certain directories until usage drops back below that threshold.
I'd expect this to run once per day rather than constantly in the background. It seems more complex than the simple 'delete anything more than x days old' case.
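For illustration, here is a minimal sketch of that first step: checking overall HDFS usage against a threshold. It assumes the `hdfs` CLI is on the PATH and that `hdfs dfs -df /` prints a header line followed by one data line with Size and Used in bytes; the 90% threshold is an arbitrary example, not a recommendation.

```python
#!/usr/bin/env python3
"""Check overall HDFS usage against a threshold.

Assumes the `hdfs` CLI is on the PATH and that `hdfs dfs -df /`
prints a header line followed by one data line of the form:
  Filesystem  Size  Used  Available  Use%
with Size and Used in bytes. The 90% threshold is just an example.
"""
import subprocess

USAGE_THRESHOLD = 0.90  # start cleaning up once the cluster is ~90% full

def hdfs_usage_fraction():
    """Return used/total capacity of the default filesystem as a float."""
    out = subprocess.check_output(["hdfs", "dfs", "-df", "/"]).decode()
    fields = out.strip().splitlines()[-1].split()  # last line = data line
    size_bytes, used_bytes = int(fields[1]), int(fields[2])
    return used_bytes / size_bytes

if __name__ == "__main__":
    usage = hdfs_usage_fraction()
    print("HDFS usage: %.1f%%" % (usage * 100))
    if usage >= USAGE_THRESHOLD:
        print("Above threshold - old data should be deleted")
```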
I've considered writing my own scripts to do this, but since this seems like the sort of task that would be fairly common, I wanted to check whether there's a recommended way of doing it in HDFS. I'm running Ambari, NiFi, Oozie and the rest of the suite.
Have people encountered this sort of requirement before? If so what's an effective way of handling it?
Thanks in advance.
Created 08-21-2017 03:44 PM
Hi @scottgr
Writing scripts is usually the most popular approach. The scripts can be automated in a number of ways, including Oozie or the Workflow Manager UI in Ambari. While it's a common task, each use case is different: we can't make a general assumption about any cluster, since for some people older data may be more important than newer data, depending on the data set. It gets even more complicated in multi-tenant data lakes, as you can imagine.
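As a concrete starting point, here is a hedged sketch of such a script. It shells out to the `hdfs` CLI; the paths in CLEANUP_ROOTS, the 90% threshold, and the assumption that ingest data lands in dated subdirectories are all illustrative placeholders to adapt to your own layout and retention rules.

```python
#!/usr/bin/env python3
"""Sketch: delete the oldest subdirectories under configured ingest paths
until HDFS usage falls back below a threshold.

Assumptions (adjust for your cluster): the `hdfs` CLI is on the PATH,
data lands in per-day subdirectories under CLEANUP_ROOTS, and
`hdfs dfs -ls` prints lines of the form
  perms repl owner group size date time path
-skipTrash is used so the space is freed immediately.
"""
import subprocess

USAGE_THRESHOLD = 0.90            # clean up once the cluster is ~90% full
CLEANUP_ROOTS = ["/data/ingest"]  # hypothetical example paths

def hdfs_usage_fraction():
    """Used/total capacity of the default filesystem, via `hdfs dfs -df /`."""
    out = subprocess.check_output(["hdfs", "dfs", "-df", "/"]).decode()
    fields = out.strip().splitlines()[-1].split()
    return int(fields[2]) / int(fields[1])

def subdirs_oldest_first(root):
    """Direct children of `root`, sorted by modification time, oldest first."""
    out = subprocess.check_output(["hdfs", "dfs", "-ls", root]).decode()
    entries = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 8:        # skip the "Found N items" header line
            continue
        entries.append((fields[5] + " " + fields[6], fields[7]))  # (timestamp, path)
    return [path for _, path in sorted(entries)]

def cleanup():
    for root in CLEANUP_ROOTS:
        for path in subdirs_oldest_first(root):
            if hdfs_usage_fraction() < USAGE_THRESHOLD:
                return
            print("Deleting %s" % path)
            subprocess.check_call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path])

if __name__ == "__main__":
    if hdfs_usage_fraction() >= USAGE_THRESHOLD:
        cleanup()
```

Scheduled once a day, for example from cron or as a shell action in an Oozie coordinator, this gives the "check capacity, then trim the oldest data" behaviour described above.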
Created 08-22-2017 10:34 AM
That's great. Thanks for the response, much appreciated. I'm happy to create a script for our particular use case; I just wanted to make sure I wasn't reinventing the wheel.