I have a system that ingests data into an HDFS cluster which, if left unchecked, would eventually fill all of my storage. As the cluster approaches capacity, I'd like to delete data in certain directories so that it never reaches 100% and blocks further ingest. I'm envisioning a process something like:

- Storage usage on the filesystem goes above 90% of capacity.
- Files are deleted, oldest first, until usage drops back below 90%.

I'd expect this to run once per day rather than constantly in the background. It seems more complex than the usual "delete anything more than x days old" use case. I've considered writing my own scripts to do this, but since this seems like a fairly common task, I wanted to see whether there's a recommended way of doing it in HDFS. I'm running Ambari, NiFi, Oozie and the rest of the suite.

Has anyone encountered this sort of requirement before? If so, what's an effective way of handling it? Thanks in advance.
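For reference, here's a minimal sketch of the "delete oldest first until below threshold" selection logic I have in mind. The function name and the file-list shape are my own assumptions, and the HDFS interaction is abstracted out: in a real script the numbers would come from `hdfs dfs -df` (or `hdfs dfsadmin -report`), the candidates from `hdfs dfs -ls -R <dir>`, and the actual deletion from `hdfs dfs -rm -r`.

```python
# Sketch only: selects which files to delete, oldest first, until
# projected usage falls back below the threshold. In a real script,
# `used_bytes`/`capacity_bytes` would come from `hdfs dfs -df`, the
# candidate list from parsing `hdfs dfs -ls -R` over the deletable
# directories, and each selected path would be passed to `hdfs dfs -rm -r`.

from typing import List, Tuple

def files_to_delete(files: List[Tuple[str, int, int]],
                    used_bytes: int,
                    capacity_bytes: int,
                    threshold: float = 0.90) -> List[str]:
    """Return paths to delete, oldest first, until usage <= threshold.

    `files` is a list of (path, mtime, size_bytes) tuples drawn only
    from the directories that are safe to prune.
    """
    target = threshold * capacity_bytes
    to_delete = []
    # Walk candidates oldest first by modification time.
    for path, _mtime, size in sorted(files, key=lambda f: f[1]):
        if used_bytes <= target:
            break
        to_delete.append(path)
        used_bytes -= size
    return to_delete
```

A cron job or an Oozie coordinator could run this daily. One detail worth noting: deletions would need `-skipTrash` (or a short `fs.trash.interval`), since files sitting in `.Trash` still count against capacity.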