I have an HDP 2.6.1 cluster where yarn.log-aggregation.retain-seconds had been set to 30 days for a while, and everything was working properly. Four days ago we changed the property to 15 days and restarted the services. The check interval (yarn.log-aggregation.retain-check-interval-seconds) is left at the default, which works out to one-tenth of the retention time, so we expected logs older than 15 days to be deleted within 1.5 days.
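To make the timing explicit, here is the arithmetic behind that expectation as a small Python sketch. It assumes that a check interval of zero or less means "one-tenth of the retention time", which is how I read yarn-default.xml. The function names are only for illustration and are not part of any Hadoop API.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def retain_seconds(days):
    """Value to put in yarn.log-aggregation.retain-seconds for a given number of days."""
    return days * SECONDS_PER_DAY


def effective_check_interval(retain_secs, configured_interval=-1):
    """Assumed rule: a configured interval <= 0 falls back to one-tenth of the retention time."""
    if configured_interval <= 0:
        return retain_secs // 10
    return configured_interval


old_retention = retain_seconds(30)   # previous setting
new_retention = retain_seconds(15)   # current setting
interval = effective_check_interval(new_retention)

print(old_retention, new_retention, interval, interval / SECONDS_PER_DAY)
# prints: 2592000 1296000 129600 1.5
```

So with the new value of 1296000 seconds, a deletion pass should run roughly every 129600 seconds, and four days should have covered at least two passes.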
For some reason, we are still seeing 30 days of logs kept. The other properties all seem to be set properly. The only weird setting I can find is that we are using the LogAggregationIndexedFileController as our primary file controller class. The LogAggregationTFileController is still available as the second in the list.
I found YARN-8279 (https://issues.apache.org/jira/browse/YARN-8279), which seems somewhat related, except that in our case logs are still being written to the right suffix folder and logs older than 30 days are still being deleted. The cutoff just doesn’t seem to have moved from 30 days to 15 days.
I’ve looked in the logs for the ResourceManager, the Timeline Server, and one of the NameNodes, and found nothing that would explain this. Any ideas on where to look next to figure out what is happening? Additionally, can someone confirm which process the log deletion service actually runs in? Is it the ResourceManager, the Timeline Server, or something else?