I've noticed that the Spark History Server doesn't clean up any (really old) .inprogress files, which makes sense in a way, as it can't distinguish between applications that are actually still running and those that are not.
Is there an easy way to automate this cleanup? We've got files here going back to 2016.
@Jan De Luyck
Set the parameters below in the "Custom spark-defaults" config section in Ambari (or in spark-defaults.conf if you are not using Ambari) to have the History Server clean up these old logs:
spark.history.fs.cleaner.enabled=true
spark.history.fs.cleaner.interval=1d
spark.history.fs.cleaner.maxAge=5d
From the files alone, it is very difficult to tell which .inprogress files belong to applications that are still active.
Please look for the RUNNING jobs in the RM WebUI and remove all the other .inprogress files that are not listed in RUNNING state.
To check for the RUNNING jobs in the RM WebUI, please follow these steps:
1. Log in to Cloudera Manager.
2. Choose YARN as the service.
3. Click Web UI.
4. Choose ResourceManager Web UI.
5. A new screen will be displayed showing the list of all applications.
6. On the left-hand side there are links under the "Applications" heading. Click the "Running" link.
7. The "Running" view shows all the jobs that are currently active.
8. Compare this list against the .inprogress files and remove every .inprogress file whose job is not listed in RUNNING state.
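
If you want to automate the comparison in the last step, here is a minimal Python sketch. It is only an illustration and makes a few assumptions: the event log file names contain the YARN application ID (application_<timestamp>_<sequence>), you have already collected the file names (for example from hdfs dfs -ls on your event log directory) and the IDs of the running applications (for example from yarn application -list -appStates RUNNING). The function name stale_inprogress_files and the sample values are made up for this example. It only lists candidates and does not delete anything.

```python
import re

# YARN application IDs look like application_<clusterTimestamp>_<sequence>.
APP_ID = re.compile(r"application_\d+_\d+")


def stale_inprogress_files(filenames, running_app_ids):
    """Return the .inprogress files whose application ID is not running.

    Hypothetical helper: filenames and running_app_ids are assumed to be
    gathered beforehand (e.g. from `hdfs dfs -ls` and
    `yarn application -list -appStates RUNNING`).
    """
    running = set(running_app_ids)
    stale = []
    for name in filenames:
        if not name.endswith(".inprogress"):
            continue  # completed event logs are left alone
        match = APP_ID.search(name)
        if match is None:
            continue  # no YARN ID in the name (e.g. local mode): skip, do not guess
        if match.group(0) not in running:
            stale.append(name)
    return stale


if __name__ == "__main__":
    # Sample values, invented for illustration only.
    files = [
        "application_1480000000000_0001.inprogress",
        "application_1600000000000_0042_1.inprogress",
        "application_1600000000000_0041",
        "local-1600000000000.inprogress",
    ]
    running = {"application_1600000000000_0042"}
    for name in stale_inprogress_files(files, running):
        print(name)
```

Review the printed list before removing anything. Files without a recognizable application ID are deliberately skipped, so they still need a manual check.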