Created 10-31-2019 10:16 AM
Hi
I am running out of space on one of our DataNodes (DN) and found that YARN container logs are consuming around 50GB. There are 3 Spark jobs/applications currently running, which can be seen under the Yarn > Applications tab, but I also see a lot of other application IDs listed. I would like to know whether the ones which are not running could be deleted to reclaim space.
Secondly, are there any precautions to be taken into consideration while deleting these container logs? Any guidance would be of great help.
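For context, this is roughly how I have been checking which application IDs under the container-log directory no longer belong to a running application (just a rough sketch run on the affected DN; the log path is the one our yarn.nodemanager.log-dirs points to):

# Application IDs the ResourceManager still reports as RUNNING
yarn application -list -appStates RUNNING 2>/dev/null | awk '/application_/ {print $1}' > /tmp/running_apps
# Container-log directories that do not belong to any running application
ls /data/yarn/container-logs | grep -vFf /tmp/running_apps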
Regards
Wert
Created 10-31-2019 10:56 AM
Hey,
I think deleting container logs may be a good option to save space. However, if you would like to grab the YARN logs for analysing the old jobs, then you would still need those container logs.
Also, those analyses are usually only needed when a job fails. So, if you are sure those jobs will not be dug into again for any historic insights, then you can feel free to clear them.
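If you do want to keep something around for later analysis, one option is to dump the aggregated logs of the finished applications to files first and then clear the local container logs. Just a sketch (the application ID and archive path below are placeholders), and it relies on YARN log aggregation being enabled:

# Save the aggregated logs of a finished application before clearing its local container logs
yarn logs -applicationId application_1572000000000_0001 > /some/archive/application_1572000000000_0001.log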
Created 10-31-2019 03:02 PM
Created 11-01-2019 12:56 AM
@gsthina Thanks for your reply. Post clean-up, the container logs are at a comfortable 660M.
@EricL the path for the container logs is /data/yarn/container-logs. Since I have you guys with me on this, I would also like to check something about the YARN filecache.
I see it is consuming around 9.3G of space. Is there any way we can have this reduced, ideally in an automated way?
Following are the YARN settings which, I think, control the filecache (a sample yarn-site.xml snippet is below). However, you guys are the experts.
yarn.nodemanager.localizer.cache.target-size-mb = 10GB
yarn.nodemanager.localizer.cache.cleanup.interval-ms = 10 Minutes
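For reference, this is how I believe those two properties look in yarn-site.xml, with the same values expressed in the units the properties actually take (MB and milliseconds):

<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>    <!-- 10 GB: target size of the NM localizer cache (filecache) -->
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>   <!-- 10 minutes: how often the cache cleanup service runs -->
</property>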
Space Used:
du -shc /data/yarn/nm/usercache/MyApplication/*
981M /data/yarn/nm/usercache/MyApplication/appcache
9.3G /data/yarn/nm/usercache/MyApplication/filecache
Created 11-01-2019 02:50 AM
Hi @wert_1311,
Thanks for asking.
Currently, yarn.nodemanager.localizer.cache.target-size-mb and yarn.nodemanager.localizer.cache.cleanup.interval-ms trigger the deletion service only for non-running containers. However, for containers that are running and spilling data to ${yarn.nodemanager.local-dirs}/usercache/<user>/appcache/<app_id>, the deletion service does not come into action; as a result, the filesystem gets full, nodes are marked unhealthy, and applications get stuck.
I suggest you refer to an internal community article [1] which speaks about something similar.
I think the upstream JIRA [YARN-4540] [2] has this documented and it is still unresolved. The general recommendation is to just make that filesystem big enough and, if it still gets full, debug the job that writes too much data into it (a quick way to spot the heavy writers is sketched below).
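To spot the heavy writers on a given node, something along these lines is usually enough (just a sketch; the path is the same one from your du output):

# Top appcache consumers per application on this NodeManager
du -sh /data/yarn/nm/usercache/*/appcache/application_* 2>/dev/null | sort -rh | head -10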
Also, it is OK to delete the usercache dir (ideally with the NodeManager role stopped on those hosts, so that running containers are not affected). Use the following to delete the usercache across the nodes:
# Wipe the YARN usercache on every node (the glob is quoted so it expands on the remote host)
for i in `cat list_of_nodes_in_cluster`; do ssh $i 'rm -rf /data?/yarn/nm/usercache/*' ; done
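If you want to see first how much each node would actually reclaim, a dry run along these lines (same node list, nothing deleted) can help:

# Dry run: per-node usercache usage before deleting anything
for i in `cat list_of_nodes_in_cluster`; do echo "== $i =="; ssh $i 'du -sh /data?/yarn/nm/usercache/* 2>/dev/null'; done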
Please let us know if this is helpful.
[1] https://community.cloudera.com/t5/Support-Questions/yarn-usercache-folder-became-with-huge-size/td-p...
[2] https://issues.apache.org/jira/browse/YARN-4540
Created 11-05-2019 05:15 PM
Hi Gst,
Appreciate your assistance on this so far. Regarding stopping the YARN service in our cluster: that would be difficult, as we have Spark streaming jobs running and stopping those would require a lot of approvals etc. Is there any other way to get the contents inside /data/yarn/nm/usercache/my.application/filecache/* deleted?
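For example, would something along these lines be reasonably safe to run on the DNs while the NodeManagers stay up? Just a sketch from my side; the 7-day age threshold is an arbitrary guess.

# Remove only filecache entries that have not been touched for more than 7 days
find /data/yarn/nm/usercache/my.application/filecache -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +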
Regards
Wert