Created 10-31-2019 10:16 AM
Hi
I am running out of space on one of our DataNodes (DN) and found that YARN container logs are consuming around 50GB. There are 3 Spark jobs/applications currently running, which can be seen under the Yarn > Applications tab, but I also see a lot of other application IDs listed. I would like to know whether the ones which are not running could be deleted to reclaim space.
Secondly, are there any precautions to be taken into consideration while deleting these container logs? Any guidance would be of great help.
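For context, this is roughly how I have been checking which application IDs under the container-log directory no longer belong to a running application (just a rough sketch run on the affected DN; the log path is the one our yarn.nodemanager.log-dirs points to):

# Application IDs the ResourceManager still reports as RUNNING
yarn application -list -appStates RUNNING 2>/dev/null | awk '/application_/ {print $1}' > /tmp/running_apps
# Container-log directories that do not belong to any running application
ls /data/yarn/container-logs | grep -vFf /tmp/running_apps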
Regards
Wert
Created 10-31-2019 10:56 AM
Hey,
I think deleting container logs may be a good option to save space. However, if you would like to grab the YARN logs for analysing the old jobs, then you would still need those container logs.
Also, those analyses are usually only needed when a job fails. So, if you are sure those jobs will not be dug into again for any historic insights, then you can feel free to clear them.
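If you do want to keep something around for later analysis, one option is to dump the aggregated logs of the finished applications to files first and then clear the local container logs. Just a sketch (the application ID and archive path below are placeholders), and it relies on YARN log aggregation being enabled:

# Save the aggregated logs of a finished application before clearing its local container logs
yarn logs -applicationId application_1572000000000_0001 > /some/archive/application_1572000000000_0001.log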
Created 10-31-2019 03:02 PM
Created 11-01-2019 12:56 AM
@gsthina Thanks for your reply. Post clean-up, the container logs are at a comfortable 660M.
@EricL the path for the container logs is /data/yarn/container-logs. Since I have you guys with me on this, I would also like to check something about the YARN filecache.
I see it is consuming around 9.3G of space. Is there any way we can have this reduced, ideally in an automated way?
Following are the YARN settings which, I think, control the filecache (a sample yarn-site.xml snippet is below). However, you guys are the experts.
yarn.nodemanager.localizer.cache.target-size-mb = 10GB
yarn.nodemanager.localizer.cache.cleanup.interval-ms = 10 Minutes
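For reference, this is how I believe those two properties look in yarn-site.xml, with the same values expressed in the units the properties actually take (MB and milliseconds):

<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>    <!-- 10 GB: target size of the NM localizer cache (filecache) -->
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>   <!-- 10 minutes: how often the cache cleanup service runs -->
</property>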
Space Used:
du -shc /data/yarn/nm/usercache/MyApplication/*
981M /data/yarn/nm/usercache/MyApplication/appcache
9.3G /data/yarn/nm/usercache/MyApplication/filecache
Created 11-01-2019 02:50 AM
Hi @wert_1311,
Thanks for asking.
Currently, yarn.nodemanager.localizer.cache.target-size-mb and yarn.nodemanager.localizer.cache.cleanup.interval-ms trigger the deletion service only for non-running containers. However, for containers that are running and spilling data to ${yarn.nodemanager.local-dirs}/usercache/<user>/appcache/<app_id>, the deletion service does not come into action; as a result, the filesystem gets full, nodes are marked unhealthy, and applications get stuck.
I suggest you refer to an internal community article [1] which speaks about something similar.
I think the upstream JIRA [YARN-4540] [2] has this documented and it is still unresolved. The general recommendation is to just make that filesystem big enough and, if it still gets full, debug the job that writes too much data into it (a quick way to spot the heavy writers is sketched below).
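To spot the heavy writers on a given node, something along these lines is usually enough (just a sketch; the path is the same one from your du output):

# Top appcache consumers per application on this NodeManager
du -sh /data/yarn/nm/usercache/*/appcache/application_* 2>/dev/null | sort -rh | head -10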
Also, it is OK to delete the usercache dir (ideally with the NodeManager role stopped on those hosts, so that running containers are not affected). Use the following to delete the usercache across the nodes:
# Wipe the YARN usercache on every node (the glob is quoted so it expands on the remote host)
for i in `cat list_of_nodes_in_cluster`; do ssh $i 'rm -rf /data?/yarn/nm/usercache/*' ; done
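If you want to see first how much each node would actually reclaim, a dry run along these lines (same node list, nothing deleted) can help:

# Dry run: per-node usercache usage before deleting anything
for i in `cat list_of_nodes_in_cluster`; do echo "== $i =="; ssh $i 'du -sh /data?/yarn/nm/usercache/* 2>/dev/null'; done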
Please let us know if this is helpful.
[1] https://community.cloudera.com/t5/Support-Questions/yarn-usercache-folder-became-with-huge-size/td-p...
[2] https://issues.apache.org/jira/browse/YARN-4540
Created 11-05-2019 05:15 PM
Hi Gst,
Appreciate your assistance on this so far. Regarding stopping the YARN service in our cluster: that would be difficult, as we have Spark streaming jobs running and stopping those would require a lot of approvals etc. Is there any other way to get the contents inside /data/yarn/nm/usercache/my.application/filecache/* deleted?
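For example, would something along these lines be reasonably safe to run on the DNs while the NodeManagers stay up? Just a sketch from my side; the 7-day age threshold is an arbitrary guess.

# Remove only filecache entries that have not been touched for more than 7 days
find /data/yarn/nm/usercache/my.application/filecache -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +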
Regards
Wert