
log-dirs are bad: /var/log/hadoop-yarn/container

Master Collaborator

Hi Guys,

 

From time to time some NodeManagers get marked as lost in YARN with the error "log-dirs are bad: /var/log/hadoop-yarn/container".

 

Looking at the disk space, I don't see any issue there. At the ResourceManager I see:

 

INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_e37_1509251204123_1378_01_000001, NodeId: avpr-dhc001.lpdomain.com:8041, NodeHttpAddress: avpr-dhc001.lpdomain.com:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 172.16.144.140:8041 }, ] for AM appattempt_1509251204123_1378_000001
2017-10-29 05:08:22,593 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node avpr-dhc001.lpdomain.com:8041 reported UNHEALTHY with details: 1/1 log-dirs are bad: /liveperson/hadoop/log/hadoop-yarn/container
2017-10-29 05:08:22,593 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: avpr-dhc001.lpdomain.com:8041 Node Transitioned from RUNNING to UNHEALTHY

I don't see any issue in the DataNode or NodeManager logs.

No inode issue on the server either.
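For reference, these are the kinds of checks I ran (the path here is just an example; use whatever yarn.nodemanager.log-dirs points to on your nodes):

# free disk space on the partition holding the NodeManager log dir
df -h /var/log/hadoop-yarn/container

# free inodes on the same partition
df -i /var/log/hadoop-yarn/container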


1 REPLY

Master Collaborator

The problem was the limit on the number of subdirectories allowed under a single directory.

 

When checking the container folder, I saw about 32,000 subdirectories, which hits the filesystem's per-directory limit (ext3, for example, allows at most 31,998 subdirectories per directory).
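In case it helps anyone, this is roughly how I counted them (again, adjust the path to your own log-dirs setting):

# count entries directly under the container log dir
ls -1 /var/log/hadoop-yarn/container | wc -l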

 

Looking into why retention is not deleting these directories, I have the following configuration:

 

Log Aggregation Retention Period: 7 days
Job History Files Cleaner Interval: 1 day
Log Retain Duration: 3 hours
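As far as I can tell, these Cloudera Manager settings map to the following Hadoop properties (values shown in the units Hadoop expects; please verify against your own config):

yarn.log-aggregation.retain-seconds = 604800         # 7 days
mapreduce.jobhistory.cleaner.interval-ms = 86400000  # 1 day
yarn.nodemanager.log.retain-seconds = 10800          # 3 hours

Note that yarn.nodemanager.log.retain-seconds only applies when log aggregation is disabled; with aggregation enabled, the local container log directories are only removed after the logs have been aggregated.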