log-dirs are bad: /var/log/hadoop-yarn/container

Master Collaborator

Hi Guys,

From time to time some NodeManagers are lost in YARN with the error "log-dirs are bad: /var/log/hadoop-yarn/container".

I checked the disk space and don't see any issue there. In the ResourceManager log I see:

INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_e37_1509251204123_1378_01_000001, NodeId: avpr-dhc001.lpdomain.com:8041, NodeHttpAddress: avpr-dhc001.lpdomain.com:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 172.16.144.140:8041 }, ] for AM appattempt_1509251204123_1378_000001
2017-10-29 05:08:22,593 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node avpr-dhc001.lpdomain.com:8041 reported UNHEALTHY with details: 1/1 log-dirs are bad: /liveperson/hadoop/log/hadoop-yarn/container
2017-10-29 05:08:22,593 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: avpr-dhc001.lpdomain.com:8041 Node Transitioned from RUNNING to UNHEALTHY
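To see how often this happens and on which nodes, the UNHEALTHY reports can be pulled out of the ResourceManager log. This is only a rough sketch: the regular expression is based on the log lines quoted above, and the function name `find_unhealthy_reports` is made up for illustration.

```python
import re

# Pattern derived from the RMNodeImpl "reported UNHEALTHY" line quoted above.
# Other Hadoop versions may format this line differently (assumption).
UNHEALTHY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"\S*RMNodeImpl: Node (?P<node>\S+) reported UNHEALTHY "
    r"with details: (?P<details>.*)$"
)


def find_unhealthy_reports(lines):
    """Return (timestamp, node, details) for every UNHEALTHY report line."""
    reports = []
    for line in lines:
        match = UNHEALTHY_RE.match(line.strip())
        if match:
            reports.append(
                (match.group("ts"), match.group("node"), match.group("details"))
            )
    return reports
```

It can be fed an open log file directly, e.g. `find_unhealthy_reports(open(path_to_rm_log))`, where `path_to_rm_log` is wherever the ResourceManager log lives on your cluster.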

 

 

I don't see any issues in the DataNode or NodeManager logs, and there is no inode exhaustion on the server.

1 REPLY

Master Collaborator

The problem was the limit on the number of subdirectories in a single directory.

When I checked the container folder, I found 32,000 directories in it, which is the limit.
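For anyone who wants to check this on their own nodes, here is a minimal sketch of counting the subdirectories under the log dir. The helper names and the 90% warning margin are my own choices, and the default limit of 32,000 is simply the number I hit above; your filesystem may have a different limit.

```python
import os


def count_subdirectories(path):
    """Count the immediate subdirectories of path (symlinks are not followed)."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def is_near_limit(count, limit=32000, margin=0.9):
    """Return True once count reaches margin * limit.

    limit=32000 is the value observed in this thread; margin is a
    hypothetical warning threshold, not a YARN setting.
    """
    return count >= limit * margin
```

For example, `is_near_limit(count_subdirectories("/var/log/hadoop-yarn/container"))` could be run from a monitoring script so the node gets cleaned up before it goes UNHEALTHY.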

 

I'm now looking into why the retention is not deleting these files. I have the following configuration:

Log Aggregation Retention Period: 7 days
Job History Files Cleaner Interval: 1 day
Log Retain Duration: 3 hours
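To confirm that old directories really are being left behind, something like the following could list the subdirectories older than a given age. It is a sketch only: it assumes the 3-hour Log Retain Duration (10,800 seconds) is the setting that should apply to these local directories, which I have not verified, and `stale_subdirectories` is a made-up helper name. It only reports directories and does not delete anything.

```python
import os
import time


def stale_subdirectories(path, max_age_seconds, now=None):
    """Return sorted names of immediate subdirectories whose modification
    time is more than max_age_seconds in the past."""
    now = time.time() if now is None else now
    stale = []
    with os.scandir(path) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                stale.append(entry.name)
    return sorted(stale)
```

Calling `stale_subdirectories("/var/log/hadoop-yarn/container", 3 * 60 * 60)` would then show which directories have outlived the 3-hour setting.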