Hi guys,
From time to time some NodeManagers get lost in YARN, with the reason reported as "log-dirs are bad: /var/log/hadoop-yarn/container".
I checked the disk space and don't see any issue there. In the ResourceManager log I see:
INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_e37_1509251204123_1378_01_000001, NodeId: avpr-dhc001.lpdomain.com:8041, NodeHttpAddress: avpr-dhc001.lpdomain.com:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 172.16.144.140:8041 }, ] for AM appattempt_1509251204123_1378_000001
2017-10-29 05:08:22,593 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node avpr-dhc001.lpdomain.com:8041 reported UNHEALTHY with details: 1/1 log-dirs are bad: /liveperson/hadoop/log/hadoop-yarn/container
2017-10-29 05:08:22,593 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: avpr-dhc001.lpdomain.com:8041 Node Transitioned from RUNNING to UNHEALTHY
I don't see any related errors in the DataNode or NodeManager logs.
There is no inode exhaustion on the server either.
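
As far as I understand, the NodeManager disk health checker flags a log dir as bad when it is not readable/writable/executable by the NodeManager user, or when disk utilization goes above yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (90% by default, which can trip while df still shows free space). Below is a minimal sketch of an equivalent manual check. The function name check_log_dir and its parameters are hypothetical, it only approximates what the NodeManager does, and it should be run as the same user as the NodeManager (assumed here to be yarn).

```python
import os
import shutil
import sys


def check_log_dir(path, max_util_percent=90.0, min_free_mb=0):
    """Return a list of reasons why `path` might be flagged as a bad dir.

    This is an approximation of the NodeManager disk health check,
    not the actual Hadoop implementation.
    """
    if not os.path.isdir(path):
        return ["not a directory or does not exist"]

    problems = []

    # The directory must be readable, writable and executable by the current user.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        problems.append("missing read/write/execute permission for current user")

    # Compare utilization of the filesystem holding `path` with the threshold.
    usage = shutil.disk_usage(path)
    if usage.total > 0:
        used_percent = 100.0 * usage.used / usage.total
        if used_percent > max_util_percent:
            problems.append(
                "utilization %.1f%% is above the %.1f%% threshold"
                % (used_percent, max_util_percent)
            )

    # Optional minimum free space check (0 disables it in practice).
    free_mb = usage.free / (1024 * 1024)
    if free_mb < min_free_mb:
        problems.append(
            "free space %.0f MB is below the %d MB minimum" % (free_mb, min_free_mb)
        )

    return problems


if __name__ == "__main__":
    # Default path taken from the ResourceManager log line above.
    target = sys.argv[1] if len(sys.argv) > 1 else "/liveperson/hadoop/log/hadoop-yarn/container"
    found = check_log_dir(target)
    print(target, "->", found if found else "looks healthy")
```

Running it periodically on the affected node (for example from cron, as the yarn user) around the time the node turns UNHEALTHY should show whether utilization briefly crosses the threshold.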