Created on 10-26-2022 07:36 PM - edited 10-26-2022 07:38 PM
Symptoms:
In versions prior to CDH 6.3.1, NodeManagers (NMs) can enter an unhealthy state, with the errors below observed in the NM logs:
2022-10-20 15:31:32,487 ERROR logaggregation.AggregatedLogFormat (AggregatedLogFormat.java:logErrorMessage(299)) - Error aggregating log file. Log file : /hadoop/ssd01/yarn/log/application_1665989140069_135925/container_e93_1665989140069_135925_01_000002/history.txt.appattempt_1665989140069_135925_000001. /hadoop/ssd01/yarn/log/application_1665989140069_135925/container_e93_1665989140069_135925_01_000002/history.txt.appattempt_1665989140069_135925_000001 (Permission denied)
2022-10-20 15:28:19,556 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exit code: 35
2022-10-20 15:28:19,556 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exception message: Launch container failed
2022-10-20 15:28:19,556 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Shell error output: Could not create container dirsCould not create local files and directories
2022-10-20 15:28:19,557 ERROR launcher.ContainerLaunch (ContainerLaunch.java:call(327)) - Failed to launch container due to configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
The permissions on the directories configured in yarn_nodemanager_local_dirs need to be checked and corrected if they are wrong.
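The permission check can be sketched in Python as below. This is only an illustration, not a Cloudera-provided tool: check_dir is a hypothetical helper, and the defaults (owner yarn, group hadoop, mode 755) are assumptions taken from the layout the Apache Hadoop secure-mode documentation recommends for NodeManager local directories, so confirm the expected values for your own cluster before acting on the output.

```python
import grp
import os
import pwd
import stat


def owner_name(uid):
    """Resolve a uid to a user name, falling back to the numeric id."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def group_name(gid):
    """Resolve a gid to a group name, falling back to the numeric id."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def check_dir(path, expected_owner="yarn", expected_group="hadoop",
              expected_mode=0o755):
    """Return a list of mismatches for one directory (empty if it looks fine).

    The default owner, group, and mode are assumptions; pass the values
    that are correct for your cluster.
    """
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"{path}: cannot stat ({exc.strerror})"]
    if not stat.S_ISDIR(st.st_mode):
        return [f"{path}: not a directory"]

    problems = []
    owner = owner_name(st.st_uid)
    group = group_name(st.st_gid)
    mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if owner != expected_owner:
        problems.append(f"{path}: owner is {owner}, expected {expected_owner}")
    if group != expected_group:
        problems.append(f"{path}: group is {group}, expected {expected_group}")
    if mode != expected_mode:
        problems.append(f"{path}: mode is {mode:o}, expected {expected_mode:o}")
    return problems
```

Call check_dir once for each directory listed in yarn_nodemanager_local_dirs on the affected host and review any mismatches it returns. The helper only reports; it does not change ownership or permissions.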
The underlying issue is that most of these exit codes do not meet the criteria under which the NM should be marked unhealthy. Based on the above, we might be hitting a known issue