It seems this was fixed in the newest Cloudera Manager version: https://www.cloudera.com/documentation/enterprise/release-notes/topics/cm_rn_fixed_issues.html#OPSAPS-40731 (see "Proper termination and cleanup of YARN applications during decommissioning").
Since upgrading to CM 5.10.0, we have observed the following behaviour in our cluster whenever we restart YARN: after a rolling restart of the NodeManagers, about 10% of them become unable to execute tasks successfully. Error message example:
Diagnostics: Failed to create cgroup at /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e18_1489558335225_0257_02_000001
Failing this attempt. Failing the application.
2017-03-15 16:46:20,752 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1489558335225_0257 failed 2 times due to AM Container for appattempt_1489558335225_0257_000002 exited with exitCode: -1
For more detailed output, check application tracking page:......./Then, click on links to logs of each attempt.
Diagnostics: Failed to create cgroup at /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e18_1489558335225_0257_02_000001
Failing this attempt. Failing the application. APPID=application_1489558335225_0257
2017-03-15 16:46:39,761 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Unable to write fail flag file for application appattempt_1489558335225_0258_000002
org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/tmp/hadoop-yarn":hdfs:supergroup:drwxr-xr-x
Inspecting the affected nodes shows that /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn does not exist. We traced the error back to CM's code for handling the re-use and destruction of cgroups in lib/cmf/cgroups.py. As no fix has been released in recent versions of CM, any help with this issue would be greatly appreciated.
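For anyone wanting to check their own nodes for the same symptom, something along these lines might help. It is only a minimal sketch: the path is copied from the error message above, while the function name `cgroup_dir_status` and its return values are made up for illustration and are not part of CM or YARN. It only reports the state of the directory and does not try to repair anything.

```python
import os

# Path copied from the "Failed to create cgroup" diagnostic above.
YARN_CPU_CGROUP = "/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn"


def cgroup_dir_status(path=YARN_CPU_CGROUP):
    """Hypothetical helper: classify the state of a cgroup directory.

    Returns "missing" if the directory does not exist,
    "not-writable" if it exists but the current user cannot write to it,
    and "ok" otherwise.
    """
    if not os.path.isdir(path):
        return "missing"
    if not os.access(path, os.W_OK):
        return "not-writable"
    return "ok"


if __name__ == "__main__":
    print(f"{YARN_CPU_CGROUP}: {cgroup_dir_status()}")
```

Running this on each NodeManager host after a rolling restart (for example over ssh) should show which nodes have lost the hadoop-yarn cgroup directory.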