Created on 05-16-2017 03:22 AM - edited 09-16-2022 04:37 AM
Since upgrading to CM 5.10.0, we have observed the following behaviour in our cluster whenever we restart YARN:
After a rolling restart of YARN's NodeManagers, about 10% of the NodeManagers become unable to execute tasks successfully.
Error message example:
Diagnostics: Failed to create cgroup at /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e18_1489558335225_0257_02_000001
Failing this attempt. Failing the application.
2017-03-15 16:46:20,752 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1489558335225_0257 failed 2 times due to AM Container for appattempt_1489558335225_0257_000002 exited with exitCode: -1
For more detailed output, check application tracking page:......./Then, click on links to logs of each attempt.
Diagnostics: Failed to create cgroup at /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e18_1489558335225_0257_02_000001
Failing this attempt. Failing the application. APPID=application_1489558335225_0257
2017-03-15 16:46:39,761 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Unable to write fail flag file for application appattempt_1489558335225_0258_000002
org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/tmp/hadoop-yarn":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationPro
at
Inspecting the affected nodes shows that /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn does not exist.
We could trace the error back to CM's code for handling the re-use and destruction of cgroups in lib/cmf/cgroups.py
As there has not been a fix released with recent versions of CM, any help with this issue would be greatly appreciated.
/tmp/hadoop-yarn has the wrong owner; it needs to be owned by yarn.
/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn needs to be created on all NodeManagers.
Can you try creating these directories?
hdfs dfs -mkdir /tmp/hadoop-yarn
hdfs dfs -chmod 755 /tmp/hadoop-yarn
hdfs dfs -chown yarn:supergroup /tmp/hadoop-yarn
On the operating system:
for NodeManager in $NodeManegerList
do
  ssh $NodeManager 'mkdir --mode=775 /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn'
  ssh $NodeManager 'chown yarn:hadoop /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn'
done
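Since only about 10% of the NodeManagers were reported as affected, it may be worth checking which nodes are actually missing the directory before creating it everywhere. Below is a minimal sketch of such a check; check_cgroup_dir is a hypothetical helper name (not part of CM or YARN), and the default path is simply the one from the error message above.

```shell
# Hypothetical helper, not part of Cloudera Manager or YARN.
# Reports whether the cgroup directory exists on the local node.
# Usage: check_cgroup_dir [DIR]
check_cgroup_dir() {
    # Default path is the one from the error message in this thread;
    # pass a different directory as the first argument to override it.
    dir="${1:-/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn}"
    if [ -d "$dir" ]; then
        echo "present: $dir"
        return 0
    else
        echo "missing: $dir"
        return 1
    fi
}
```

The same test can be run remotely from the loop above, for example with ssh $NodeManager 'test -d /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn' || echo "$NodeManager: missing", so that mkdir and chown are only applied to the nodes that need them.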
If you find another solution, please share it.
This works for me on CDH 5.9.
It seems this was fixed in the newest Cloudera Manager version.
See "Proper termination and cleanup of YARN applications during decommissioning"