05-16-2017 03:22 AM
Since upgrading to CM 5.10.0 we have observed the following behaviour in our cluster whenever restarting YARN:
After restarting YARN's NodeManagers (rolling restart), about 10% of the NodeManagers become unable to execute tasks successfully.
Error message example:
Diagnostics: Failed to create cgroup at /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e18_1489558335225_0257_02_000001
Failing this attempt. Failing the application.
2017-03-15 16:46:20,752 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1489558335225_0257 failed 2 times due to AM Container for appattempt_1489558335225_0257_000002 exited with exitCode: -1
For more detailed output, check application tracking page:......./Then, click on links to logs of each attempt.
Diagnostics: Failed to create cgroup at /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e18_1489558335225_0257_02_000001
Failing this attempt. Failing the application. APPID=application_1489558335225_0257
2017-03-15 16:46:39,761 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Unable to write fail flag file for application appattempt_1489558335225_0258_000002
org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/tmp/hadoop-yarn":hdfs:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationPro at
Inspecting the affected nodes shows that /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn does not exist.
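To gauge how many nodes are affected, a simple existence check can be scripted. This is only a sketch: `check_cgroup_dir` is a hypothetical helper, and in practice it would be invoked over ssh against each NodeManager host.

```shell
#!/bin/sh
# Path taken from the error message above.
CGROUP_DIR=/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn

# Hypothetical helper: prints OK if the given directory exists,
# MISSING <path> otherwise. Run it (e.g. via ssh) on each NodeManager.
check_cgroup_dir() {
    if test -d "$1"; then
        echo "OK"
    else
        echo "MISSING $1"
    fi
}

check_cgroup_dir "$CGROUP_DIR"
```

Nodes that print `MISSING` are the ones where containers will fail with the `Failed to create cgroup` diagnostics shown above.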
We could trace the error back to CM's code for handling the re-use and destruction of cgroups in lib/cmf/cgroups.py
As there has not been a fix released with recent versions of CM, any help with this issue would be greatly appreciated.
08-22-2017 02:45 AM
/tmp/hadoop-yarn has the wrong owner in HDFS; it needs to be owned by yarn.
/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn needs to be created on all NodeManagers.
Can you try creating these directories?
hdfs dfs -mkdir /tmp/hadoop-yarn
hdfs dfs -chmod 755 /tmp/hadoop-yarn
hdfs dfs -chown yarn:supergroup /tmp/hadoop-yarn
On the operating system:
for NodeManager in $NodeManagerList; do
    ssh $NodeManager 'mkdir --mode=775 /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn'
    ssh $NodeManager 'chown yarn:hadoop /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn'
done
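After the loop completes, ownership and mode can be spot-checked on each node. This is a sketch assuming GNU coreutils `stat` (the `-c` format flag differs on BSD systems):

```shell
# Spot-check the directory created above: print mode, owner:group, and path.
# On a correctly repaired node this should show: 775 yarn:hadoop <path>
stat -c '%a %U:%G %n' /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn \
    || echo "directory still missing"
```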
If you find another solution, please share it.
This works for me on CDH 5.9.
09-05-2017 10:26 PM
It seems this was fixed in the newest Cloudera Manager version.
See "Proper termination and cleanup of YARN applications during decommissioning"
09-05-2017 10:36 PM