New Contributor
Posts: 4
Registered: ‎05-14-2017

Failed applications on Yarn Nodemanager caused by CM's cgroups handling after cluster restart

Since upgrading to CM 5.10.0, we have observed the following behaviour in our cluster whenever we restart YARN:

 

After a rolling restart of YARN's NodeManagers, about 10% of them are unable to execute tasks successfully.

 

Error message example:

 

Diagnostics: Failed to create cgroup at /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e18_1489558335225_0257_02_000001
Failing this attempt. Failing the application.
2017-03-15 16:46:20,752 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs    OPERATION=Application Finished - Failed    TARGET=RMAppManager    RESULT=FAILURE    DESCRIPTION=App failed with state: FAILED    PERMISSIONS=Application application_1489558335225_0257 failed 2 times due to AM Container for appattempt_1489558335225_0257_000002 exited with  exitCode: -1
For more detailed output, check application tracking page:......./Then, click on links to logs of each attempt.
Diagnostics: Failed to create cgroup at /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e18_1489558335225_0257_02_000001
Failing this attempt. Failing the application.    APPID=application_1489558335225_0257
2017-03-15 16:46:39,761 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Unable to write fail flag file for application appattempt_1489558335225_0258_000002
org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/tmp/hadoop-yarn":hdfs:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationPro
    at 

 

Inspecting the affected nodes shows that /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn does not exist.
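
For anyone who wants to check a node for this condition, a minimal sketch follows. The function name check_yarn_cgroup is made up for illustration; the default path is the one from the error message above. It only tests whether the directory exists and does not change anything.

```shell
# Hypothetical helper (not part of CM or YARN): report whether the YARN
# cpu cgroup directory exists on the local host.
# Default path is taken from the "Failed to create cgroup" error message.
check_yarn_cgroup() {
  dir="${1:-/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn}"
  if [ -d "$dir" ]; then
    echo "OK: $dir"
  else
    echo "MISSING: $dir"
    return 1
  fi
}
```

Run check_yarn_cgroup on each NodeManager host (for example over ssh) after a restart; hosts that print MISSING are the ones expected to fail containers with the error above.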

 

We traced the error back to CM's code for handling the re-use and destruction of cgroups in lib/cmf/cgroups.py.

 

As no fix has been released in recent versions of CM, any help with this issue would be greatly appreciated.

Explorer
Posts: 16
Registered: ‎01-11-2017

Re: Failed applications on Yarn Nodemanager caused by CM's cgroups handling after cluster restart

Hi tasch,

 

/tmp/hadoop-yarn has the wrong owner; it needs to be yarn.

 

/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn needs to be created on all NodeManagers.

 

Can you try to create these directories?

 

In HDFS:

 

hdfs dfs -mkdir -p /tmp/hadoop-yarn
hdfs dfs -chmod 755 /tmp/hadoop-yarn
hdfs dfs -chown yarn:supergroup /tmp/hadoop-yarn

In the operating system:

# NodeManagerList must be set to a space-separated list of NodeManager hostnames
for NodeManager in $NodeManagerList
do
  ssh "$NodeManager" 'mkdir --mode=775 /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn'
  ssh "$NodeManager" 'chown yarn:hadoop /var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn'
done

 

If you find another solution, please share it.

 

It works for me in CDH 5.9.

 

Regards,

Marc Casajús.

New Contributor
Posts: 4
Registered: ‎05-14-2017

Re: Failed applications on Yarn Nodemanager caused by CM's cgroups handling after cluster restart

It seems this was fixed in the newest Cloudera Manager version:

 

https://www.cloudera.com/documentation/enterprise/release-notes/topics/cm_rn_fixed_issues.html#OPSAP...

See "Proper termination and cleanup of YARN applications during decommissioning"

Explorer
Posts: 16
Registered: ‎01-11-2017

Re: Failed applications on Yarn Nodemanager caused by CM's cgroups handling after cluster restart

It's possible, but if you cannot upgrade to the latest version, you can try my steps to recreate the directories manually.

Regards,
Marc.