Premounted cgroups don't work with NodeManager

New Contributor

General description:

Premounted cgroups on Ubuntu 14.04 break the startup of NodeManager.

 

Software version:

OS: Ubuntu 14.04

Linux kernel: 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 GNU/Linux

Cloudera Manager: 5.8.4-1

Cloudera agent: 5.8.4-1

CDH parcel: 5.8.2

 

Detailed description:

Ubuntu 14.04 mounts cgroups automatically on startup after installation of cgroup-lite (e.g., docker.io and libvirt-bin depend on it) to /sys/fs/cgroup/ like this:

cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,relatime,hugetlb)

 

cloudera-scm-agent successfully detects these mounts and reports them in its log:

[03/Feb/2017 16:45:15 +0000] 1903 MainThread agent        INFO     Agent starting as pid 1903 user root(0) group root(0).
[21/Feb/2017 10:09:05 +0000] 14054 MainThread agent        INFO     At least one outstanding cgroup; retaining cgroup mounts
[21/Feb/2017 10:09:08 +0000] 20837 MainThread agent        INFO     Re-using pre-existing directory: /run/cloudera-scm-agent/cgroups
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found existing subsystem cpu at /sys/fs/cgroup/cpu
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found existing subsystem cpuacct at /sys/fs/cgroup/cpuacct
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found existing subsystem memory at /sys/fs/cgroup/memory
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found existing subsystem blkio at /sys/fs/cgroup/blkio
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found cgroups subsystem: cpu
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     cgroup pseudofile /sys/fs/cgroup/cpu/cpu.rt_runtime_us does not exist, skipping
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found cgroups subsystem: cpuacct
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found cgroups subsystem: memory
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups      INFO     Found cgroups subsystem: blkio
[21/Feb/2017 10:09:08 +0000] 20837 MainThread agent        INFO     Found cgroups capabilities: {'has_memory': True, 'default_memory_limit_in_bytes': -1, 'default_memory_soft_limit_in_bytes': -1, 'writable_cgroup_dot_procs': True, 'default_cpu_rt_runtime_us': -1, 'has_cpu': True, 'default_blkio_weight': 1000, 'default_cpu_shares': 1024, 'has_cpuacct': True, 'has_blkio': True}

 

Ubuntu's default policies automatically place processes in the default location under the dedicated per-user folder /user/0.user/:

 1862 ?        Ss     0:27 /usr/lib/cmf/agent/build/env/bin/python /usr/lib/cmf/agent/build/env/bin/supervisord
 1872 ?        S      0:00  \_ python2.7 /usr/lib/cmf/agent/build/env/bin/cmf-listener -l /var/log/cloudera-scm-agent/cmf_listener.log /run/cloudera-scm-agent/events
 2676 ?        Sl     2:46  \_ /usr/lib/jvm/java-8-oracle//bin/java -Dproc_datanode -Xmx1000m -Dhdfs.audit.logger=INFO,RFAAUDIT -Dsecurity.audit.logger=INFO,RFAS -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop
22245 ?        Sl     0:00  \_ python2.7 /usr/lib/cmf/agent/build/env/bin/flood
21957 ?        Ssl    0:01 python2.7 /usr/lib/cmf/agent/build/env/bin/cmf-agent --package_dir /usr/lib/cmf/service --agent_dir /var/run/cloudera-scm-agent --lib_dir /var/lib/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-r


#cat /proc/21957/cgroup
11:name=systemd:/user/0.user/5.session
10:hugetlb:/user/0.user/5.session
9:perf_event:/user/0.user/5.session
8:blkio:/user/0.user/5.session
7:freezer:/user/0.user/5.session
6:devices:/user/0.user/5.session
5:memory:/user/0.user/5.session
4:cpuacct:/user/0.user/5.session
3:cpu:/user/0.user/5.session
2:cpuset:/

#cat /proc/1862/cgroup
11:name=systemd:/user/0.user/c1.session
10:hugetlb:/user/0.user/c1.session
9:perf_event:/user/0.user/c1.session
8:blkio:/user/0.user/c1.session
7:freezer:/user/0.user/c1.session
6:devices:/user/0.user/c1.session
5:memory:/cloudera
4:cpuacct:/user/0.user/c1.session
3:cpu:/cloudera
2:cpuset:/
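For reference, each line in /proc/&lt;pid&gt;/cgroup has the form hierarchy-ID:subsystem-list:path. A small standalone Python sketch (not part of the agent code; names are mine) that turns such lines into a subsystem-to-path map, which is essentially the information the agent works with:

```python
def parse_cgroup_lines(lines):
    """Map each cgroup subsystem to the path the process is placed in."""
    mapping = {}
    for line in lines:
        # Format: "ID:subsystem[,subsystem...]:path"; split on the first
        # two colons only, since the path itself contains no further fields.
        _hier_id, subsystems, path = line.strip().split(':', 2)
        for subsys in subsystems.split(','):
            mapping[subsys] = path
    return mapping

sample = [
    "3:cpu:/cloudera",
    "5:memory:/cloudera",
    "8:blkio:/user/0.user/c1.session",
]
print(parse_cgroup_lines(sample))
# {'cpu': '/cloudera', 'memory': '/cloudera', 'blkio': '/user/0.user/c1.session'}
```

Note how in the output for pid 1862 above, the cpu and memory subsystems point at /cloudera (managed by the agent) while the rest stay in the session cgroup.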

 

The corresponding cpu folder structure looks like this after the DataNode has started:

# ll /sys/fs/cgroup/cpu/user/0.user/c1.session/
total 0
drwxr-xr-x 3 root root 0 Feb 21 09:21 ./
drwxr-xr-x 5 root root 0 Feb 20 16:59 ../
drwxr-xr-x 2 root root 0 Feb 20 15:10 757-hdfs-DATANODE/
-rw-r--r-- 1 root root 0 Feb 20 15:10 cgroup.clone_children
--w--w--w- 1 root root 0 Feb 20 15:10 cgroup.event_control
-rw-r--r-- 1 root root 0 Feb 20 15:10 cgroup.procs
-rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.shares
-r--r--r-- 1 root root 0 Feb 20 15:10 cpu.stat
-rw-r--r-- 1 root root 0 Feb 20 15:10 notify_on_release
-rw-r--r-- 1 root root 0 Feb 20 15:10 tasks

Next, when I try to start the YARN NodeManager from Cloudera Manager, I get:

 

Feb 21, 9:21:49.447 AM	INFO	org.apache.hadoop.service.AbstractService	
Service NodeManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:221)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:514)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:561)
Caused by: java.io.IOException: Not able to enforce cpu weights; cannot write to cgroup at: /sys/fs/cgroup/cpu
	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.initializeControllerPaths(CgroupsLCEResourcesHandler.java:502)
	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.init(CgroupsLCEResourcesHandler.java:154)
	at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.init(CgroupsLCEResourcesHandler.java:137)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:215)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:219)
	... 3 more
Feb 21, 9:21:49.486 AM	DEBUG	org.apache.hadoop.service.AbstractService	
Service: NodeManager entered state STOPPED

In Cloudera Manager, yarn.nodemanager.linux-container-executor.cgroups.hierarchy is set to '/hadoop-yarn'.

I created the /sys/fs/cgroup/cpu/hadoop-yarn cgroup manually and gave it mode 777 so the yarn user could write to it, but got the same error again.
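As far as I can tell, the "cannot write to cgroup at" IOException boils down to the container executor checking whether the NodeManager process can write to &lt;cgroup mount&gt;/&lt;configured hierarchy&gt;. A rough Python sketch of that check (my own illustration, not the actual Hadoop code; demonstrated against a throwaway directory standing in for /sys/fs/cgroup/cpu):

```python
import os
import tempfile

def can_use_cgroup(mount_point, hierarchy):
    """Model of the write-access check: the hierarchy directory must exist
    under the subsystem mount and be writable by the current process.
    A nonexistent path fails just like the ENOENT seen in the strace below."""
    path = os.path.join(mount_point, hierarchy.lstrip('/'))
    return os.path.isdir(path) and os.access(path, os.W_OK)

# Stand-in mount point so the sketch runs anywhere.
mount = tempfile.mkdtemp()
os.mkdir(os.path.join(mount, 'hadoop-yarn'))
print(can_use_cgroup(mount, '/hadoop-yarn'))   # True: exists and writable
print(can_use_cgroup(mount, '/mangled/path'))  # False: path does not exist
```

So even with the hadoop-yarn cgroup created and writable, the check fails if YARN is handed a mangled hierarchy path that doesn't exist, which is what turned out to happen below.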

I straced the NodeManager Java process, and the last relevant system call was:
 [pid 11431] access("/sys/fs/cgroup/cpu/u/s/e/r/0/./u/s/e/r/c/4/./s/e/s/s/i/o/n/hadoop-yarn", W_OK) = -1 ENOENT (No such file or directory)

This looks really strange; something seems to be wrong with the placeholder replacements.

Here is the stderr of yarn/yarn.sh ["nodemanager"]:

 

+ echo CONF_DIR=/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER
+ echo CMF_CONF_DIR=/etc/cloudera-scm-agent
+ EXCLUDE_CMF_FILES=('cloudera-config.sh' 'httpfs.sh' 'hue.sh' 'impala.sh' 'sqoop.sh' 'supervisor.conf' '*.log' '*.keytab' '*jceks')
++ printf '! -name %s ' cloudera-config.sh httpfs.sh hue.sh impala.sh sqoop.sh supervisor.conf '*.log' yarn.keytab '*jceks'
+ find /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER -type f '!' -path '/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/logs/*' '!' -name cloudera-config.sh '!' -name httpfs.sh '!' -name hue.sh '!' -name impala.sh '!' -name sqoop.sh '!' -name supervisor.conf '!' -name '*.log' '!' -name yarn.keytab '!' -name '*jceks' -exec perl -pi -e 's#{{CMF_CONF_DIR}}#/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER#g' '{}' ';'
Can't open /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/container-executor.cfg: Permission denied.
+ perl -pi -e 's#{{CGROUP_GROUP_CPU}}#u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n#g' /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/yarn-site.xml

 

 

I checked further and found that the bug is in agent.py (/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.8.4-py2.7.egg/cmf/agent.py), in the method update_process_environment_for_cgroups, at line 3318:

 group = '/'.join(group)

which does the following to the group string:

 

 

[21/Feb/2017 11:23:41 +0000] 33551 MainThread agent        INFO     Set ENV from agent cgroups before '/'.join(group) CPU user/0.user/4.session
[21/Feb/2017 11:23:41 +0000] 33551 MainThread agent        INFO     Set ENV from agent cgroups after '/'.join(group) CPU u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n
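For anyone reproducing this: the mangling above is exactly what str.join does when handed a string instead of a list or tuple, because a string is itself an iterable of single characters. A minimal standalone sketch (plain Python, independent of the agent code):

```python
# '/'.join expects an iterable of strings. Given a list of path components,
# it joins them as intended:
group_as_list = ['user', '0.user', '4.session']
print('/'.join(group_as_list))
# user/0.user/4.session

# Given a plain string, it iterates character by character, inserting '/'
# between every character -- reproducing the mangled path from the agent log:
group_as_string = 'user/0.user/4.session'
print('/'.join(group_as_string))
# u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n
```

So line 3318 appears to assume `group` is a list of path components, but in this code path it already arrives as a joined string.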

 

Is it intended to work that way? Can you fix it? Potentially this affects not only NodeManager but also Impala.

 

Thanks,

Alexander Yasnogor

 

1 REPLY

Explorer

We are having the same issue with CDH 5.5.4/CM 5.7.2 and with CDH/CM 5.10.1.