Created on 02-21-2017 03:38 AM - edited 09-16-2022 04:07 AM
General description:
Premounted cgroups on Ubuntu 14.04 break startup of the NodeManager.
Software version:
OS: ubuntu14.04
Linux kernel: 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 GNU/Linux
Cloudera manager: 5.8.4-1
Cloudera agent: 5.8.4-1
CDH parcel:5.8.2
Detailed description:
Ubuntu 14.04 mounts cgroups automatically at startup once cgroup-lite is installed (for example, docker.io and libvirt-bin depend on it). They are mounted under /sys/fs/cgroup/ like this:
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,relatime,hugetlb)
cloudera-scm-agent successfully detects these mounts and reports them in its log:
[03/Feb/2017 16:45:15 +0000] 1903 MainThread agent INFO Agent starting as pid 1903 user root(0) group root(0).
[21/Feb/2017 10:09:05 +0000] 14054 MainThread agent INFO At least one outstanding cgroup; retaining cgroup mounts
[21/Feb/2017 10:09:08 +0000] 20837 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/cgroups
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO Found existing subsystem cpu at /sys/fs/cgroup/cpu
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO Found existing subsystem cpuacct at /sys/fs/cgroup/cpuacct
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO Found existing subsystem memory at /sys/fs/cgroup/memory
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO Found existing subsystem blkio at /sys/fs/cgroup/blkio
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO Found cgroups subsystem: cpu
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO cgroup pseudofile /sys/fs/cgroup/cpu/cpu.rt_runtime_us does not exist, skipping
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO Found cgroups subsystem: cpuacct
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO Found cgroups subsystem: memory
[21/Feb/2017 10:09:08 +0000] 20837 MainThread cgroups INFO Found cgroups subsystem: blkio
[21/Feb/2017 10:09:08 +0000] 20837 MainThread agent INFO Found cgroups capabilities: {'has_memory': True, 'default_memory_limit_in_bytes': -1, 'default_memory_soft_limit_in_bytes': -1, 'writable_cgroup_dot_procs': True, 'default_cpu_rt_runtime_us': -1, 'has_cpu': True, 'default_blkio_weight': 1000, 'default_cpu_shares': 1024, 'has_cpuacct': True, 'has_blkio': True}
Ubuntu's default policy automatically places each process in a cgroup under a per-user directory, /user/0.user/:
1862 ? Ss 0:27 /usr/lib/cmf/agent/build/env/bin/python /usr/lib/cmf/agent/build/env/bin/supervisord
1872 ? S 0:00 \_ python2.7 /usr/lib/cmf/agent/build/env/bin/cmf-listener -l /var/log/cloudera-scm-agent/cmf_listener.log /run/cloudera-scm-agent/events
2676 ? Sl 2:46 \_ /usr/lib/jvm/java-8-oracle//bin/java -Dproc_datanode -Xmx1000m -Dhdfs.audit.logger=INFO,RFAAUDIT -Dsecurity.audit.logger=INFO,RFAS -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop
22245 ? Sl 0:00 \_ python2.7 /usr/lib/cmf/agent/build/env/bin/flood
21957 ? Ssl 0:01 python2.7 /usr/lib/cmf/agent/build/env/bin/cmf-agent --package_dir /usr/lib/cmf/service --agent_dir /var/run/cloudera-scm-agent --lib_dir /var/lib/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-r

#cat /proc/21957/cgroup
11:name=systemd:/user/0.user/5.session
10:hugetlb:/user/0.user/5.session
9:perf_event:/user/0.user/5.session
8:blkio:/user/0.user/5.session
7:freezer:/user/0.user/5.session
6:devices:/user/0.user/5.session
5:memory:/user/0.user/5.session
4:cpuacct:/user/0.user/5.session
3:cpu:/user/0.user/5.session
2:cpuset:/

#cat /proc/1862/cgroup
11:name=systemd:/user/0.user/c1.session
10:hugetlb:/user/0.user/c1.session
9:perf_event:/user/0.user/c1.session
8:blkio:/user/0.user/c1.session
7:freezer:/user/0.user/c1.session
6:devices:/user/0.user/c1.session
5:memory:/cloudera
4:cpuacct:/user/0.user/c1.session
3:cpu:/cloudera
2:cpuset:/
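For anyone comparing several processes, the /proc/<pid>/cgroup output above can be turned into a controller-to-path mapping. This is only a minimal sketch: parse_proc_cgroup is a hypothetical helper name (it is not part of the Cloudera agent), and the sample text is copied from the output of pid 21957 above.

```python
def parse_proc_cgroup(text):
    """Map each cgroup controller to its path.

    Hypothetical helper for illustration; expects lines in the
    'id:controllers:path' format of /proc/<pid>/cgroup.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only twice so that any ':' inside the path is preserved.
        _hierarchy_id, controllers, path = line.split(':', 2)
        # Several controllers can share one hierarchy, separated by commas.
        for controller in controllers.split(','):
            mapping[controller] = path
    return mapping


# Sample copied from 'cat /proc/21957/cgroup' in this post.
SAMPLE = """\
11:name=systemd:/user/0.user/5.session
5:memory:/user/0.user/5.session
4:cpuacct:/user/0.user/5.session
3:cpu:/user/0.user/5.session
2:cpuset:/
"""

if __name__ == "__main__":
    for controller, path in sorted(parse_proc_cgroup(SAMPLE).items()):
        print(controller, path)
```

On a live system the same function could be fed the contents of /proc/<pid>/cgroup read from disk.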
The corresponding cpu directory structure looks like this after the DataNode has started:
# ll /sys/fs/cgroup/cpu/user/0.user/c1.session/
total 0
drwxr-xr-x 3 root root 0 Feb 21 09:21 ./
drwxr-xr-x 5 root root 0 Feb 20 16:59 ../
drwxr-xr-x 2 root root 0 Feb 20 15:10 757-hdfs-DATANODE/
-rw-r--r-- 1 root root 0 Feb 20 15:10 cgroup.clone_children
--w--w--w- 1 root root 0 Feb 20 15:10 cgroup.event_control
-rw-r--r-- 1 root root 0 Feb 20 15:10 cgroup.procs
-rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Feb 20 15:10 cpu.shares
-r--r--r-- 1 root root 0 Feb 20 15:10 cpu.stat
-rw-r--r-- 1 root root 0 Feb 20 15:10 notify_on_release
-rw-r--r-- 1 root root 0 Feb 20 15:10 tasks
Next, when I try to start the YARN NodeManager from Cloudera Manager, it fails with:
Feb 21, 9:21:49.447 AM INFO org.apache.hadoop.service.AbstractService Service NodeManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:221)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:514)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:561)
Caused by: java.io.IOException: Not able to enforce cpu weights; cannot write to cgroup at: /sys/fs/cgroup/cpu
    at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.initializeControllerPaths(CgroupsLCEResourcesHandler.java:502)
    at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.init(CgroupsLCEResourcesHandler.java:154)
    at org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler.init(CgroupsLCEResourcesHandler.java:137)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:215)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:219)
    ... 3 more
Feb 21, 9:21:49.486 AM DEBUG org.apache.hadoop.service.AbstractService Service: NodeManager entered state STOPPED
In Cloudera Manager, yarn.nodemanager.linux-container-executor.cgroups.hierarchy is set to '/hadoop-yarn'.
I created a /sys/fs/cgroup/cpu/hadoop-yarn cgroup manually and set its permissions to 777 for the yarn user, but got the same error again.
I straced the NodeManager Java process; the last system call was:
[pid 11431] access("/sys/fs/cgroup/cpu/u/s/e/r/0/./u/s/e/r/c/4/./s/e/s/s/i/o/n/hadoop-yarn", W_OK) = -1 ENOENT (No such file or directory)
This looks really strange. It seems something is wrong with the replacements.
Here is the stderr of yarn/yarn.sh ["nodemanager"]:
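The failing access() call can be reproduced outside the JVM, which makes it quick to compare the manually created cgroup with the path the NodeManager actually checked. A minimal sketch: can_write is a hypothetical helper name, and the two paths are the ones that appear in this post.

```python
import os


def can_write(path):
    """Return True if the current user may write to path.

    os.access(path, os.W_OK) performs the same kind of check as the
    access(path, W_OK) system call seen in the strace output.
    """
    return os.access(path, os.W_OK)


if __name__ == "__main__":
    candidates = [
        # The cgroup created by hand.
        "/sys/fs/cgroup/cpu/hadoop-yarn",
        # The path the NodeManager checked, copied from the strace line.
        "/sys/fs/cgroup/cpu/u/s/e/r/0/./u/s/e/r/c/4/./s/e/s/s/i/o/n/hadoop-yarn",
    ]
    for path in candidates:
        print(path, can_write(path))
```

Run it as the yarn user, since the result depends on who is asking.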
+ echo CONF_DIR=/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER
+ echo CMF_CONF_DIR=/etc/cloudera-scm-agent
+ EXCLUDE_CMF_FILES=('cloudera-config.sh' 'httpfs.sh' 'hue.sh' 'impala.sh' 'sqoop.sh' 'supervisor.conf' '*.log' '*.keytab' '*jceks')
++ printf '! -name %s ' cloudera-config.sh httpfs.sh hue.sh impala.sh sqoop.sh supervisor.conf '*.log' yarn.keytab '*jceks'
+ find /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER -type f '!' -path '/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/logs/*' '!' -name cloudera-config.sh '!' -name httpfs.sh '!' -name hue.sh '!' -name impala.sh '!' -name sqoop.sh '!' -name supervisor.conf '!' -name '*.log' '!' -name yarn.keytab '!' -name '*jceks' -exec perl -pi -e 's#{{CMF_CONF_DIR}}#/run/cloudera-scm-agent/process/797-yarn-NODEMANAGER#g' '{}' ';'
Can't open /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/container-executor.cfg: Permission denied.
+ perl -pi -e 's#{{CGROUP_GROUP_CPU}}#u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n#g' /run/cloudera-scm-agent/process/797-yarn-NODEMANAGER/yarn-site.xml
I checked further and found that the bug is in the method update_process_environment_for_cgroups, at line 3318 of /usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.8.4-py2.7.egg/cmf/agent.py:
group = '/'.join(group)
This is what it does to the group string:
[21/Feb/2017 11:23:41 +0000] 33551 MainThread agent INFO Set ENV from agent cgroups before '/'.join(group) CPU user/0.user/4.session
[21/Feb/2017 11:23:41 +0000] 33551 MainThread agent INFO Set ENV from agent cgroups after '/'.join(group) CPU u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n
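The mangled value is what Python's str.join produces when it is handed a string instead of a list: a string is iterated character by character, so the separator ends up between every character. The sketch below reproduces the logged value. join_group is a hypothetical guard written for illustration, not the actual agent code, and it assumes group may arrive either as a list of path components or as an already-joined string.

```python
def join_group(group):
    """Join cgroup path components with '/'.

    Hypothetical guard, not taken from agent.py: if group is already a
    string it is returned unchanged instead of being joined per character.
    (On Python 2.7 a unicode value would need the same check.)
    """
    if isinstance(group, str):
        return group
    return '/'.join(group)


as_string = 'user/0.user/4.session'
as_list = ['user', '0.user', '4.session']

# Joining a string iterates over its characters, including the '/' ones.
mangled = '/'.join(as_string)

if __name__ == "__main__":
    print(mangled)               # u/s/e/r///0/./u/s/e/r///4/./s/e/s/s/i/o/n
    print('/'.join(as_list))     # user/0.user/4.session
    print(join_group(as_string)) # user/0.user/4.session
```

Whether the right fix is a guard like this or making the caller pass a list depends on how the rest of agent.py uses group, which I have not checked.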
Is it intended to be that way? Can you fix it? Potentially this affects not only NodeManager but Impala too.
Thanks,
Alexander Yasnogor
Created 04-14-2017 12:30 PM
We are having the same issue with CDH 5.5.4/CM 5.7.2 and with CDH/CM 5.10.1.