
Processors at 100% during CDH 5.2.0 Express installation


Hi,

I'm experiencing problems installing CDH 5.2.0 Express via Cloudera Manager.

 

In short

The processors on the node from which I started the installation become far too busy right after Step 4 (Cluster Installation) completes.

 

In detail

3 VMware nodes (say, namenode, datanode01, datanode02), each of which has:

- 2 processors, 4GB of RAM, 50GB of disk reserved,

- a fresh Ubuntu 14.04 LTS Server installation (only OpenSSH as an extra package).

The physical machine is a 4-core, 16 GB RAM Windows 7 Professional box.

 

I doubt the number of cores is the problem (4 physical vs. 2 + 2 + 2 virtual, plus something for the host), since the processors on the two datanodes are nearly idle.

 

The network appears to be properly configured: each node reaches the others, and both DNS and reverse DNS queries succeed. I didn't disable IPv6, though. Here's an excerpt from my hosts file (there should be no need to populate this file when using DNS, but just in case...):

...

192.168.0.70    namenode.my.domain namenode

192.168.0.71    datanode01.my.domain datanode01

192.168.0.72    datanode02.my.domain datanode02

 ...
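
For completeness, the checks I ran from each node were roughly of this kind (the exact commands may have differed slightly; "host" comes with Ubuntu's bind9-host package):

host namenode.my.domain            # forward lookup, should return 192.168.0.70
host 192.168.0.70                  # reverse lookup, should return namenode.my.domain
hostname -f                        # should print the node's own FQDN
getent hosts datanode01.my.domain  # resolution as seen through nsswitch.conf

and the equivalent for the other two hosts.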

 

I start the installation on namenode (launching cloudera-manager-installer.bin), then proceed in Cloudera Manager, find the nodes by their DNS names and basically accept the proposed options. At Step 4 (Cluster Installation) the task on namenode completes before the tasks on the datanodes and, as soon as it does, the processors on that node become really busy. They breathe occasionally, but stay essentially pegged; even typing a command is a pain (when at all possible). Meanwhile, the tasks for the datanodes complete. Only once did I have the heart to proceed to the next step and, after a few hours, the distribution tasks succeeded only on the datanodes. I think the namenode was too unresponsive to complete its task.
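
For reference, launching the installer on namenode was the standard procedure, more or less:

chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin
# then continue in the Cloudera Manager web UI at http://namenode.my.domain:7180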

The two processes that tie the processors up are:

 

/usr/lib/cmf/agent/build/env/bin/python /usr/lib/cmf/agent/src/cmf/agent.py --package_dir /usr/lib/cmf/service --agent_dir /var/run/cloudera-scm-agent --lib_dir /var/lib/cloudera-scm-agent --logfile /var/log/cloudera-scm-agent/cloudera-scm-agent.log

 

and

 

/usr/lib/jvm/java-7-oracle-cloudera/bin/java -cp .:lib/*:/usr/share/java/mysql-connector-java.jar:/usr/share/java/oracle-connector-java.jar -server -Dlog4j.configuration=file:/etc/cloudera-scm-server/log4j.properties -Dfile.encoding=UTF-8 -Dcmf.root.logger=INFO,LOGFILE -Dcmf.log.dir=/var/log/cloudera-scm-server -Dcmf.log.file=cloudera-scm-server.log -Dcmf.jetty.threshhold=WARN -Dcmf.schema.dir=/usr/share/cmf/schema -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Dpython.home=/usr/share/cmf/python -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled -XX:+UseParNewGC -XX:+HeapDumpOnOutOfMemoryError -Xmx2G -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -XX:OnOutOfMemoryError=kill -9 %p com.cloudera.server.cmf.Main
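
(I pulled these full command lines with something like the following; top shows the same two processes near 100%.)

ps -eo pid,pcpu,pmem,args --sort=-pcpu | head -n 5    # top CPU consumers with full arguments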

 

 

Outstanding entries in SCM Agent log

 

[20/Nov/2014 21:25:44 +0000] 6020 Monitor-HostMonitor throttling_logger ERROR    (298 skipped) Failed to collect NTP metrics
Traceback (most recent call last):
  File "/usr/lib/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 39, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._timeout)
  File "/usr/lib/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 32, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 40, in subprocess_with_timeout
    close_fds=True)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
[20/Nov/2014 21:27:27 +0000] 6020 Monitor-HostMonitor throttling_logger ERROR    (145 skipped) Timeout with args ['/usr/lib/jvm/java-7-oracle-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.2.0.jar', 'com.cloudera.cmon.agent.DnsTest']
None


[20/Nov/2014 21:27:28 +0000] 6020 Monitor-HostMonitor throttling_logger ERROR    (145 skipped) Failed to collect java-based DNS names
Traceback (most recent call last):
  File "/usr/lib/cmf/agent/src/cmf/monitor/host/dns_names.py", line 64, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._poll_timeout)
  File "/usr/lib/cmf/agent/src/cmf/monitor/host/dns_names.py", line 46, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 81, in subprocess_with_timeout
    raise Exception("timeout with args %s" % args)
Exception: timeout with args ['/usr/lib/jvm/java-7-oracle-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.2.0.jar', 'com.cloudera.cmon.agent.DnsTest']
[20/Nov/2014 21:33:34 +0000] 6020 Monitor-HostMonitor throttling_logger ERROR    (3 skipped) Kill subprocess exception with args ['/usr/lib/jvm/java-7-oracle-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.2.0.jar', 'com.cloudera.cmon.agent.DnsTest']
Traceback (most recent call last):
  File "/usr/lib/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 71, in subprocess_with_timeout
    os.kill(p.pid, signal.SIGTERM)
OSError: [Errno 3] No such process
[20/Nov/2014 21:45:04 +0000] 6020 Monitor-HostMonitor filesystem_map WARNING  Failed to join worker process collecting filesystem usage. All nodev filesystems will have unknown usage until the worker process is no longer active. Current nodev filesystems: /sys/fs/cgroup,/run,/run/lock,/run/shm,/run/user,/run/cloudera-scm-agent/process
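
The DnsTest timeouts could mean that Java-side name resolution is slow on namenode, or simply that the host is too busy to finish the check in time; the same invocation can be timed by hand (arguments copied from the log above):

time /usr/lib/jvm/java-7-oracle-cloudera/bin/java -classpath /usr/share/cmf/lib/agent-5.2.0.jar com.cloudera.cmon.agent.DnsTest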

 

 

I installed and configured ntpd (I admit I hadn't before), but nothing changed.
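
Concretely, that was more or less:

sudo apt-get install ntp     # provides ntpd plus the ntpq/ntpdc query tools (presumably what the "No such file or directory" above refers to)
sudo service ntp restart
ntpq -p                      # to check that peers are reachable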

 

 

Outstanding entries in SCM Server log

 

2014-11-20 22:05:02,769 INFO JvmPauseMonitor:com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC, or JVM not scheduled): paused approximately 1396ms: no GCs detected.

 

(there are hundreds of these)
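
In case they help with the diagnosis, these are the kinds of resource checks I can run on namenode while it is in this state:

free -m                     # memory and swap usage
vmstat 5 3                  # run queue length and si/so columns for swapping
top -b -n 1 | head -n 15    # snapshot of the busiest processes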

 

I found a post about a similar issue on an already-running system, but it is quite old and described a bug that had been identified and fixed (or was about to be, at the time).

 

Thank you for your help.

Best regards,

 

Stefano Altavilla
