Created on 01-27-2016 12:47 AM - edited 09-16-2022 03:00 AM
Hello,
I have followed the instructions described here https://s3.amazonaws.com/quickstart-reference/cloudera/hadoop/latest/doc/Cloudera_EDH_on_AWS.pdf
for deploying EDH on AWS. Things seem fine all the way until bootstrapping the cluster. I have the following cluster:
AMI : ami-30d9e02d
Cloudera Manager on d2.xlarge
2 masters (m4.2xlarge)
2 workers (d2.xlarge)
1 gateway (m4.2xlarge)
I used the AWS CloudFormation template and was able to connect to Cloudera Manager via the web console without problems. I deployed the cluster and all EC2 nodes are running with status checks OK (2/2), but the cluster fails at bootstrap.
I logged in to one of the masters and see the following at the bottom of /var/log/cloudera-scm-agent/cloudera-scm-agent.log:
[27/Jan/2016 03:16:44 +0000] 3069 MonitorDaemon-Reporter throttling_logger ERROR Error sending messages to firehose: MGMT-HOSTMONITOR-244944378552b77b5c898d702d752f7f
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/firehose.py", line 116, in _send
    self._port)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 469, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused
[27/Jan/2016 03:19:46 +0000] 3069 DnsResolutionMonitor throttling_logger ERROR Timeout with args ['/usr/java/jdk1.7.0_67-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.5.1.jar', 'com.cloudera.cmon.agent.DnsTest']
None
[27/Jan/2016 03:19:46 +0000] 3069 DnsResolutionMonitor throttling_logger ERROR Failed to run DnsTest.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 83, in collect_dns_metrics
    self._subprocess_with_timeout(args, self._poll_timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 55, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib64/cmf/agent/src/cmf/subprocess_timeout.py", line 94, in subprocess_with_timeout
    raise Exception("timeout with args %s" % args)
Exception: timeout with args ['/usr/java/jdk1.7.0_67-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.5.1.jar', 'com.cloudera.cmon.agent.DnsTest']
The same file on the gateway shows:
[27/Jan/2016 03:20:27 +0000] 3072 DnsResolutionMonitor throttling_logger ERROR Timeout with args ['/usr/java/jdk1.7.0_67-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.5.1.jar', 'com.cloudera.cmon.agent.DnsTest']
None
[27/Jan/2016 03:20:27 +0000] 3072 DnsResolutionMonitor throttling_logger ERROR Failed to run DnsTest.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 83, in collect_dns_metrics
    self._subprocess_with_timeout(args, self._poll_timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 55, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib64/cmf/agent/src/cmf/subprocess_timeout.py", line 94, in subprocess_with_timeout
    raise Exception("timeout with args %s" % args)
Exception: timeout with args ['/usr/java/jdk1.7.0_67-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.5.1.jar', 'com.cloudera.cmon.agent.DnsTest']
The same log file on the worker node has the following:
[27/Jan/2016 03:08:52 +0000] 3012 MainThread agent ERROR Failed to connect to previous supervisor.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 1635, in find_or_start_supervisor
    self.configure_supervisor_clients()
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 1882, in configure_supervisor_clients
    supervisor_options.realize(args=["-c", os.path.join(self.supervisor_dir, "supervisord.conf")])
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 1564, in realize
    Options.realize(self, *arg, **kw)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 311, in realize
    self.process_config()
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 319, in process_config
    self.process_config_file(do_usage)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 354, in process_config_file
    self.usage(str(msg))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/options.py", line 142, in usage
    self.exit(2)
SystemExit: 2
I'm pretty new to Cloudera and AWS so any insight is appreciated!
Created on 01-27-2016 04:10 AM - edited 01-27-2016 04:40 AM
> This indicates that the agent is unable to connect to the Host Monitor (HOSTMONITOR). Is the Host Monitor running?
[27/Jan/2016 03:16:44 +0000] 3069 MonitorDaemon-Reporter throttling_logger ERROR Error sending messages to firehose: MGMT-HOSTMONITOR-244944378552b77b5c898d702d752f7f
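One quick way to tell "the Host Monitor is not listening" apart from "something else is wrong" is to attempt a plain TCP connection from the agent host. The following is only a minimal sketch, not part of Cloudera Manager: the function name `can_connect` is made up for illustration, and the host and port must be the ones configured for the Host Monitor role in your deployment (9995 is commonly the default listening port, but treat that as an assumption and confirm it in Cloudera Manager).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds, else False.

    Hypothetical helper for illustration; it only checks reachability,
    not whether the service behind the port is healthy.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused" (errno 111), timeouts and DNS failures.
        return False
```

For example, `can_connect("<host-running-host-monitor>", 9995)` returning False from a master node would be consistent with the "[Errno 111] Connection refused" in the traceback, and would suggest checking whether the Host Monitor role has been started.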
> This indicates that the command timed out when attempting to run a DNS test [1]
[27/Jan/2016 03:19:46 +0000] 3069 DnsResolutionMonitor throttling_logger ERROR Timeout with args ['/usr/java/jdk1.7.0_67-cloudera/bin/java', '-classpath', '/usr/share/cmf/lib/agent-5.5.1.jar', 'com.cloudera.cmon.agent.DnsTest']
Example:
[bash]# /usr/java/jdk1.7.0_67-cloudera/bin/java -classpath /usr/share/cmf/lib/agent-5.5.1.jar com.cloudera.cmon.agent.DnsTest
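If running DnsTest by hand also hangs, it can help to check forward and reverse name resolution separately. The sketch below is an independent illustration and does not reproduce what `com.cloudera.cmon.agent.DnsTest` does internally; the function name `check_dns` and the shape of the returned dictionary are assumptions made for this example. It resolves the host's name to an IP, resolves that IP back to a name, and reports whether the two agree, which is the kind of consistency the networking documentation in [1] asks for.

```python
import socket


def check_dns(hostname=None, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Check that hostname -> IP -> hostname round-trips consistently.

    Hypothetical helper; `forward` and `reverse` default to the standard
    library resolvers and can be swapped out for testing.
    """
    hostname = hostname or socket.getfqdn()
    result = {"hostname": hostname, "ip": None, "reverse_name": None, "consistent": False}
    try:
        ip = forward(hostname)              # forward lookup: name -> IPv4 address
        result["ip"] = ip
        reverse_name = reverse(ip)[0]       # reverse lookup: address -> primary name
        result["reverse_name"] = reverse_name
    except OSError as exc:
        # gaierror and herror are both subclasses of OSError.
        result["error"] = str(exc)
        return result
    # Compare case-insensitively and ignore a trailing dot.
    result["consistent"] = reverse_name.lower().rstrip(".") == hostname.lower().rstrip(".")
    return result
```

Calling `check_dns()` with no arguments on each node uses that node's own fully qualified name. A result with `"consistent": False`, or one that contains an `"error"` key, points toward a name resolution problem in the VPC or in /etc/hosts rather than in the agent itself.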
> This indicates that supervisord [2] was not running at that time
[27/Jan/2016 03:08:52 +0000] 3012 MainThread agent ERROR Failed to connect to previous supervisor.
[1] http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_networknames_configure.html
[2] "Agent Process Supervision in Detail" http://blog.cloudera.com/blog/2013/07/how-does-cloudera-manager-work/
Created 01-27-2016 02:12 AM
I also tried deploying the cluster from the command line (cloudera-director) using the aws.reference.conf file. The job stops with the following output on the terminal:
* Creating Sentry Database ... done
* Waiting for firstRun on cluster C5-Reference-AWS ... done
* Cloudera Manager 'First Run' command execution failed: Failed to perform First Run of services. ...
Then I tried to check the status:
[ec2-user@ip-10-0-2-205 setup-default]$ cloudera-director status aws.reference.conf
Process logs can be found at /home/ec2-user/.cloudera-director/logs/application.log
Plugins will be loaded from /var/lib/cloudera-director-plugins
Cloudera Director 2.0.0 initializing ...
Unexpected internal error (see logs): Cluster C5-Reference-AWS is in stage BOOTSTRAP_FAILED. See cluster status and server logs for details.
Any help is appreciated!