Created on 08-03-2014 10:57 AM - edited 09-16-2022 02:04 AM
I am running out of things to try to finalize the upgrade of the Cloudera Manager agents. I finally gave up trying to fix it and reinstalled the whole Cloudera Manager service, to no avail.
$ rpm -qa 'cloudera-*'
cloudera-manager-agent-5.1.1-1.cm511.p0.82.el6.x86_64
cloudera-manager-server-db-2-5.1.1-1.cm511.p0.82.el6.x86_64
cloudera-manager-repository-5.0-1.noarch #Does this alarm anyone? I uninstalled the repository and reinstalled hoping to get 5.1 version repo but it installed this one again.
cloudera-manager-server-5.1.1-1.cm511.p0.82.el6.x86_64
cloudera-manager-daemons-5.1.1-1.cm511.p0.82.el6.x86_64
$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
The error happens once I log back into Cloudera Manager to upgrade the CM agents. It fails within a couple of seconds:
Detecting Cloudera Manager Server...
I tried to replicate that part of the code on the terminal:
$ python -c 'import socket; import sys; s = socket.socket(socket.AF_INET); s.settimeout(5.0); s.connect(("hadoop-test.in.wellcentive.com", int(7182))); s.close();'
# The line above returns nothing, which indicates success. To confirm, the test below uses a fake hostname to see whether it produces an error.
$ python -c 'import socket; import sys; s = socket.socket(socket.AF_INET); s.settimeout(5.0); s.connect(("fakehadoop-test.in.wellcentive.com", int(7182))); s.close();'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in connect
socket.gaierror: [Errno -2] Name or service not known
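For anyone repeating this test, here is a more readable sketch of the same connectivity check, written for Python 3 rather than the Python 2.6 used above. The function name check_port and its return labels are my own, not anything from Cloudera Manager. It separates a name that does not resolve from a port that refuses the connection, which the one-liner only shows through the traceback.

```python
import socket


def check_port(host, port, timeout=5.0):
    """Try a TCP connect and return a short label describing the outcome.

    check_port and its labels are hypothetical helpers for illustration only.
    """
    try:
        # create_connection resolves the name and connects in one step.
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.gaierror:
        return "dns"        # the name did not resolve
    except ConnectionRefusedError:
        return "refused"    # the name resolved, but nothing is listening
    except socket.timeout:
        return "timeout"    # no answer within the timeout
    except OSError as exc:
        return "error: %s" % exc
    conn.close()
    return "ok"
```

Calling check_port("hadoop-test.in.wellcentive.com", 7182) would be the equivalent of the first command above. Note that a successful result only proves that the name you typed resolves and accepts connections, not that the installer is using the same name.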
This is what I see in the logs:
Nothing is revealed in the cloudera-scm-server.log:
2014-08-03 11:41:52,134 INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from EXECUTE_SCRIPT (PT1.043S) to SCRIPT_START
2014-08-03 11:41:52,134 INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from SCRIPT_START (PT0S) to TAKE_LOCK
2014-08-03 11:41:52,135 INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from TAKE_LOCK (PT0.001S) to DETECT_ROOT
2014-08-03 11:41:52,135 INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from DETECT_ROOT (PT0S) to DETECT_DISTRO
2014-08-03 11:41:52,135 INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from DETECT_DISTRO (PT0S) to DETECT_SCM
2014-08-03 11:41:52,135 INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@499] hadoop-test.in.wellcentive.com: New state is a backward state. Storing failed state
2014-08-03 11:41:52,135 INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from DETECT_SCM (PT0S) to WAITING_FOR_ROLLBACK
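To pull the failing step out of a longer server log, something like the sketch below may help. It assumes that a transition into WAITING_FOR_ROLLBACK marks the failure, which is only inferred from the lines above. The function name failed_state is hypothetical.

```python
import re

# Matches lines such as "Transitioning from DETECT_SCM (PT0S) to WAITING_FOR_ROLLBACK".
TRANSITION = re.compile(r"Transitioning from (\w+) \((PT[\d.]+S)\) to (\w+)")


def failed_state(log_lines):
    """Return the state the installer left when it began rolling back, or None.

    Assumption: entering WAITING_FOR_ROLLBACK means the previous state failed.
    """
    for line in log_lines:
        match = TRANSITION.search(line)
        if match and match.group(3) == "WAITING_FOR_ROLLBACK":
            return match.group(1)
    return None
```

On the excerpt above this points at DETECT_SCM, the same step the UI reports as "Detecting Cloudera Manager Server...".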
cloudera-scm-agent.log shows a similar error to the UI output:
[02/Aug/2014 23:55:00 +0000] 1718 MonitorDaemon-Reporter throttling_logger ERROR Error sending messages to firehose: mgmt-HOSTMONITOR-f84ed02fa45233b5b3c7d24e567ca229
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/firehose.py", line 71, in _send
    self._port)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 464, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused
[03/Aug/2014 00:06:00 +0000] 1718 MonitorDaemon-Reporter throttling_logger ERROR (10 skipped) Error sending messages to firehose: mgmt-HOSTMONITOR-f84ed02fa45233b5b3c7d24e567ca229
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/firehose.py", line 71, in _send
    self._port)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 464, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused
1. Host File
$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.1.1.191 hadoop-test.in.wellcentive.com hadoop-test
2. Host Answer
[geovanie.marquez@hadoop-test ~]$ host -v -t A `hostname`
Trying "hadoop-test.in.wellcentive.com"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36034
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 3
;; QUESTION SECTION:
;hadoop-test.in.wellcentive.com. IN A
;; ANSWER SECTION:
hadoop-test.in.wellcentive.com. 3600 IN A 10.1.1.191
;; AUTHORITY SECTION:
in.wellcentive.com. 3600 IN NS x.in.wellcentive.com.
in.wellcentive.com. 3600 IN NS y.in.wellcentive.com.
in.wellcentive.com. 3600 IN NS dz.in.wellcentive.com.
;; ADDITIONAL SECTION:
x.in.wellcentive.com. 3600 IN A 10.1.1.xxx
y.in.wellcentive.com. 3600 IN A 192.168.xxx.xx
z.in.wellcentive.com. 3600 IN A 10.1.1.xxx
Received 171 bytes from 10.1.1.xxx#53 in 0 ms
Any Ideas?
Created on 08-05-2014 07:18 AM - edited 08-05-2014 07:25 AM
The problem was in the following call (found in the error log of the installation UI; see the original question):
python -c 'import socket; import sys; s = socket.socket(socket.AF_INET); s.settimeout(5.0); s.connect((sys.argv[1], int(sys.argv[2]))); s.close();' hadooop-test.in.wellcentive.com 7182
It was calling hadooop (three o's) instead of the server's actual name, hadoop (two o's).
I checked with my systems team, and there was a duplicate entry in the DNS with the three o's. Once that was fixed, the problem went away.
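A stray DNS entry like this can sometimes be spotted by resolving the hostname to its IP and then asking which names that IP maps back to. The sketch below assumes the extra name is visible through a reverse lookup, which this thread does not confirm. The function unexpected_names is hypothetical, and the resolver arguments exist only so the logic can be exercised without a live DNS server.

```python
import socket


def unexpected_names(hostname, forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Return names the host's IP maps back to that are not the expected ones.

    Hypothetical helper: the expected names are the given FQDN and its
    short form (the first label).
    """
    fqdn = hostname.rstrip(".").lower()
    expected = {fqdn, fqdn.split(".")[0]}
    ip = forward(hostname)                 # name -> IP
    primary, aliases, _ = reverse(ip)      # IP -> names
    names = [primary] + list(aliases)
    return [n for n in names if n.rstrip(".").lower() not in expected]
```

With the record described above in place, unexpected_names("hadoop-test.in.wellcentive.com") might have listed the three-o name. An empty list means the forward and reverse lookups agree.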
Created 08-03-2014 11:22 AM
arrrgh I am looking into why the error from CM has hadooop instead of hadoop... (blushing)
Created on 08-05-2014 05:44 AM - edited 08-05-2014 05:47 AM
1) Disable IPv6, firewalls, and SELinux, and make sure DNS lookup works properly
2) Ensure you have sufficient RAM and cores on the CM machines
3) Use the proper version of Java for your CM
4) Check that CM and the DB are well connected
Please provide all the above details.
Just increase Java memory for CM:
$ sudo vi /etc/default/cloudera-scm-server
Created 08-05-2014 06:16 AM
When following our installation documentation, what step did you reach before discovering things were failing? The requirements section of the installation guide is important to review, especially the networking & security section.
You should not be laying down all the RPMs, and the repo should not be installed manually.
Generally speaking, you install one RPM package:
yum install cloudera-manager-server-db-2
It will install its dependencies.
From there you start the embedded DB (service cloudera-scm-server-db start).
Once it completes self-configuration of the DB, you should be able to start the CM server:
service cloudera-scm-server start
Once you see "Jetty Started" in the logging under /var/log/cloudera-scm-server/cloudera-scm-server.log, you can connect to the server.
From there it will guide you through adding hosts, which will install the agent and the required JDK. Once that is complete you can start parcel deployment of CM.
The "Administration Guide" has the proper steps to "Uninstall", specifically what to do after a failed install attempt. You need to go back through that and verify you cleaned up properly before attempting a re-install.
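If you would rather check for the startup message from a script than watch the log by eye, a minimal sketch follows. The helper jetty_started is hypothetical, and the marker defaults to the "Jetty Started" wording quoted above. The exact text in your log may differ, so treat the marker as a parameter to adjust.

```python
def jetty_started(log_path="/var/log/cloudera-scm-server/cloudera-scm-server.log",
                  marker="Jetty Started"):
    """Return True if any line of the server log contains the marker text.

    Hypothetical helper; the default marker is taken from the post above and
    may not match the wording in every CM version.
    """
    with open(log_path, errors="replace") as log:
        return any(marker in line for line in log)
```

This reads the whole file once, so after a restart it can also match a message left over from an earlier start. Checking only the lines written since the restart would be more reliable.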
Todd
Created 08-05-2014 07:16 AM
This was an upgrade, not a first-time install, and I followed the upgrade instructions here: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Administr...
It was a problem with my DNS setup.