Cloudera Manager Server - "Name or service not known"

Explorer

I am running out of things to try to finalize the upgrade of the Cloudera Manager agents. I finally gave up trying to fix it and reinstalled the whole Cloudera Manager service, to no avail.

 

$ rpm -qa 'cloudera-*'

cloudera-manager-agent-5.1.1-1.cm511.p0.82.el6.x86_64

cloudera-manager-server-db-2-5.1.1-1.cm511.p0.82.el6.x86_64

cloudera-manager-repository-5.0-1.noarch  # Does this alarm anyone? I uninstalled the repository package and reinstalled it, hoping to get the 5.1 repo, but it installed this one again.

cloudera-manager-server-5.1.1-1.cm511.p0.82.el6.x86_64

cloudera-manager-daemons-5.1.1-1.cm511.p0.82.el6.x86_64

 

$ java -version

java version "1.7.0_51"

Java(TM) SE Runtime Environment (build 1.7.0_51-b13)

Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

 

The error happens once I log back into Cloudera Manager to upgrade the CM agents. It fails quickly, within a couple of seconds:

Detecting Cloudera Manager Server...

BEGIN host -t PTR 10.1.1.191 
191.1.1.10.in-addr.arpa domain name pointer hadooop-test.in.wellcentive.com. 
END (0) 
using hadooop-test.in.wellcentive.com as scm server hostname 
BEGIN which python 
/usr/bin/python 
END (0) 
BEGIN python -c 'import socket; import sys; s = socket.socket(socket.AF_INET); s.settimeout(5.0); s.connect((sys.argv[1], int(sys.argv[2]))); s.close();' hadooop-test.in.wellcentive.com 7182 
Traceback (most recent call last): 
File "<string>", line 1, in <module> 
File "<string>", line 1, in connect 
socket.gaierror: [Errno -2] Name or service not known 
END (1) 
could not contact scm server at hadooop-test.in.wellcentive.com:7182, giving up 
waiting for rollback request 

 

I tried to replicate that part of the code on the terminal:

$ python -c 'import socket; import sys; s = socket.socket(socket.AF_INET); s.settimeout(5.0); s.connect(("hadoop-test.in.wellcentive.com", int(7182))); s.close();'

# The line above returns nothing, which indicates success. The test below, where I substitute a fake hostname, confirms this.

 

The following is a test to see whether I could get it to produce an error:

$ python -c 'import socket; import sys; s = socket.socket(socket.AF_INET); s.settimeout(5.0); s.connect(("fakehadoop-test.in.wellcentive.com", int(7182))); s.close();'

Traceback (most recent call last):

  File "<string>", line 1, in <module>

  File "<string>", line 1, in connect

socket.gaierror: [Errno -2] Name or service not known
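When reproducing the installer's check by hand, it helps to pass the hostname exactly as the installer log printed it (the installer supplies it through sys.argv) rather than retyping it, and to tell a name-resolution failure apart from a refused connection. The following is only a minimal sketch, assuming Python 3; the function name `probe` and its injectable `connect` parameter are hypothetical and are not part of Cloudera Manager.

```python
import socket


def probe(host, port, timeout=5.0, connect=socket.create_connection):
    """Try a TCP connection and report which stage failed.

    `connect` is injectable only so the function can be exercised without
    a network; by default it is socket.create_connection.
    """
    try:
        conn = connect((host, port), timeout)
    except socket.gaierror as exc:
        # Name resolution failed, e.g. "[Errno -2] Name or service not known".
        return "resolve-failed: %s" % exc
    except socket.timeout:
        # The name resolved, but nothing answered within `timeout` seconds.
        return "timed-out"
    except OSError as exc:
        # The name resolved, but the TCP connect itself failed,
        # e.g. "[Errno 111] Connection refused".
        return "connect-failed: %s" % exc
    conn.close()
    return "ok"
```

Calling `probe(name, 7182)` with the name copied from the installer log, and again with the name you expect the server to have, should show whether the two behave differently.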

 

This is what I see in the logs:

 

Nothing useful shows up in cloudera-scm-server.log:

2014-08-03 11:41:52,134  INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from EXECUTE_SCRIPT (PT1.043S) to SCRIPT_START

2014-08-03 11:41:52,134  INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from SCRIPT_START (PT0S) to TAKE_LOCK

2014-08-03 11:41:52,135  INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from TAKE_LOCK (PT0.001S) to DETECT_ROOT

2014-08-03 11:41:52,135  INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from DETECT_ROOT (PT0S) to DETECT_DISTRO

2014-08-03 11:41:52,135  INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from DETECT_DISTRO (PT0S) to DETECT_SCM

2014-08-03 11:41:52,135  INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@499] hadoop-test.in.wellcentive.com: New state is a backward state. Storing failed state

2014-08-03 11:41:52,135  INFO [NodeConfiguratorThread-0-0:node.NodeConfiguratorProgress@534] hadoop-test.in.wellcentive.com: Transitioning from DETECT_SCM (PT0S) to WAITING_FOR_ROLLBACK

 

cloudera-scm-agent.log shows an error similar to the UI output:

[02/Aug/2014 23:55:00 +0000] 1718 MonitorDaemon-Reporter throttling_logger ERROR    Error sending messages to firehose: mgmt-HOSTMONITOR-f84ed02fa45233b5b3c7d24e567ca229

Traceback (most recent call last):

  File "/usr/lib64/cmf/agent/src/cmf/monitor/firehose.py", line 71, in _send

    self._port)

  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 464, in __init__

    self.conn.connect()

  File "/usr/lib64/python2.6/httplib.py", line 720, in connect

    self.timeout)

  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection

    raise error, msg

error: [Errno 111] Connection refused

[03/Aug/2014 00:06:00 +0000] 1718 MonitorDaemon-Reporter throttling_logger ERROR    (10 skipped) Error sending messages to firehose: mgmt-HOSTMONITOR-f84ed02fa45233b5b3c7d24e567ca229

Traceback (most recent call last):

  File "/usr/lib64/cmf/agent/src/cmf/monitor/firehose.py", line 71, in _send

    self._port)

  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 464, in __init__

    self.conn.connect()

  File "/usr/lib64/python2.6/httplib.py", line 720, in connect

    self.timeout)

  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection

    raise error, msg

error: [Errno 111] Connection refused

 

1. Host File

$ cat /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

 

10.1.1.191 hadoop-test.in.wellcentive.com hadoop-test

 

2. Host Answer

[geovanie.marquez@hadoop-test ~]$ host -v -t A `hostname`

Trying "hadoop-test.in.wellcentive.com"

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36034

;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 3

 

;; QUESTION SECTION:

;hadoop-test.in.wellcentive.com. IN A

 

;; ANSWER SECTION:

hadoop-test.in.wellcentive.com. 3600 IN A 10.1.1.191

 

;; AUTHORITY SECTION:

in.wellcentive.com. 3600 IN NS x.in.wellcentive.com.

in.wellcentive.com. 3600 IN NS y.in.wellcentive.com.

in.wellcentive.com. 3600 IN NS dz.in.wellcentive.com.

 

;; ADDITIONAL SECTION:

x.in.wellcentive.com. 3600 IN A 10.1.1.xxx

y.in.wellcentive.com. 3600 IN A 192.168.xxx.xx

z.in.wellcentive.com. 3600 IN A 10.1.1.xxx

 

Received 171 bytes from 10.1.1.xxx#53 in 0 ms

 

 

Any Ideas?

 

1 ACCEPTED SOLUTION

Explorer

The problem was that the following call (found in the error log of the installation UI; see the original question):

 

python -c 'import socket; import sys; s = socket.socket(socket.AF_INET); s.settimeout(5.0); s.connect((sys.argv[1], int(sys.argv[2]))); s.close();' hadooop-test.in.wellcentive.com 7182 

 

was calling hadooop (three o's) instead of the actual name of the server, hadoop (two o's).

 

I checked with my systems team and there was a duplicate entry in the DNS with the three o's. That was the problem, and it is now fixed.
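A check along these lines might catch this kind of mismatch earlier. It is only a minimal sketch, assuming Python 3 and assuming you have already captured the text printed by `host -t PTR <ip>`; the helper names `ptr_names` and `unexpected_ptr_names` are hypothetical. It parses the saved output instead of calling the system resolver because the resolver may answer from /etc/hosts and hide a bad DNS record.

```python
def ptr_names(host_output):
    """Return the names found in `host -t PTR <ip>` output, without the trailing dot."""
    marker = "domain name pointer"
    names = []
    for line in host_output.splitlines():
        if marker in line:
            # Everything after the marker is the pointer target.
            names.append(line.split(marker, 1)[1].strip().rstrip("."))
    return names


def unexpected_ptr_names(host_output, expected_fqdn):
    """Return PTR names that differ from the expected FQDN (case-insensitive)."""
    expected = expected_fqdn.lower()
    return [name for name in ptr_names(host_output) if name.lower() != expected]


# Sample line copied from the installer output in the original question.
sample = ("191.1.1.10.in-addr.arpa domain name pointer "
          "hadooop-test.in.wellcentive.com. ")
print(unexpected_ptr_names(sample, "hadoop-test.in.wellcentive.com"))
# prints ['hadooop-test.in.wellcentive.com']
```

An empty list means every PTR record matches the name the server is supposed to have; anything else is worth raising with whoever manages the DNS zone.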


5 REPLIES

Explorer

arrrgh I am looking into why the error from CM has hadooop instead of hadoop... (blushing)

Explorer

1) Disable IPv6, firewalls, and SELinux, and make sure DNS lookup works properly

2) Ensure you have sufficient RAM and cores on the CM machines

3) Use the proper version of Java for your CM

4) Are CM and the DB connected properly?

 

Please provide all of the above details.

 

Also, try increasing the Java memory for CM:

$ sudo vi /etc/default/cloudera-scm-server

Master Collaborator

When following our installation documentation, what step did you reach before discovering things were failing? The requirements section of the installation guide is important to review, especially the networking and security section.

 

http://docs.cloudera.com

 

You should not be laying down all the RPMs, and the repo should not be installed manually.

 

Generally speaking, you install one RPM package:

 

yum install cloudera-manager-server-db-2

 

It will install its dependencies.

 

From there you start the embedded DB (service cloudera-scm-server-db start)

 

Once it completes self-configuration of the DB, you should be able to start the CM server:

 

service cloudera-scm-server start

 

Once you see "Jetty Started" in the log at /var/log/cloudera-scm-server/cloudera-scm-server.log, you can connect to the server:

 

http://your.host.fqdn:7180

 

From there it will guide you through adding hosts, which will install the agent and the required JDK. Once that is complete, you can start parcel deployment of CDH.

 

The "Administration Guide" has the proper steps to "Uninstall", specifically what to do in a failed install attempt.  You need to go back through that and verify you cleaned up properly before attempting re-install.

 

Todd

 

Explorer

This was an upgrade, not a first-time install, and I followed the upgrade instructions here: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Administr...

 

It was a problem with my DNS setup.
