
Cluster installation failing (Heartbeat)

New Contributor

I have three CentOS 7 VMs running on a XenServer host. All three fail at the "Add Cluster - Installation" step with the error below.

 

The VMs:

master.mycluster.com

worker.mycluster.com

utility.mycluster.com

 

Cloudera Manager is running on the utility node.

 

Below is the agent log showing the heartbeat failure:

 

[04/Nov/2018 21:34:03 +0000] 10805 MainThread agent        ERROR    Heartbeating to utility.mycluster.com:7182 failed.
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/agent.py", line 1371, in _send_heartbeat
    response = self.requestor.request('heartbeat', heartbeat_data)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 141, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 254, in issue_request
    call_response = self.transceiver.transceive(call_request)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 483, in transceive
    result = self.read_framed_message()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 489, in read_framed_message
    framed_message = response_reader.read_framed_message()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 417, in read_framed_message
    raise ConnectionClosedException("Reader read 0 bytes.")
ConnectionClosedException: Reader read 0 bytes.

 

I have been following the Cloudera installation documentation for CM/CDH 6.0.1.

 

The /etc/hosts file from each VM is pasted below:

 

worker.mycluster.com:

#127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.36 utility.mycluster.com utility
192.168.0.35 worker.mycluster.com worker
192.168.0.34 master.mycluster.com master

 

 

utility.mycluster.com:

# Note: localhost is not commented out here. The Postgres service fails to start if it is.
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.36 utility.mycluster.com utility
192.168.0.35 worker.mycluster.com worker
192.168.0.34 master.mycluster.com master

 

master.mycluster.com:

#127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1       localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.36 utility.mycluster.com utility
192.168.0.35 worker.mycluster.com worker
192.168.0.34 master.mycluster.com master
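A minimal Python 3 sketch for sanity-checking these files on each node (standard library only; `parse_hosts`, `is_ip` and `check_expected` are hypothetical helper names, not part of any Cloudera tooling). It maps each hostname to its address and flags any line whose hostname columns contain an IP address, which is what happens if several entries end up run together on one line.

```python
import ipaddress


def is_ip(token):
    """True if token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
    except ValueError:
        return False
    return True


def parse_hosts(text):
    """Parse /etc/hosts-style text (hypothetical helper).

    Returns (mapping, problems): mapping is {hostname: ip}, keeping the first
    address seen for each name; problems lists (line_number, line) for lines
    whose hostname columns contain an IP address.
    """
    mapping = {}
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments; skip blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            if is_ip(name):
                problems.append((lineno, raw))
                break  # the rest of this line cannot be trusted
            mapping.setdefault(name, ip)
    return mapping, problems


def check_expected(mapping, expected):
    """Return {name: (expected_ip, actual_ip_or_None)} for every mismatch."""
    return {
        name: (want, mapping.get(name))
        for name, want in expected.items()
        if mapping.get(name) != want
    }
```

For example, read /etc/hosts on a node, pass it to `parse_hosts`, then call `check_expected(mapping, {"master.mycluster.com": "192.168.0.34", "worker.mycluster.com": "192.168.0.35", "utility.mycluster.com": "192.168.0.36"})`; an empty dict and an empty `problems` list mean the file matches the addresses in this post.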

 

 

I then installed BIND on worker.mycluster.com only, to see whether it would fix the heartbeat issue for that one node, but it did not: the error is the same. The output below shows that DNS is set up correctly for worker.mycluster.com:

>>> hostname -f
worker.mycluster.com

>>> host -v -t A $(hostname)
Trying "worker.mycluster.com"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8235
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;worker.mycluster.com.            IN    A 

;; ANSWER SECTION:
;worker.mycluster.com.    604800  IN    A    192.168.0.35

;; AUTHORITY SECTION:
;mycluster.com.    604800  IN    NS    worker.mycluster.com.

>>> nslookup utility.mycluster.com #Forward
server:     127.0.0.1
address:  127.0.0.1#53

Name: utility.mycluster.com
Address: 192.168.0.36

>>> nslookup 192.168.0.36 #Reverse
server:     127.0.0.1
address:  127.0.0.1#53

36.0.168.192.in-addr.arpa name = utility.mycluster.com.
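The forward and reverse checks above can be scripted so the same test runs identically on every node. A minimal Python 3 sketch, standard library only; `check_forward_reverse` is a hypothetical helper, not part of Cloudera Manager. It uses the system resolver, so it consults /etc/hosts as well as DNS according to /etc/nsswitch.conf, whereas `nslookup` and `host` query DNS servers directly; the system resolver is likely closer to how the agent itself resolves names.

```python
import socket


def check_forward_reverse(fqdn, forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve fqdn -> IP -> name and report whether the round trip agrees.

    `forward` and `reverse` default to the system resolver; they are
    parameters only so the logic can be exercised without real DNS.
    Returns (ok, detail).
    """
    try:
        ip = forward(fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup failed: {exc}"
    try:
        name, _aliases, _addrs = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {ip} failed: {exc}"
    # Compare the canonical (primary) name, ignoring case and a trailing dot.
    if name.rstrip(".").lower() == fqdn.rstrip(".").lower():
        return True, f"{fqdn} -> {ip} -> {name}"
    return False, f"{fqdn} -> {ip} -> {name} (mismatch)"
```

For example, on each node run `check_forward_reverse(name)` for master.mycluster.com, worker.mycluster.com and utility.mycluster.com; a `False` result points at the name that does not round-trip from that node.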

 

2 replies

New Contributor

Hi,

It looks more like a networking problem.

Can you reach utility.mycluster.com:7182 from all the servers?

Have you tried disabling the firewall on the servers?

Regards.

New Contributor

Hello,

 

The firewalld service is stopped and disabled on all servers.

 

From worker.mycluster.com, both of the following connect successfully:

telnet 192.168.0.36 7182
telnet utility.mycluster.com 7182
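The reply asked whether port 7182 is reachable from all servers, while the telnet test was run from worker only. A minimal Python 3 sketch that repeats the same TCP check and can be run on each node (standard library only; `can_connect` is a hypothetical helper). Like telnet, a successful connect only shows that the port accepts connections; it says nothing about whether the service behind it answers the heartbeat correctly.

```python
import socket


def can_connect(host, port=7182, timeout=5.0):
    """Try a TCP connection to host:port, roughly what `telnet host port` checks.

    Returns (True, None) on success or (False, reason) on failure.
    """
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:  # refused, unreachable, timed out, or DNS failure
        return False, str(exc)
```

For example, on each of master, worker and utility, call `can_connect("utility.mycluster.com")` and `can_connect("192.168.0.36")`, mirroring the two telnet commands.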
