Created 09-20-2018 02:02 AM
We are trying to install Cloudera Manager and CDH on our cluster, but we are running into errors.
The agent's error log shows:
>>[20/Sep/2018 10:25:06 +0000] 9377 MainThread agent ERROR Heartbeating to node001:7182 failed.
>>Traceback (most recent call last):
>> File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/agent.py", line 1371, in _send_heartbeat
>> response = self.requestor.request('heartbeat', heartbeat_data)
>> File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 141, in request
>> return self.issue_request(call_request, message_name, request_datum)
>> File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 254, in issue_request
>> call_response = self.transceiver.transceive(call_request)
>> File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 483, in transceive
>> result = self.read_framed_message()
>> File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 491, in read_framed_message
>> framed_message = response_reader.read_framed_message()
>> File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 411, in read_framed_message
>> buffer_length = self._read_buffer_length()
>> File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 424, in _read_buffer_length
>> raise ConnectionClosedException("Reader read 0 bytes.")
>>ConnectionClosedException: Reader read 0 bytes.
I have checked /etc/hosts and everything else that was suggested in similar cases. Nothing helped; I still get no heartbeat from the nodes.
Do you have any clue what I could do next?
Thanks.
Created 09-20-2018 02:24 AM
Have you verified with telnet that port 7182 is open?
Created 09-20-2018 02:31 AM
Yes, it is open:
netstat -taupen | grep 7182
tcp 0 0 0.0.0.0:7182 0.0.0.0:* LISTEN 899 19129 1038/java
I can connect with telnet:
telnet node001 7182
Trying 192.168.193.1...
Connected to node001.
Escape character is '^]'.
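For anyone who wants to repeat the telnet test across many nodes, here is a minimal Python sketch of the same check. The function name `port_is_open` is made up for illustration. Note that it only shows that a TCP connection can be established; it says nothing about whether the service behind the port accepts the heartbeat.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name resolution failures
        return False


# Example call (host and port taken from the thread):
# print(port_is_open("node001", 7182))
```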
Created 09-20-2018 02:43 AM
Could you go through the troubleshooting steps in the link below and see whether anything needs fixing on your end?
Created 09-20-2018 03:43 AM
I already followed these instructions:
1. IP Address misconfiguration:
$ ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.193.1 netmask 255.255.248.0 broadcast 192.168.199.255
inet6 fe80::a6bf:1ff:fe06:6539 prefixlen 64 scopeid 0x20<link>
ether a4:bf:01:06:65:39 txqueuelen 1000 (Ethernet)
RX packets 420251 bytes 120871170 (115.2 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 501994 bytes 246473312 (235.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0x95920000-9593ffff
eth1: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether a4:bf:01:06:65:3a txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0x95900000-9591ffff
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
inet 192.168.209.1 netmask 255.255.248.0 broadcast 192.168.215.255
inet6 fe80::211:7501:178:fc93 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 7788 bytes 1968252 (1.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 14515 bytes 2118284 (2.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Lokale Schleife)
RX packets 569168 bytes 252684382 (240.9 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 569168 bytes 252684382 (240.9 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
2. Firewalls are disabled
iptables --list
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
3. DNS is misconfigured
$ nslookup node001
Server: 192.168.192.1
Address: 192.168.192.1#53
Name: node001.ara
Address: 192.168.193.1
Created 09-20-2018 09:44 AM
Since the basics are covered, I'll say that the stack trace you provided looks pretty odd and indicates that the agent was reading a reply from Cloudera Manager but before it could complete, the connection went away...
We see in the code that this means that no bytes were received:
421 def _read_buffer_length(self):
422 read = self.reader.read(BUFFER_HEADER_LENGTH)
423 if read == '':
424 raise ConnectionClosedException("Reader read 0 bytes.")
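To illustrate why a successful telnet connection and this exception can coexist, here is a small self-contained sketch (not the agent's code). A local server accepts the connection and closes it right away without sending anything; the client's read then returns an empty result, which is the condition checked on line 423. The function name and the 4-byte read size are assumptions for the demo. On Python 3 the empty result is `b""`, while the Python 2.7 agent sees `''`.

```python
import socket
import threading


def read_after_peer_close():
    """Connect to a local server that closes immediately; return what recv() yields."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def accept_and_close():
        conn, _ = server.accept()
        conn.close()  # close without sending a single byte

    worker = threading.Thread(target=accept_and_close)
    worker.start()
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    try:
        worker.join()  # make sure the server side has closed
        data = client.recv(4)  # an orderly close shows up as an empty read
    finally:
        client.close()
        server.close()
    return data
```

So the TCP handshake works, but the peer drops the connection before answering, which is why the server side is the next place to look.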
This does not appear to be a TCP problem, so I would assert that we will likely find some more information on the Cloudera Manager side.
Please check:
/var/log/cloudera-scm-server/cloudera-scm-server.log
See if you find messages regarding that host or regarding a problem processing heartbeats.
Since this is a "clean" problem, I suspect that Cloudera Manager may not be accepting the heartbeat and should hopefully tell you why.
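If the log is large, a rough filter like the following can help narrow it down. This is only a sketch: the function name is made up, and it assumes the log level is the third whitespace-separated field, as in typical Cloudera Manager server log lines (date, time, level, then the rest).

```python
def interesting_lines(lines, host):
    """Return lines that mention the host or are logged at WARN/ERROR level."""
    hits = []
    for line in lines:
        parts = line.split(None, 3)  # date, time, level, rest
        level = parts[2] if len(parts) > 2 else ""
        if host in line or level in ("WARN", "ERROR"):
            hits.append(line.rstrip("\n"))
    return hits


# Example call (path from this post, host name from the thread):
# with open("/var/log/cloudera-scm-server/cloudera-scm-server.log") as f:
#     for hit in interesting_lines(f, "node001"):
#         print(hit)
```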
*** NOTE: If the Cloudera Manager log doesn't show any information, check which server is listening on port 7182, just to make sure it really is CM:
# netstat -nap |grep 7182 |grep LISTEN
(note the pid)
# ps aux |grep <pid>
For example:
# netstat -nap |grep 7182|grep LISTEN
tcp 0 0 0.0.0.0:7182 0.0.0.0:* LISTEN 28669/java
# ps aux |grep 28669 |grep "cmf.Main"
This should return one result which is the CM process.
Created 09-21-2018 01:00 AM
These are the last bits of the server log
2018-09-21 09:53:47,602 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Reaped total of 0 deleted commands
2018-09-21 09:53:47,604 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Found no commands older than 2016-09-21T07:53:47.603Z to reap.
2018-09-21 09:53:47,604 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Wizard is active, not reaping scanners or configurators
2018-09-21 09:54:04,321 INFO avro-servlet-hb-processor-13:com.cloudera.server.common.AgentAvroServlet: (11 skipped) AgentAvroServlet: heartbeat processing stats: average=0ms, min=0ms, max=16ms.
2018-09-21 09:54:50,235 INFO ScmActive-0:com.cloudera.server.cmf.components.ScmActive: (119 skipped) ScmActive completed successfully.
2018-09-21 09:55:04,366 INFO avro-servlet-hb-processor-1:com.cloudera.server.common.AgentAvroServlet: (11 skipped) AgentAvroServlet: heartbeat processing stats: average=0ms, min=0ms, max=16ms.
2018-09-21 09:55:19,306 INFO agentServer-316:com.cloudera.server.common.MonitoringThreadPool: agentServer: execution stats: average=1125ms, min=0ms, max=5012ms.
2018-09-21 09:55:19,307 INFO agentServer-316:com.cloudera.server.common.MonitoringThreadPool: agentServer: waiting in queue stats: average=0ms, min=0ms, max=8ms.
2018-09-21 09:56:04,417 INFO avro-servlet-hb-processor-13:com.cloudera.server.common.AgentAvroServlet: (11 skipped) AgentAvroServlet: heartbeat processing stats: average=0ms, min=0ms, max=16ms.
2018-09-21 09:57:04,472 INFO avro-servlet-hb-processor-1:com.cloudera.server.common.AgentAvroServlet: (11 skipped) AgentAvroServlet: heartbeat processing stats: average=0ms, min=0ms, max=16ms.
The server seems to skip the heartbeats, and I don't know how to find out why. Any clue?
Created 09-21-2018 12:38 PM
I think "skipped" appears because the logger is throttled: only one of many such lines is printed, and the count refers to suppressed log lines, not to dropped heartbeats.
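To show the general idea, here is a minimal sketch of a throttled logger. It is not Cloudera Manager's implementation; the class name, the interval and the injectable `clock` and `emit` parameters are all assumptions made for the illustration.

```python
import time


class ThrottledLogger:
    """Emit at most one message per interval and report how many were suppressed."""

    def __init__(self, min_interval_s, clock=time.monotonic, emit=print):
        self.min_interval_s = min_interval_s
        self.clock = clock
        self.emit = emit
        self.last_emit = None
        self.skipped = 0

    def log(self, message):
        now = self.clock()
        too_soon = (
            self.last_emit is not None
            and now - self.last_emit < self.min_interval_s
        )
        if too_soon:
            self.skipped += 1  # count the suppressed line instead of printing it
            return
        prefix = "({} skipped) ".format(self.skipped) if self.skipped else ""
        self.emit(prefix + message)
        self.last_emit = now
        self.skipped = 0
```

With a 60-second interval and one message every 5 seconds, the second printed line would carry the prefix "(11 skipped)", even though every message was handled normally.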
Are all your agents having trouble heartbeating or just one or two?
Maybe take a screenshot of your Hosts tab in CM to give us an idea. That log snippet doesn't tell me much other than the fact that some clients have sent a heartbeat at some point since CM was started.
Created 02-04-2019 03:18 AM
I faced the same issue.
It turned out to be caused by AutoTLS being enabled, which is a feature of the Enterprise version only.
This is not obvious from the setup tutorial.
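One way to check an agent for this is to look at its TLS setting. The sketch below rests on assumptions that are not from this thread: that the agent configuration lives in /etc/cloudera-scm-agent/config.ini and that TLS is controlled by `use_tls` in its `[Security]` section. Please verify both against your own installation. The function name is made up.

```python
import configparser


def agent_tls_enabled(config_text):
    """Return True if the given agent config text has use_tls=1 under [Security]."""
    # interpolation=None so that '%' characters in other values cannot cause errors
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # a missing section or option is treated as TLS disabled
    value = parser.get("Security", "use_tls", fallback="0")
    return value.strip() == "1"


# Example call (assumed path, see above):
# with open("/etc/cloudera-scm-agent/config.ini") as f:
#     print(agent_tls_enabled(f.read()))
```

If this returns True on an agent while the server does not speak TLS on port 7182, a mismatch like the one described here would be a plausible suspect.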