Heartbeat lost[Ambari-agent]

Contributor

Hi,
Newbie here. One of the nodes suddenly lost its heartbeat. I tried restarting ambari-agent and ambari-server, but the error persists. Here is the ambari-agent log.

 

WARNING 2020-02-11 15:24:08,318 base_alert.py:138 - [Alert][ranger_admin_password_check] Unable to execute alert. argument of type 'NoneType' is not iterable
INFO 2020-02-11 15:24:14,721 security.py:141 - Encountered communication error. Details: SSLError('The read operation timed out',)
ERROR 2020-02-11 15:24:14,721 Controller.py:226 - Unable to connect to: https://xxx1:8441/agent/v1/register/xxx2.com
Traceback (most recent call last):
File "/usr/lib/ambari-agent/lib/ambari_agent/Controller.py", line 175, in registerWithServer
ret = self.sendRequest(self.registerUrl, data)
File "/usr/lib/ambari-agent/lib/ambari_agent/Controller.py", line 549, in sendRequest
raise IOError('Request to {0} failed due to {1}'.format(url, str(exception)))
IOError: Request to https://xxx1.com:8441/agent/v1/register/xxx2.com failed due to Error occured during connecting to the server: ('The read operation timed out',)
ERROR 2020-02-11 15:24:14,721 Controller.py:227 - Error:Request to https://xxx1.com:8441/agent/v1/register/xxx2.com failed due to Error occurred during connecting to the server: ('The read operation timed out',)

 

Note: I can telnet to ports 8440 and 8441 manually, and all ports are listening.

Thanks in advance.

1 ACCEPTED SOLUTION

Contributor

Hi @lwang @jsensharma 

 

Thank you for the useful information that you've provided.

 

After some testing, I found that there was an issue with one of the network interfaces on the servers, which we identified by testing jumbo frame connectivity. We removed the defective module and the lost heartbeat has been resolved. Thank you for your assistance, guys!
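
The post does not say how jumbo frame connectivity was tested. One common approach, sketched here as an assumption, is to ping the peer with the largest payload that fits in a single 9000-byte frame while forbidding fragmentation (`-M do` on Linux iputils ping). The helper names below are hypothetical and the code targets Python 3.

```python
import subprocess

IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def build_jumbo_ping(host, mtu=9000, count=3):
    """Build a Linux ping command that forbids fragmentation (-M do)."""
    return ["ping", "-M", "do", "-s", str(icmp_payload_for_mtu(mtu)),
            "-c", str(count), host]


def jumbo_path_ok(host, mtu=9000):
    """Return True if full-size, non-fragmented packets get replies."""
    result = subprocess.run(build_jumbo_ping(host, mtu),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Calling `jumbo_path_ok("xxx1.com")` from the agent host (using the placeholder server name from the log) would return False if any interface or switch on the path cannot carry 9000-byte frames, even though small packets such as a telnet handshake still get through.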


7 REPLIES

Guru

Hi @TR7_BRYLE ,

 

What is your Ambari version? You may want to check this knowledge article:

https://my.cloudera.com/knowledge/ERROR-quot-Request-to-https-AMBARI-SERVER-8441-agent-v1?id=273271

 

In case you cannot access the article above, here are some details:

Cause:

This issue occurs when the Ethernet card or the switch does not support jumbo frames, but jumbo frames (MTUSIZE=9000) are set in the network configuration.

To verify whether jumbo frames are enabled, check the network interface configuration by running the following:

cat /etc/sysconfig/network-scripts/ifcfg-eth#


Jumbo frames are enabled if the output contains the MTUSIZE=9000 line, as in the following:

TYPE=Ethernet 
DEVICE=eth0 
ONBOOT=yes 
BOOTPROTO=static 
IPADDR=xxx.xxx.xxx.xxx
NETMASK=xxx.xxx.xxx.xxx
MTUSIZE=9000
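
The check above can also be scripted when many nodes need to be inspected. The following is a minimal Python 3 sketch, not part of the knowledge article: the function names are hypothetical, and looking for an `MTU` key in addition to `MTUSIZE` is an assumption for systems that spell the setting that way.

```python
def read_ifcfg(text):
    """Parse KEY=VALUE lines from an ifcfg-style file into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def jumbo_frame_setting(text, keys=("MTUSIZE", "MTU"), threshold=1500):
    """Return (key, value) for an MTU-like key above 1500, else None."""
    settings = read_ifcfg(text)
    for key in keys:
        value = settings.get(key)
        if value is not None and value.isdigit() and int(value) > threshold:
            return key, int(value)
    return None
```

Passing the contents of an interface file, for example `jumbo_frame_setting(open(path).read())`, returns the offending key and value, or None when no jumbo frame setting is present.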


Instructions:
To resolve this issue, do the following on each affected node:

1. From /etc/sysconfig/network-scripts/ifcfg-eth#, remove the following:

MTUSIZE=9000


2. Restart the network:

/etc/init.d/network restart


3. Restart the ambari-agent:

ambari-agent restart

 

Thanks and hope this helps!

Li Wang, Technical Solution Manager



Contributor

Hi @lwang .

 

Thank you for this suggestion. I will perform this troubleshooting.

 

Thanks

Master Mentor

@TR7_BRYLE 

The error is actually caused by a timeout, not by port access:

SSLError('The read operation timed out',)

This error indicates that a later stage of the communication, such as reading a response, is timing out. So we first have to check why the HTTPS request is timing out.

 

We can use a simple Python script like the following to simulate what the agent does. The Ambari agent is a Python utility that connects to the Ambari server, registers itself, and then sends heartbeat messages to the server.

Run the following script from the agent host to see whether it can connect or whether it also times out. It uses 'httplib' to test access and HTTPS communication.

# cat /tmp/SSL/ssl_test.py

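# Python 2 script (httplib and the print statement are Python 2 only)
import httplib
import ssl

if __name__ == "__main__":
    # Replace kerlatest1.example.com with your Ambari server host.
    # Certificate verification is skipped; we only check whether the server answers.
    ca_connection = httplib.HTTPSConnection('kerlatest1.example.com:8440', timeout=5, context=ssl._create_unverified_context())
    ca_connection.request("GET", '/connection_info')
    response = ca_connection.getresponse()
    print response.status
    data = response.read()
    print str(data)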


Run it as follows:

# export PYTHONPATH=/usr/lib/ambari-agent/lib:/usr/lib/ambari-agent/lib/ambari_agent:$PYTHONPATH
# python /tmp/SSL/ssl_test.py

If it works, it returns 200 and a result like the following:

# python /tmp/SSL/ssl_test.py 
200
{"security.server.two_way_ssl":"false"}
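
On a host where only Python 3 is available, the same check could be sketched roughly as below. This is a variant written under assumptions, not the script from the reply: the function names `probe_connection_info` and `classify_failure` are hypothetical, while port 8440 and the `/connection_info` path are taken from the script above. It also separates timeouts from other SSL errors, since that distinction is what this thread hinges on.

```python
import http.client
import socket
import ssl


def classify_failure(exc):
    """Map an exception to a coarse label: 'timeout', 'ssl', or 'network'."""
    # Check timeouts first: a read timeout during TLS can surface as SSLError
    if isinstance(exc, socket.timeout) or "timed out" in str(exc):
        return "timeout"
    if isinstance(exc, ssl.SSLError):
        return "ssl"
    return "network"


def probe_connection_info(host, port=8440, timeout=5):
    """GET /connection_info over HTTPS without certificate verification."""
    context = ssl.create_default_context()
    context.check_hostname = False      # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    conn = http.client.HTTPSConnection(host, port, timeout=timeout,
                                       context=context)
    try:
        conn.request("GET", "/connection_info")
        response = conn.getresponse()
        body = response.read().decode("utf-8", "replace")
        return "ok", response.status, body
    except (OSError, http.client.HTTPException) as exc:
        return classify_failure(exc), None, str(exc)
    finally:
        conn.close()
```

A result whose first element is "timeout" while telnet to the same port succeeds suggests that small packets pass but larger ones are lost somewhere on the path.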


If you notice any HTTPS communication or certificate-related error, refer to the article below and, depending on your Ambari version, check whether the following is defined in the "[security]" section of your ambari-agent.ini file:

[security]
force_https_protocol=PROTOCOL_TLSv1_2
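
To confirm the setting on several agents without opening each file by hand, a small Python 3 sketch along these lines could be used. The function name is hypothetical, and the file location mentioned in the note below is an assumption about a default install.

```python
import configparser


def forced_tls_protocol(ini_text):
    """Return force_https_protocol from [security], or None if it is unset."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("security", "force_https_protocol", fallback=None)
```

For example, `forced_tls_protocol(open("/etc/ambari-agent/conf/ambari-agent.ini").read())` should return `PROTOCOL_TLSv1_2` when the setting above is in place, assuming the agent configuration lives at that path.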



- If you still face any issue, please share a fresh "ambari-agent.log" taken right after restarting the agent.

 

Reference Article:
Java/Python Updates and Ambari Agent TLS Settings
https://community.cloudera.com/t5/Community-Articles/Java-Python-Updates-and-Ambari-Agent-TLS-Settin...


Contributor

Hi @jsensharma 

 

Thank you for this suggestion. I will try the Python script and test it.

 

Thanks

Contributor

Hi @jsensharma .

 

One more thing: I have already declared this in my ambari.ini file.

 

[security]
force_https_protocol=PROTOCOL_TLSv1_2

  Thanks.

Master Mentor

@TR7_BRYLE 

As requested earlier:
- If you still face any issue, please share a fresh "ambari-agent.log" taken right after restarting the agent.
