Strange loss of Ambari heartbeats on some hosts

New Contributor

I lost all heartbeats from some DataNodes in my cluster after the machines were restarted.

The problem occurs right after the ambari-agent on the host connects to ambari-server.

These are the last lines logged in /var/log/ambari-agent/ambari-agent.log on a faulty DataNode:

INFO 2016-04-20 17:59:48,925 PingPortListener.py:50 - Ping port listener started on port: 8670
INFO 2016-04-20 17:59:48,927 main.py:283 - Connecting to Ambari server at https://hmaster1.xxx.local:8440 (10.10.238.111)

NetUtil.py:60 - Connecting to https://hmaster1.xxx.local:8440/ca

On the working DataNodes, the process continues with this log line:

INFO 2016-04-20 17:51:22,147 threadpool.py:52 - Started thread pool with 3 core threads and 20 maximum threads

In /var/log/ambari-server/ambari-server.log on the Ambari server, I see nothing about communication between the faulty DataNode and the Ambari master.

Note that I am using the latest version of Ambari (2.2.1.1) and CentOS 7 with the latest updates.

I disabled all firewall rules, and the configuration is identical on the working DataNodes and the faulty one.

Any idea what causes this strange problem?

1 ACCEPTED SOLUTION

Super Collaborator

First, check whether these DataNodes are reachable from the Ambari server over SSH using their hostnames. Then test the other direction: telnet from each DataNode to the Ambari server, using the Ambari server's hostname, on port 8440. If everything looks good, kill the current ambari-agent daemon and restart the service. Please make sure no hung or stale ambari-agent instance is still running.

If that does not work, stop the Ambari server and stop the PostgreSQL database server.

Then start ambari-server; it will start the PostgreSQL server itself.

Let me know if this does not fix the issue.
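For anyone who would rather script the reachability checks above than run ssh and telnet by hand, here is a minimal Python sketch. It only tests whether a TCP connection can be opened; it does not log in over SSH or speak the agent protocol. The function name can_connect and the default timeout are my own choices, not something Ambari provides.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    A hostname that does not resolve, a refused connection, and a
    timeout all count as "not reachable" and return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.gaierror (name resolution) and socket.timeout are
        # both subclasses of OSError, so they are handled here too.
        return False
```

Calling can_connect("hmaster1.xxx.local", 8440) on a DataNode (with your own server hostname) should return True when the name resolves and the port is open. Running it from the Ambari server against each DataNode hostname on port 22 covers the SSH direction, assuming SSH listens on its default port.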

4 REPLIES

New Contributor

It was an ssh problem between machines. Thank you.

Expert Contributor
@K. Karray
  1. Kill any stale ambari-agent processes on the affected nodes (ps -ef | grep ambari-agent).
  2. Restart the ambari-agent manually (sudo systemctl start ambari-agent).
  3. If the issue persists, share the ambari-agent logs.
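
If you need to check many nodes, the first step can be scripted. The following is a rough Python sketch that filters `ps -ef` output for agent processes. It assumes the agent's command line contains either "ambari-agent" or "ambari_agent" (this may differ by Ambari version), and the function names are hypothetical helpers of mine.

```python
import re
import subprocess

# Assumption: the agent shows up with "ambari-agent" or "ambari_agent"
# somewhere in its command line.
AGENT_PATTERN = re.compile(r"ambari[-_]agent")


def find_agent_processes(ps_output):
    """Return (pid, command) pairs for agent processes in `ps -ef` output."""
    matches = []
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        pid, command = fields[1], fields[7]
        # Skip the grep process itself if the output came through a pipe.
        if command.split()[0].endswith("grep"):
            continue
        if AGENT_PATTERN.search(command):
            matches.append((int(pid), command))
    return matches


def list_running_agents():
    """Run `ps -ef` on the local host and return the matching processes."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return find_agent_processes(result.stdout)
```

If list_running_agents() still returns entries after you have stopped the agent, those are the stale instances to kill before starting it again.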

On each host that lost its heartbeat, verify the following:

  • Check the firewall status. The firewall should be stopped.

service iptables stop

  • Check the /etc/ambari-agent/conf/ambari-agent.ini file.

In ambari-agent.ini, the hostname entry should be

hostname = ambariservernodehost

where ambariservernodehost is the Ambari server's hostname, which should also be present in the /etc/hosts file.

  • Upgrade the openssl version.
  • Stop ambari-server.
  • Stop the ambari-agent service on all nodes.
  • Start the ambari-agent service on all nodes.
  • Start ambari-server.
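
The ambari-agent.ini and /etc/hosts check from the list above can be sketched in Python as follows. This is only an illustration: it assumes the hostname key sits in the [server] section, as in a default agent configuration, and the helper names are made up for this example.

```python
import configparser


def read_server_hostname(ini_text):
    """Return the [server] hostname from ambari-agent.ini text, or None."""
    # interpolation=None keeps any literal '%' in values from being parsed;
    # strict=False tolerates duplicate keys instead of raising.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("server", "hostname", fallback=None)


def hosts_file_names(hosts_text):
    """Return the set of hostnames and aliases listed in /etc/hosts text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # fields[0] is the IP address
    return names


def server_is_in_hosts(ini_path="/etc/ambari-agent/conf/ambari-agent.ini",
                       hosts_path="/etc/hosts"):
    """Check that the configured server hostname appears in /etc/hosts."""
    with open(ini_path) as ini_file, open(hosts_path) as hosts_file:
        hostname = read_server_hostname(ini_file.read())
        return hostname is not None and hostname in hosts_file_names(hosts_file.read())
```

If server_is_in_hosts() returns False on a host with lost heartbeats, compare the hostname value in ambari-agent.ini with the entries in /etc/hosts. Note that a hostname resolved through DNS rather than /etc/hosts would also report False here.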

Check the ambari-agent logs. If the problem persists, please reply to me.