Created 04-20-2016 06:15 PM
After restarting the machines, I lost all heartbeats from some DataNodes in my cluster.
The problem occurs just after the host's ambari-agent connects to ambari-server.
These are the last log lines in /var/log/ambari-agent/ambari-agent.log on the defective DataNode:
INFO 2016-04-20 17:59:48,925 PingPortListener.py:50 - Ping port listener started on port: 8670
INFO 2016-04-20 17:59:48,927 main.py:283 - Connecting to Ambari server at https://hmaster1.xxx.local:8440 (10.10.238.111)
NetUtil.py:60 - Connecting to https://hmaster1.xxx.local:8440/ca
On the working DataNodes, the process continues with this log line:
INFO 2016-04-20 17:51:22,147 threadpool.py:52 - Started thread pool with 3 core threads and 20 maximum threads
In the ambari-server log, /var/log/ambari-server/ambari-server.log, I see nothing exchanged between the defective DataNode and the Ambari master.
Note that I am using the latest version of Ambari, 2.2.1.1, and CentOS 7 with the latest updates.
I disabled all firewall rules, and the configuration is the same on the working DataNodes and the defective one.
Any idea about this strange problem?
Created 04-20-2016 07:20 PM
First, check whether these DataNodes are reachable from ambari-server over SSH using their hostnames. Then check the reverse direction: telnet from the DataNode to the Ambari server, using the ambari-server hostname, on port 8440. If everything looks good, kill the current ambari-agent daemon and restart the service. Please make sure no hung or stale instance of ambari-agent is still running.
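If telnet is not installed on the DataNode, the same port check can be sketched in Python. This is only a minimal sketch: the function name can_connect is made up for illustration, and it tests nothing more than whether a TCP connection opens, so it says nothing about SSH keys or certificates.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

For example, running can_connect("hmaster1.xxx.local", 8440) on the DataNode checks the agent port from the log above, and can_connect with port 22 from the Ambari server towards the DataNode checks that the SSH port is reachable.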
If that does not work, stop the Ambari server, then stop the PostgreSQL DB server.
Now start ambari-server; it will start the PostgreSQL server itself.
Let me know if this does not fix the issue.
Created 04-21-2016 12:35 PM
It was an ssh problem between machines. Thank you.
Created 04-30-2017 05:52 PM
On the host that lost its heartbeat, check the following points.
Stop the firewall:
service iptables stop
In the ambari-agent configuration file, the hostname entry should be:
hostname = ambariservernodehost
where ambariservernodehost is the hostname of the Ambari server node, which should also be present in the /etc/hosts file.
Then check the ambari-agent logs. If the problem persists, please reply to me.
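The hostname and /etc/hosts checks above can be sketched in Python as follows. This is a rough sketch under assumptions: the function names are made up for illustration, and it assumes the hostname entry sits in the [server] section of the agent configuration file, usually /etc/ambari-agent/conf/ambari-agent.ini; adjust if your installation differs.

```python
import configparser


def server_hostname(ini_text):
    """Return the hostname entry from the [server] section of the agent config text."""
    # interpolation=None keeps '%' characters in other values from raising errors;
    # strict=False tolerates duplicate keys. Both are defensive assumptions.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("server", "hostname")


def hosts_file_has(hosts_text, name):
    """Return True if `name` is listed as a hostname or alias in /etc/hosts-style text."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        # The first field is the IP address; the rest are hostnames and aliases.
        if name in fields[1:]:
            return True
    return False
```

To use it, read both files as text, call server_hostname on the agent configuration, and pass the result to hosts_file_has together with the contents of /etc/hosts. If it returns False, the agent is pointed at a name that the hosts file does not list.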