I never saw this before upgrading to 2.5.1.
On a 6-node cluster, after 2 weeks of running, one of the hosts is listed as having lost its heartbeat. The agent is still reporting metrics, and all the components are running fine without alerts. The only symptom is that the node actions are disabled.
However, looking at the agent's log, the heartbeat seems to be running normally and continuously, e.g.:
INFO 2017-07-25 01:55:31,110 Controller.py:304 - Heartbeat (response id = 621881) with server is running...
INFO 2017-07-25 01:55:31,110 Controller.py:311 - Building heartbeat message
INFO 2017-07-25 01:55:31,112 Heartbeat.py:90 - Adding host info/state to heartbeat message.
INFO 2017-07-25 01:55:31,163 logger.py:75 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2017-07-25 01:55:31,163 logger.py:75 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2017-07-25 01:55:31,289 Hardware.py:176 - Some mount points were ignored: /, /dev, /dev/shm, /sys/fs/cgroup, /run, /boot, /var/log, /hadoop, /hadoop/druid, /hadoop/yarn/local, /run/user/1017, /run/user/1006, /run/user/1002, /run/user/1003
INFO 2017-07-25 01:55:31,291 Controller.py:320 - Sending Heartbeat (id = 621881)
INFO 2017-07-25 01:55:31,335 Controller.py:332 - Heartbeat response received (id = 621882)
INFO 2017-07-25 01:55:31,336 Controller.py:341 - Heartbeat interval is 1 seconds
INFO 2017-07-25 01:55:31,336 Controller.py:377 - Updating configurations from heartbeat
INFO 2017-07-25 01:55:31,336 Controller.py:386 - Adding cancel/execution commands
INFO 2017-07-25 01:55:31,336 Controller.py:471 - Waiting 0.9 for next heartbeat
INFO 2017-07-25 01:55:32,236 Controller.py:478 - Wait for next heartbeat over
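To check whether the agent really kept sending heartbeats over the whole two weeks, rather than eyeballing an excerpt, the log could be scanned for pauses between "Sending Heartbeat" entries. The following is only a rough sketch based on the line format shown above; the function names and the 10-second threshold are my own assumptions, not anything from Ambari.

```python
import re
from datetime import datetime, timedelta

# Matches agent log entries such as:
# INFO 2017-07-25 01:55:31,291 Controller.py:320 - Sending Heartbeat (id = 621881)
SEND_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) \S+ - Sending Heartbeat \(id = (\d+)\)"
)


def sent_heartbeats(text):
    """Return (timestamp, id) for every 'Sending Heartbeat' entry found in text."""
    found = []
    for stamp, millis, hb_id in SEND_RE.findall(text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        when += timedelta(milliseconds=int(millis))
        found.append((when, int(hb_id)))
    return found


def heartbeat_gaps(text, max_gap_seconds=10.0):
    """Return (previous, current) pairs where the pause between two sends
    exceeded max_gap_seconds (hypothetical threshold, adjust as needed)."""
    beats = sent_heartbeats(text)
    return [
        (prev, cur)
        for prev, cur in zip(beats, beats[1:])
        if (cur[0] - prev[0]).total_seconds() > max_gap_seconds
    ]
```

Reading the agent log file into a string and passing it to heartbeat_gaps() would list any long pauses. An empty result would suggest the agent side was sending continuously and the problem is on the server side.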
Both server and agent are clean installations and all the Ambari packages are on the same version: 18.104.22.168-159
Restarting the server solved it without having to restart the agent.
Has anybody seen this behavior? So far it has only happened once, but I have only recently started using 2.5.1.
After the server had finished starting up, only one message was printed, every 5 minutes:
25 Jul 2017 14:12:22,785 INFO [pool-18-thread-1] MetricsServiceImpl:64 - Checking for metrics sink initialization
However, since I restarted, it is gone, and so far the heartbeat is fine. Maybe it is a coincidence. Thinking about it, I don't know when the heartbeat was lost, since it is only noticeable if you go to the hosts screen or the specific host screen. It doesn't show on the main screen because the services are fine.
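Since a lost heartbeat is easy to miss in the UI, one option would be to poll the server's REST API and flag affected hosts. This is a sketch under assumptions I have not verified on 2.5.1: that the /api/v1/hosts endpoint returns a Hosts/host_state field and that HEARTBEAT_LOST is the value reported in this situation. The function names are hypothetical.

```python
import base64
import json
import urllib.request


def fetch_hosts(base_url, user, password):
    """GET the host list with state fields from the Ambari server.
    The endpoint and field names are assumptions, see the note above."""
    url = (
        base_url.rstrip("/")
        + "/api/v1/hosts?fields=Hosts/host_state,Hosts/last_heartbeat_time"
    )
    request = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def hosts_without_heartbeat(payload):
    """Return the names of hosts whose reported state is HEARTBEAT_LOST."""
    return [
        item["Hosts"]["host_name"]
        for item in payload.get("items", [])
        if item.get("Hosts", {}).get("host_state") == "HEARTBEAT_LOST"
    ]
```

Calling hosts_without_heartbeat(fetch_hosts(...)) from a cron job, with the server URL and credentials filled in, would at least tell me when the state changed instead of finding out days later.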