Ambari 2.5.1 server and agent disagree on heartbeat

I had never seen this before upgrading to 2.5.1.

On a 6-node cluster that had been running for 2 weeks, one of the hosts is listed as having lost its heartbeat. The agent is reporting metrics and all the components are running fine without alerts. The only visible effect is that the node actions are disabled.

However, looking at the agent's log, the heartbeat seems to be running normally and continuously, e.g.:

INFO 2017-07-25 01:55:31,110 - Heartbeat (response id = 621881) with server is running...
INFO 2017-07-25 01:55:31,110 - Building heartbeat message
INFO 2017-07-25 01:55:31,112 - Adding host info/state to heartbeat message.
INFO 2017-07-25 01:55:31,163 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2017-07-25 01:55:31,163 - Testing the JVM's JCE policy to see it if supports an unlimited key length.
INFO 2017-07-25 01:55:31,289 - Some mount points were ignored: /, /dev, /dev/shm, /sys/fs/cgroup, /run, /boot, /var/log, /hadoop, /hadoop/druid, /hadoop/yarn/local, /run/user/1017, /run/user/1006, /run/user/1002, /run/user/1003
INFO 2017-07-25 01:55:31,291 - Sending Heartbeat (id = 621881)
INFO 2017-07-25 01:55:31,335 - Heartbeat response received (id = 621882)
INFO 2017-07-25 01:55:31,336 - Heartbeat interval is 1 seconds
INFO 2017-07-25 01:55:31,336 - Updating configurations from heartbeat
INFO 2017-07-25 01:55:31,336 - Adding cancel/execution commands
INFO 2017-07-25 01:55:31,336 - Waiting 0.9 for next heartbeat
INFO 2017-07-25 01:55:32,236 - Wait for next heartbeat over
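A quick way to back up the "running normally and continuously" impression is to scan the whole agent log rather than eyeball an excerpt. The following is only a minimal sketch: the function names and the 10-second threshold are made up for illustration, the patterns are written against the lines quoted above (other agent versions may log differently), and the idea that a response carries the sent id plus one is inferred from this excerpt alone.

```python
import re
from datetime import datetime

# Patterns are based on the log lines quoted above. The lazy ".*?" tolerates
# an optional "file:line" field between the timestamp and the dash.
SENT = re.compile(r"^INFO (\S+ \S+) .*?- Sending Heartbeat \(id = (\d+)\)")
RECV = re.compile(r"^INFO (\S+ \S+) .*?- Heartbeat response received \(id = (\d+)\)")
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def find_heartbeat_gaps(lines, max_gap_seconds=10.0):
    """Return (previous, current, seconds) for every pause between two
    consecutive 'Sending Heartbeat' lines longer than max_gap_seconds.
    The threshold is an arbitrary illustrative value."""
    gaps = []
    previous = None
    for line in lines:
        match = SENT.match(line)
        if not match:
            continue
        current = datetime.strptime(match.group(1), TS_FORMAT)
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta > max_gap_seconds:
                gaps.append((previous, current, delta))
        previous = current
    return gaps


def unanswered_heartbeats(lines):
    """Return sent heartbeat ids with no matching response in the log.
    Assumption taken from the excerpt: the response id is the sent id + 1."""
    sent, received = set(), set()
    for line in lines:
        match = SENT.match(line)
        if match:
            sent.add(int(match.group(2)))
        match = RECV.match(line)
        if match:
            received.add(int(match.group(2)))
    return sorted(i for i in sent if i + 1 not in received)
```

Both functions take any iterable of lines, so an open log file can be passed directly (the file has to be reopened or read into a list to run both). If neither reports anything around the time the server flagged the host, that would suggest the problem is on the server side, which fits with a server restart clearing it.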

Both server and agent are clean installations, and all the Ambari packages are on the same version:

Restarting the server solved it without having to restart the agent.

Has anybody seen this behavior? So far it has only happened once, but I have only recently started using 2.5.1.


@Gonzalo Herreros

Did the Ambari server logs report any issues at the time of the lost heartbeats?

From the time the server finished starting up, only one message was printed, repeating every 5 minutes:

25 Jul 2017 14:12:22,785  INFO [pool-18-thread-1] MetricsServiceImpl:64 - Checking for metrics sink initialization

However, since I restarted it is gone, and so far the heartbeat is fine. Maybe it is a coincidence. Thinking about it, I don't know when the heartbeat was lost: it is only noticeable on the hosts list or on the specific host's screen, and it doesn't show on the main screen because the services are fine.
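Since the lost heartbeat is easy to miss in the UI, one option is to poll the Ambari REST API and record when a host changes state. This is a rough sketch, not a tested monitoring tool: the function names are hypothetical, it uses plain HTTP basic auth, and it assumes the hosts resource exposes `Hosts/host_state` (with the value `HEARTBEAT_LOST`) and `Hosts/last_heartbeat_time`. Check those field names against your own server's API output before relying on it.

```python
import base64
import json
import urllib.request


def fetch_hosts(base_url, cluster, user, password):
    """Fetch host state for every host in the cluster.
    base_url is something like "http://ambari-host:8080" (placeholder)."""
    url = (
        f"{base_url}/api/v1/clusters/{cluster}/hosts"
        "?fields=Hosts/host_state,Hosts/last_heartbeat_time"
    )
    request = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def hosts_with_lost_heartbeat(payload):
    """Return (host_name, last_heartbeat_time) for each host the server
    reports as HEARTBEAT_LOST. Field names are assumed, see the note above."""
    lost = []
    for item in payload.get("items", []):
        hosts = item.get("Hosts", {})
        if hosts.get("host_state") == "HEARTBEAT_LOST":
            lost.append((hosts.get("host_name"), hosts.get("last_heartbeat_time")))
    return lost
```

Running `hosts_with_lost_heartbeat(fetch_hosts(...))` from cron every minute and logging any non-empty result would give a timestamp for when the server stopped seeing the heartbeat, which can then be compared with the agent log for the same period.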