We intermittently get the alert "The host's NTP service is not synchronized to any remote server."
The alerts then clear on their own, and the cluster reports that "The health test result for HOST_CLOCK_OFFSET has become good".
We are using chronyd.
When I run 'chronyc sources', I see two NTP servers listed, with ^* in front of the first one:
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* ipsap01.ecb.de                2   10   377    371    +40us[  +85us] +/- 8566us
^+ ipsap02.ecb.de                3    8   377     58    +31us[  +31us] +/-   14ms
The only thing I found in the cloudera-scm-agent.log file was this:
[root cloudera-scm-agent]# cat cloudera-scm-agent.log.1 | grep chronyc
[13/Nov/2019 14:35:26 +0000] 58865 Monitor-HostMonitor throttling_logger ERROR Timeout with args ['chronyc', 'sources']
Exception: timeout with args ['chronyc', 'sources']
[13/Nov/2019 19:15:24 +0000] 58865 Monitor-HostMonitor throttling_logger ERROR chronyc: chronyc sources: not synchronized to any server
[14/Nov/2019 08:19:25 +0000] 58865 Monitor-HostMonitor throttling_logger ERROR (11 skipped) chronyc: chronyc sources: not synchronized to any server
[14/Nov/2019 12:36:26 +0000] 58865 Monitor-HostMonitor throttling_logger ERROR (1 skipped) chronyc: chronyc sources: not synchronized to any server
[14/Nov/2019 14:36:26 +0000] 58865 Monitor-HostMonitor throttling_logger ERROR (3 skipped) chronyc: chronyc sources: not synchronized to any server
[14/Nov/2019 18:17:28 +0000] 58865 Monitor-HostMonitor throttling_logger ERROR chronyc: chronyc sources: not synchronized to any server
[14/Nov/2019 19:03:26 +0000] 58865 Monitor-HostMonitor throttling_logger ERROR (3 skipped) chronyc: chronyc sources: not synchronized to any server
What could be the problem?
Created 11-17-2019 09:55 PM
The log messages show that chrony loses synchronisation frequently.
Are you able to reach the server ipsap01.ecb.de at the times the issue occurs? Its LastRx value (371) is noticeably higher than that of the other available server (58).
Please compare this with the servers configured on other hosts that do not report the issue.
Check with your network/OS team about the availability of the servers and how to make time synchronization stable on the hosts.
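To put a number on "frequently", the agent log lines quoted in the question can be tallied per day. The following is only a rough sketch: the function name count_sync_errors is made up for illustration, and it assumes that "(N skipped)" means N further identical messages were suppressed by the throttling logger, which the log itself does not confirm.

```shell
# Hypothetical helper: reads cloudera-scm-agent log lines on stdin and
# prints "<date> <count>" for each day with "not synchronized" errors.
# Assumption: "(N skipped)" stands for N suppressed repeats of the message.
count_sync_errors() {
  awk '
    /not synchronized to any server/ {
      day = substr($1, 2)                 # "[13/Nov/2019" -> "13/Nov/2019"
      n = 1                               # the logged line itself
      if (match($0, /\([0-9]+ skipped\)/)) {
        # Text after "(" starts with the number; +0 converts its prefix.
        n += substr($0, RSTART + 1, RLENGTH - 1) + 0
      }
      if (!(day in total)) order[++k] = day
      total[day] += n
    }
    END { for (i = 1; i <= k; i++) print order[i], total[order[i]] }
  '
}
```

It could be run as: count_sync_errors < cloudera-scm-agent.log.1. The days with the highest counts are the ones worth comparing against network or NTP server events.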
Created 11-17-2019 10:27 PM
Please check the following:
Try running "ntpdate ipsap01.ecb.de" on all hosts and check whether the command reports any issue.
Make sure chrony.conf (or ntp.conf) is the same on all nodes.
hwclock --systohc
systemctl restart cloudera-scm-agent
Furthermore, if the above does not help, you need to debug on the NTP server side.
Run the commands below:
ntpq -c pe
The output shown is good, but note that if the refid column shows ".INIT.", it can indicate a communication issue.
ntpq -c as
The output below is good; however, if the reach column shows "no", it suggests that the client cannot reach its peer hosts.
You probably also need to check the stratum of your NTP servers:
The "assID" from ntpq -c as can be used with the command ntpq -c "rv assID" to determine the stratum. The lower the stratum, the better. The upper limit for stratum is 15; stratum 16 indicates that a device is unsynchronized.
ntpq -c "rv <association_id_from_above_command_output>"
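The two ntpq steps above can be chained so that the stratum is printed for every association in one go. This is a minimal sketch, not a tested tool: the function names assoc_ids and show_strata are invented here, and the parsing assumes the usual "ntpq -c as" layout in which the index is the first column and the association ID the second.

```shell
# Hypothetical helper: extract association IDs from `ntpq -c as` output
# on stdin. Assumes data rows start with two numeric columns (ind, assid).
assoc_ids() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Hypothetical helper: print "assid: stratum=N" for each association.
# Needs ntpq and a running ntpd on the host where it is executed.
show_strata() {
  ntpq -c as | assoc_ids | while read -r id; do
    printf '%s: ' "$id"
    ntpq -c "rv $id" | grep -o 'stratum=[0-9]*'
  done
}
```

Calling show_strata on a host then lists one line per peer; any peer reporting stratum=16 is unsynchronized.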
Created 11-17-2019 10:29 PM
The chronyd commands appear to be similar to the NTP ones; you can refer to this for debugging:
https://www.thegeekdiary.com/centos-rhel-7-tips-on-troubleshooting-ntp-chrony-issues/
Created 11-19-2019 12:44 PM
@MihailK On RHEL 7, chronyd takes precedence over ntpd. So it is worth checking whether the ntpd service is running; if it is, disable it. The system should use chronyd only.
Secondly, the agent checks the status of the clock every 2 seconds by reading the output of chronyc sources or ntpq. If it does not find * in the output, it marks that instant as failed and triggers an alert. So you also have to verify that the NTP servers stay in sync at all times, and ask your OS team whether they have seen any drops.
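To catch the moments when no source is selected, the same kind of check can be reproduced by hand. The sketch below is not the agent's actual code; the function names has_selected_source and watch_sync are made up, and the 2-second default simply mirrors the interval mentioned above.

```shell
# Hypothetical helper: succeeds (exit 0) when the `chronyc sources`
# output on stdin contains a line starting with "^*", i.e. a source
# that chronyd is currently synchronized to.
has_selected_source() {
  awk 'index($0, "^*") == 1 { found = 1 } END { exit !found }'
}

# Hypothetical helper: poll chronyc and print a timestamp whenever no
# source is selected. The interval in seconds is the first argument
# (default 2). Runs until interrupted.
watch_sync() {
  while :; do
    if ! chronyc sources 2>/dev/null | has_selected_source; then
      echo "$(date '+%F %T') no selected source"
    fi
    sleep "${1:-2}"
  done
}
```

Leaving watch_sync running in a terminal (or redirecting it to a file) gives timestamps that can be matched against the alerts and handed to the OS team.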
Created 11-28-2019 09:48 AM
@MihailK did this resolve the issue? If yes, please take a moment to mark this as the solution. Thanks.