It's time we have a talk about NTP. Why does it keep dying even though I followed the documentation exactly?

Explorer

My NTP service keeps dying on some of my VMs but not others, and I cannot figure out why. I've followed every piece of advice I've been given so far, to no avail. The NTP service seems to be dying without the clock drifting: when I use `ntpdate` to re-sync to the pool, the offset has never been more than ~50ms.

 

yum install ntp -y
systemctl start ntpd
systemctl enable ntpd
ntpdate -u pool.ntp.org
hwclock --systohc



The VMs are hosted across 3 different ESXi servers that are connected via vCenter. The ESXi hosts all have an NTP service running and enabled, and they all sync to the same external NTP server. Is there anything in `ntp.conf` that should be changed from the default to prevent complications? I have no idea why the ntp service randomly dies on some hosts while other hosts have had it running without issue for over a month, but only if CM doesn't have the proper PID for the ntp service. As I'm writing this I think I may have solved it. By default, ntp.conf is set to not allow queries. So is Cloudera querying ntp for its offset to the point that it kills the service as a security measure or something?

from ntp.conf:

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default nomodify notrap nopeer noquery

Is Cloudera making NTP kill itself by querying it for its offset? Please update the documentation to reflect this if that is the case!
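On the `noquery` question, a rough check may help narrow things down. As far as I know, `noquery` only makes ntpd ignore `ntpq`/`ntpdc` queries from matching addresses; it is not documented as stopping the daemon. A stock RHEL/CentOS 7 `ntp.conf` usually also carries a bare `restrict 127.0.0.1` line, which re-allows queries from the local host (treat that as an assumption about your file). The sketch below uses a hypothetical helper, `local_queries_allowed`, to test for such a line. It is deliberately simplified and ignores `::1`, masks, and the `-4`/`-6` forms.

```shell
# Hypothetical helper, not part of ntp or Cloudera Manager.
# Succeeds if the given ntp.conf has a "restrict 127.0.0.1" line
# that does not carry the noquery flag.
local_queries_allowed() {
    conf="${1:-/etc/ntp.conf}"
    grep -E '^[[:space:]]*restrict[[:space:]]+127\.0\.0\.1([[:space:]]|$)' "$conf" 2>/dev/null \
        | grep -qv 'noquery'
}

# Example use on a host (commented out so the block has no side effects):
# if local_queries_allowed /etc/ntp.conf; then
#     echo "local queries look permitted"
# else
#     echo "no unrestricted localhost entry found"
# fi
```

If the check fails, a locally run query tool would get no answer from ntpd even while ntpd itself keeps running.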

 

Accepted Solutions

Re: It's time we have a talk about NTP. Why does it keep dying even though I followed the documentation exactly?

Cloudera Employee

To my knowledge, the cluster checks NTP with the following command:

 

# ntpdc -np

 

This is used in the health checks for the hosts. This is included in the docs under the Host Health Tests. For example in CDH 5.16.x:

 

https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cm_ht_host.html#concept_lxn_zxn_yk

 

You might also consider Chrony over NTP. From a previous discussion I had with a subject matter expert at Red Hat, these are the things chronyd can do better than ntpd:

- chronyd can work well in an environment where access to the time reference is intermittent, whereas ntpd needs regular polling of time reference to work well.
- chronyd can perform well even when the network is congested for longer periods of time.
- chronyd can usually synchronize the clock faster and with better accuracy.
- chronyd quickly adapts to sudden changes in the rate of the clock, for example, due to changes in the temperature of the crystal oscillator, whereas ntpd may need a long time to settle down again.
- In the default configuration, chronyd never steps the time after the clock has been synchronized at system start, in order not to upset other running programs. ntpd can be configured to never step the time too, but it has to use a different means of adjusting the clock, which has some disadvantages including negative effect on accuracy of the clock.
- chronyd can adjust the rate of the clock on a Linux system in a larger range, which allows it to operate even on machines with a broken or unstable clock. For example, on some virtual machines.
- chronyd is smaller, it uses less memory and it wakes up the CPU only when necessary, which is better for power saving.

 

I think the intermittent connection and virtual-machine specific details might apply to your use case.

 

Their relevant documentation:

Differences Between ntpd and chronyd

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/system_administ...

UNDERSTANDING CHRONY AND ITS CONFIGURATION

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_...

 

Note that you will have to disable ntpd, because the cluster checks for ntpd first and chronyd second.
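A minimal sketch of that switch, assuming a systemd host where the units are named `ntpd` and `chronyd` (as on RHEL/CentOS 7) and chrony is already installed. The `run` and `switch_to_chrony` helpers are hypothetical names introduced here so the steps can be previewed before anything is changed.

```shell
# Hypothetical wrapper: with DRY_RUN=1 it prints the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop ntpd and keep it from starting at boot, then hand timekeeping to chronyd.
switch_to_chrony() {
    run systemctl stop ntpd
    run systemctl disable ntpd
    run systemctl enable chronyd
    run systemctl start chronyd
    # Show chronyd's current synchronization status.
    run chronyc tracking
}
```

Running `DRY_RUN=1 switch_to_chrony` prints the five commands without executing them; running `switch_to_chrony` as root applies them.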

 

Regards,

Ryan Blough, COE

Cloudera Inc.


3 REPLIES

Re: It's time we have a talk about NTP. Why does it keep dying even though I followed the documentation exactly?

Master Collaborator

@Cl0ck I am not sure if this helps, but in the past, when I have had ntp stability issues across a cluster, I pointed my master Ambari Server at the internal ntp clock server I wanted. Then, on all the other nodes, I used the master Ambari Server's hostname/IP as the time source instead of the ntp servers. This lets the main machine get its time from the internal clock and share it with the rest of the nodes.

 


 


If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic please comment here or feel free to private message me. If you have new questions related to your Use Case please create separate topic and feel free to tag me in your post.  


 


Thanks,



Steven

Re: It's time we have a talk about NTP. Why does it keep dying even though I followed the documentation exactly?

Explorer

Thank you for this reply! This has been quite difficult for me to troubleshoot, but I finally figured it out. The machines I've been using had chrony on them all along, but the previous machines I set up did not have chrony installed. Chrony and ntpd were both enabled, and ntpd was exiting on reboot. Because the host monitor issues "ntpq -np", and ntpd was loaded but inactive, it would report a failure to query the server, even though chrony was running. I had no idea that chrony was installed, so the whole problem could've been solved by just disabling/uninstalling ntpd. I spent WAY too many hours to come to such a simple solution.

It may be very helpful to someone who doesn't understand network time protocols very well if the documentation explained potential conflicts between ntpd and chronyd, or even suggested taking a second to check which (if any) you already have installed. Maybe it won't be an issue for most people, but for me, assuming that I didn't already have chrony running cost me a bunch of time getting my cluster healthy.
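For anyone landing here later, the check described above can be sketched roughly as follows. This assumes a systemd host with units named `ntpd` and `chronyd`; `classify_time_daemons` is a hypothetical helper that only interprets the two state strings, so it can be tried without touching any service.

```shell
# Hypothetical helper: classify the ntpd/chronyd combination.
# Arguments are the strings printed by `systemctl is-active <unit>`.
classify_time_daemons() {
    ntpd_state="$1"
    chronyd_state="$2"
    if [ "$ntpd_state" = active ] && [ "$chronyd_state" = active ]; then
        echo "conflict"   # both daemons are running
    elif [ "$ntpd_state" = active ]; then
        echo "ntpd"
    elif [ "$chronyd_state" = active ]; then
        echo "chronyd"
    else
        echo "none"
    fi
}

# Example use on a host (commented out so the block has no side effects):
# classify_time_daemons "$(systemctl is-active ntpd)" "$(systemctl is-active chronyd)"
# systemctl is-enabled ntpd chronyd   # what will start at the next boot
# rpm -q ntp chrony                   # what is installed (RPM-based systems)
```

A result of `chronyd` while ntpd is still enabled at boot would match the situation in this thread: the clock is synced, but an ntpd-first check may still report a failure.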

I would check, find ntpd dead, see no problems reported on Host Monitor, wonder why the hell ntpd died, kill ntpd, run ntpdate, restart ntpdate, restart scm-agent, and that would "fix" it, but on reboot it would go back to using chrony and exit ntpd, and host monitor would report failure to query ntp service, even though the machine was using chrony and synced just fine all along.

I appreciate your help!
