
Kudu service goes down frequently

Rising Star

Hi, Kudu is crashing frequently with the error "Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error reading clock. Clock considered unsynchronized". However, I am not seeing any clock offset error in the cluster. The Kudu documentation says this could be caused by network delay between the NTP server and the Kudu host, but Kudu shares its host with a database, and the DataNode is not reporting a clock offset error. What could be the reason?

22 REPLIES

Rising Star

HDFS datanodes don't require clock synchronization in the way that Kudu does.

 

Is NTP running on these nodes? What is the output of the 'ntptime' command? Are these nodes running on physical hardware, or something else?

Super Collaborator

I have the same issue and haven't found a solution for it.

For now I have added a cron job that restarts the ntpd service on all the servers every hour.

This issue prevents me from taking Kudu to production, as it doesn't make sense to restart ntpd on 50 nodes each time.

Rising Star
@adar - Yes, DataNodes do not require NTP, but if NTP is out of sync on these DataNodes, CM will report a clock offset.

NTP is running on the DataNodes.

[root@wuwcw0hd3dn01 hadoop-hdfs]# ntptime
ntp_gettime() returns code 5 (ERROR)
time dce74398.988fc000 Sun, Jun 11 2017 0:20:40.595, (.595943),
maximum error 16000000 us, estimated error 16 us, TAI offset 0
ntp_adjtime() returns code 5 (ERROR)
modes 0x0 (),
offset 0.000 us, frequency 0.000 ppm, interval 1 s,
maximum error 16000000 us, estimated error 16 us,
status 0x4041 (PLL,UNSYNC,MODE),
time constant 7, precision 1.000 us, tolerance 500 ppm,
[root@wuwcw0hd3dn01 hadoop-hdfs]#

These are all physical servers.
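For anyone who wants to check many hosts at once, here is a rough sketch of how the ntptime output above could be screened. It is only an illustration: the function name check_ntptime is made up, and the 10-second limit is the default that the Kudu troubleshooting page gives for --max_clock_sync_error_usec. Whether Kudu looks at exactly these two fields is an assumption on my part.

```python
import re

# Default for --max_clock_sync_error_usec per the Kudu troubleshooting page (10 s).
DEFAULT_MAX_ERROR_USEC = 10_000_000


def check_ntptime(output, max_error_usec=DEFAULT_MAX_ERROR_USEC):
    """Return a list of problems found in `ntptime` output (empty if none).

    Hypothetical helper: it only inspects the status flags and the
    "maximum error" figure printed by ntptime.
    """
    problems = []

    # Kernel status flags look like: status 0x4041 (PLL,UNSYNC,MODE),
    status = re.search(r"status\s+0x[0-9a-fA-F]+\s+\(([^)]*)\)", output)
    flags = status.group(1).split(",") if status else []
    if "UNSYNC" in flags:
        problems.append("kernel reports UNSYNC")

    # "maximum error N us" is printed twice; use the largest value seen.
    errors = [int(m) for m in re.findall(r"maximum error\s+(\d+)\s+us", output)]
    if errors and max(errors) > max_error_usec:
        problems.append(
            f"maximum error {max(errors)} us exceeds {max_error_usec} us"
        )
    return problems
```

Fed the output above (status 0x4041 with UNSYNC, maximum error 16000000 us), this reports both problems. A host whose status has no UNSYNC flag and whose maximum error is under the limit returns an empty list.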

Rising Star

You can avoid the dependency on ntpd by running Kudu with --use-hybrid-clock=false, but that has a serious effect on transactional consistency so it's not something we recommend. Instead, I'd focus your efforts on figuring out why your servers' time isn't synchronized. It may have to do with your ntp configuration.

 

Unfortunately I don't know how ntp works; perhaps you can search across past forum posts? If you do manage to fix this, please post your findings here; if it's a general purpose fix (i.e. not particular to your site configuration), we'll include it in the Kudu documentation.

 

Super Collaborator

@MSharma Did you find a solution for this?

I'm still stuck with it.

Rising Star
Not yet, but restarting the ntp service causes more trouble, so I have set --use-hybrid-clock=false.
Most likely it is network delay between the NTP server and the Kudu server that is causing this.
I am still troubleshooting this problem and will update here if we find anything that resolves it.

New Contributor
I'm having the same trouble. After I restart the ntpd service, the Kudu service restarts successfully, but before long it goes back to a failed state. The Kudu troubleshooting page (https://kudu.apache.org/docs/troubleshooting.html) has a tip for this problem. It says: "For the master and tablet server daemons, the server's clock must be synchronized using NTP. In addition, the maximum clock error (not to be mistaken with the estimated error) be below a configurable threshold. The default value is 10 seconds, but it can be set with the flag --max_clock_sync_error_usec." So the page provides a solution, but I don't know how or where to set the --max_clock_sync_error_usec parameter. Thanks.

Rising Star
Kudu --> Configuration --> "Kudu Service Advanced Configuration Snippet (Safety Valve) for gflagfile"

New Contributor
Thank you very much. But I still have a problem: there is no text field for me to edit, only a check box. How should I add the parameter? These are the entries I see:
Suppress Parameter Validation: Kudu Service Environment Advanced Configuration Snippet (Safety Valve), Kudu (Service-Wide)
Suppress Parameter Validation: Kudu Service Advanced Configuration Snippet (Safety Valve) for gflagfile, Kudu (Service-Wide)
Suppress Parameter Validation: Kudu Service Advanced Configuration Snippet (Safety Valve) for kudu-monitoring.properties, Kudu (Service-Wide)
Suppress Parameter Validation: Service Monitor Derived Configs Advanced Configuration Snippet (Safety Valve), Kudu (Service-Wide)

Rising Star
You need to be logged in as an administrator to add these values.

Rising Star

I was able to resolve it.

If ntptime shows an error like the one below, the Kudu service will most likely go down, so you have to restart ntpd, and then this error goes away.

[root@wcw0hd3dn02 ~]# ntptime
ntp_gettime() returns code 5 (ERROR)
  time dce7466c.fc37b000  Sun, Jun 11 2017  0:32:44.985, (.985225),
  maximum error 16000000 us, estimated error 16 us, TAI offset 0
ntp_adjtime() returns code 5 (ERROR)
  modes 0x0 (),
  offset 0.000 us, frequency 0.000 ppm, interval 1 s,
  maximum error 16000000 us, estimated error 16 us,
  status 0x4041 (PLL,UNSYNC,MODE),
  time constant 7, precision 1.000 us, tolerance 500 ppm,

This error occurs if you run ntpd with the -x option:

 

[root@wuwcw0hd3mn01 ~]# ps -ef|grep ntp
root 3183 2731 0 10:38 pts/0 00:00:00 grep ntp
ntp 20736 1 0 Jun13 ? 00:00:19 ntpd -x -u ntp:ntp -p /var/run/ntpd.pid -g

 

Remove -x from the file below and restart ntpd:

[root@wuwcw0hd3mn01 ~]# more /etc/sysconfig/ntpd
# Drop root to id 'ntp:ntp' by default.
OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid -g"
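If you have to make this change on many hosts, something like the following sketch could do the edit on the OPTIONS line. The function name remove_slew_flag is hypothetical, and it assumes the line has exactly the OPTIONS="..." shape shown above with -x as a standalone flag; it would not catch -x merged into another flag group.

```python
import shlex


def remove_slew_flag(line):
    """Return an OPTIONS="..." line with a standalone -x flag removed.

    Hypothetical helper; `line` is expected without a trailing newline.
    Lines that are not of the form OPTIONS="..." are returned unchanged.
    """
    prefix = 'OPTIONS="'
    stripped = line.rstrip()
    if not (stripped.startswith(prefix) and stripped.endswith('"')
            and len(stripped) > len(prefix)):
        return line  # comment or unrelated setting: leave it alone

    body = stripped[len(prefix):-1]
    # Split like a shell would, then drop only the exact token "-x".
    kept = [tok for tok in shlex.split(body) if tok != "-x"]
    return prefix + " ".join(kept) + '"'
```

Applied to the OPTIONS line above, this yields OPTIONS="-u ntp:ntp -p /var/run/ntpd.pid -g". ntpd still has to be restarted afterwards for the change to take effect.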

 

Wait for ntpd to synchronize. After that, I haven't seen any issue with the Kudu service so far.

 

ntp_gettime() returns code 0 (OK)
time dcf260c3.66c6abfc Mon, Jun 19 2017 10:40:03.401, (.401469911),
maximum error 394157 us, estimated error 345 us, TAI offset 0
ntp_adjtime() returns code 0 (OK)
modes 0x0 (),
offset -707.277 us, frequency 20.094 ppm, interval 1 s,
maximum error 394157 us, estimated error 345 us,
status 0x6001 (PLL,NANO,MODE),
time constant 10, precision 0.001 us, tolerance 500 ppm,

 

 

See also: https://access.redhat.com/solutions/38542

 

 

Expert Contributor

Thank you @MSharma for updating us on your findings!

 

That is very interesting. It sounds like you enabled NTP stepping (which -x disables) whereas before it could only use slewing. Apparently stepping has kept you from falling too far out of sync from the time source for Kudu to tolerate.

 

I just checked and on one of the more long-lived and stable test environments I periodically use (note: it is NOT a production system) where I have run many different versions of Kudu over the years I do not have -x set in OPTIONS. On that machine (running CentOS 6.6) there is only the following in /etc/sysconfig/ntpd:

 

# Drop root to id 'ntp:ntp' by default.
OPTIONS="-u ntp:ntp -p /var/run/ntpd.pid -g"

 

It may be worth noting that this machine also has the following set in /etc/ntp.conf:

 

# tinker panic 0 instructs NTP not to give up if it sees a large jump in time.
# This is important for coping with large time drifts and also resuming virtual
# machines from their suspended state.
tinker panic 0

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Drift file. Put this in a directory which the daemon can write to.
# No symbolic links allowed, either, since the daemon updates the file
# by creating a temporary in the same directory and then rename()'ing
# it to the file.
driftfile /var/lib/ntp/drift

I can't tell you whether or not this is an ideal NTP configuration, or whether it is fully correct, but it seems stable.

 

For those who want more insight into what -x and slewing mean, I'd recommend looking at the ntpd(8) man page and searching for the keyword "slew": https://linux.die.net/man/8/ntpd

 

For those having problems with NTP stability in general, also consider reading through the "NTP Debugging Techniques" section of the Official NTP Documentation: http://doc.ntp.org/4.2.6p5/debug.html

New Contributor
I have been fighting the same problem. MSharma's instructions worked for me as well. Still not really clear on what it does (will read the slewing link provided by mpercy), but it at least works.

New Contributor
Just a quick follow-up on my note, above.

It seems to me that Kudu has an incompatibility with NTP slew. I have spent a solid day testing various scenarios, and if slew is on, Kudu won't start (or will eventually crash).

This does not seem to be related to time, as the time does not appear to creep away from what is expected.

Just that little "-x" on the ntpd command (either on the command line or in the /etc/sysconfig/ntpd file) makes the difference. With it, Kudu won't start, or will crash. Without it, all is fine.

I am using Kudu (parcel) with CDH 5.10.1.

For now, I will just continue without slew. I am posting this in the hope of helping others, and I'd welcome follow-on comments about other solutions. Certainly, I'd be interested in anyone with a working Kudu trying to turn on slew and restart ntpd. Does Kudu then start correctly?

Thanks!!!!

Expert Contributor

Any chance some of you are running on Azure? It has known issues with ntp: https://social.msdn.microsoft.com/Forums/azure/en-US/8c0a1026-0b02-405a-848e-628e68229eaf/i-have-a-l...

New Contributor
Not running Azure.

Expert Contributor

Hi folks,

 

I spent some time looking into this and agree that running ntpd with the '-x' option will make Kudu crash (likely after 8 hours and 53 minutes based on my math). I wrote some details here:

 

https://issues.apache.org/jira/browse/KUDU-2079
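For readers wondering where a figure like 8 hours and 53 minutes could come from, here is one possible reconstruction using only numbers from the ntptime output earlier in this thread (maximum error 16000000 us, tolerance 500 ppm). It assumes that the kernel's maximum error grows at the tolerance rate while nothing resets it, and that the clock is flagged as unsynchronized when that bound reaches 16 seconds. Both are assumptions for the sake of the arithmetic, and the helper name is made up.

```python
from datetime import timedelta


def time_to_reach(max_error_usec, growth_ppm):
    """Time for an error bound growing at `growth_ppm` to reach `max_error_usec`.

    1 ppm means 1 microsecond of added error per second of elapsed time,
    so the elapsed seconds are simply the limit divided by the rate.
    """
    seconds = max_error_usec / growth_ppm
    return timedelta(seconds=seconds)


# 16,000,000 us at 500 us per second is 32,000 seconds.
print(time_to_reach(16_000_000, 500))  # 8:53:20
```

Under those assumptions, 32,000 seconds works out to 8 hours, 53 minutes, and 20 seconds.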

 

-Todd

Super Collaborator

Does that mean I need to break all my production servers and give them special treatment by enabling stepping just to get Kudu working?

It seems we are still far from taking it to production; I'm only using it in our test environment.

Expert Contributor
You could use 'tinker step 500' and have the effect that stepping would
only be enabled for time differences more than 500ms. I wouldn't consider
this breaking your production environment, but I guess you may have some
reason that '-x' is important to you.

We'll work on addressing this in a future release so that no system-wide
changes are necessary.

-Todd

Expert Contributor
I would strongly recommend NOT running with hybrid time turned off for one
simple reason: tablet history GC will not work. Therefore when you delete
or update a row the history of that data will be kept forever. Eventually
you may run out of disk space. The one exception is if you drop a table,
then the data for that table will be permanently removed regardless of
hybrid time.