Support Questions

damon.jones · ‎08-24-2015

Cloudera Community,

I have two Cloudera Manager (CM) managed clusters that I upgraded from 5.3.3, one to 5.4.1 and the other to 5.4.3. On both clusters I get random "Clock Offset Bad" messages throughout the day. I checked the times on all nodes and they are all in sync. I also checked the NTP configuration and it appears to be accurate. I did not have these issues when running 5.3.3. How does the CM agent check for a bad clock offset? Did something change between 5.3.3 and 5.4.x?

I'm running:

Cluster 1

OS: Redhat Enterprise 6.6

CDH: 5.4.1

CM: 5.4.1

Cluster 2

OS: Redhat Enterprise 6.6

CDH: 5.4.3

CM: 5.4.3

Thanks.

Kenny Rice · ‎08-28-2015

We get the same problem as well; it's extremely annoying.

bgooley · ‎08-28-2015

Hello. The NTP clock offset health check is executed by the agent running on each node in your cluster. The command executed is:

ntpdc -np

A timeout of 2 seconds is used, so if the ntp client does not return in 2 seconds, the health check will fail.
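As a rough illustration of that timeout behaviour (not the agent's actual code), here is a minimal Python sketch that runs a command and treats a missing binary, a non-zero exit, or a run longer than 2 seconds as a failed check. The function name run_with_timeout is hypothetical; only the command and the 2-second limit come from the description above.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=2.0):
    """Return the command's stdout, or None if it fails or exceeds timeout_s.

    Hypothetical helper: it only mimics the "no answer within 2 seconds
    means the check fails" behaviour described in this thread.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout/stderr instead of printing
            text=True,            # decode bytes to str
            timeout=timeout_s,    # raises TimeoutExpired when exceeded
            check=True,           # raises CalledProcessError on non-zero exit
        )
    except (subprocess.TimeoutExpired,
            subprocess.CalledProcessError,
            FileNotFoundError):
        return None
    return result.stdout
```

Calling run_with_timeout(["ntpdc", "-np"]) on a node and getting None back would suggest the NTP client is slow or unreachable, which by itself would be enough to fail the check.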

If there is a result, the agent script parses the output text and returns a metric that includes the clock offset. This metric is sent to the Host Monitor management service for processing.

You have 2 options here:

If you are convinced there are no problems, you can turn off the Cloudera Manager Server Clock Offset Thresholds health check or adjust it as necessary in the Cloudera Manager management services.

Or, if you wish to troubleshoot, check the /var/log/cloudera-scm-agent/cloudera-scm-agent.log file for clues.

Search in that file for "ntpdc". If there are any errors running the command, a stack trace will be provided.

The agent merely parses the ntpdc output, so assuming your output looks something like this:

ntpdc -np
     remote           local      st poll reach  delay   offset    disp
=======================================================================
*132.163.4.101   10.17.81.194     1 1024  377 0.02972  0.001681 0.13664
=198.55.111.5    10.17.81.194     2 1024  377 0.01395  0.002177 0.13667
=50.116.55.65    10.17.81.194     2 1024  377 0.07263  0.001220 0.12172

The script will look for a line that starts with an "*" character. So, in our example:

*132.163.4.101 10.17.81.194 1 1024 377 0.02972 0.001681 0.13664

Then, it will get the 'offset' column.

This value is returned to the Host Monitor, which pulls the metric and filters it through your health check configuration to decide whether it warrants an alert.
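The parsing step described above can be sketched in a few lines of Python. This is an approximation based only on the description in this post, not the agent's real script; the function name sync_peer_offset is hypothetical, and returning None when no "*" line exists is an assumption about how a missing sync peer might be represented.

```python
def sync_peer_offset(ntpdc_output):
    """Return the offset (in seconds) of the peer marked with '*'.

    Returns None when no line starts with '*', i.e. ntpd has not
    selected a synchronization peer (assumed representation).
    """
    for line in ntpdc_output.splitlines():
        line = line.strip()
        if not line.startswith("*"):
            continue
        fields = line.split()
        # Columns: remote local st poll reach delay offset disp
        if len(fields) >= 8:
            return float(fields[6])
    return None
```

Fed the example output above, this returns 0.001681. Fed a listing in which every peer line starts with "=", it returns None, so there is no offset to report even though the peers are reachable.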

Lastly, I'm not aware that anything has changed in the offset health check between CM 5.3 and 5.4, so I would recommend troubleshooting this to figure out why the clock is offset. Timing is important in Hadoop, so it is worth a look.

Regards,

Ben

damon.jones · ‎08-28-2015

I think you may have found my issue. The output from running "ntpdc -np" does not have an entry with an asterisk. Do you know what the asterisk means?
Thanks.
$ ntpdc -np
remote local st poll reach delay offset disp
=======================================================================
=192.168.1.44 192.168.1.28 4 1024 377 0.00110 0.009895 0.18616
=192.168.1.43 192.168.1.28 4 1024 377 0.00099 0.004189 0.18646

mertez · ‎10-21-2015

Has anybody found a solution for this? I am running the VM on ESXi.

In my case, if I run 'ntpdc -np' after a bad alert, I get the following values:

     remote           local      st poll reach  delay   offset    disp
===================================================
=193.2.4.2       192.168.2.251    2   64  177 0.00102 -46.89972 0.25188
=193.2.120.3     192.168.2.251    2   64  177 0.00383 -46.90043 0.25189
=109.127.214.126 192.168.2.251    2   64  177 0.00117 -46.89955 0.25185

After 'service ntpd restart', the offset from the hardware clock seems to be fixed:

     remote           local      st poll reach  delay   offset    disp
=======================================================================
=89.212.75.6     192.168.2.251   16   64    0 0.00000  0.000000 4.00000
*109.127.214.126 192.168.2.251    2   64    1 0.00117  0.000123 2.81735
=84.255.235.43   192.168.2.251   16   64    0 0.00000  0.000000 4.00000

But after a few minutes the offsets drift away again. Any ideas?

cchahadoop · ‎11-04-2015

The asterisk (*) basically means that the NTP daemon is in sync with that particular remote server (it is getting its time from this server).

Were you able to move past this issue? I'm facing a similar issue where the clocks intermittently go out of sync. I checked with my network team for any firewall changes, but there haven't been any.

mertez · ‎11-05-2015

Unfortunately, the problem in my case still persists.

damon.jones · ‎11-05-2015

Yes, this helped me to better understand the issue but in the end we had a bad NTP server. I appreciate the help.

jarrett-261720071 · ‎04-09-2016

What exactly does the "=" mean then? I am not seeing anything in the CM agent logs or syslog. From what I read at https://www.eecis.udel.edu/~mills/ntp/html/ntpdc.html:

"a = means the remote server is being polled in client mode". So does that not mean we should look at those lines as well? My entire environment is alerting on this now, all of a sudden, and I have not yet found the cause. My environment is running Ubuntu 12.04 and some 14.04, and ntpdate works with the server I have configured.
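For reference, the leading character in the ntpdc peer listing describes the mode of each peer entry, and only "*" marks the peer the daemon is currently synchronizing to. The sketch below lists those characters as the ntpdc documentation linked above describes them; the names PEER_MODE and describe_peers are hypothetical, and the code is only a reading aid, not part of Cloudera Manager.

```python
# Leading characters in `ntpdc -np` peer lines, per the ntpdc documentation.
PEER_MODE = {
    "+": "symmetric active",
    "-": "symmetric passive",
    "=": "remote server polled in client mode",
    "^": "server is broadcasting to this address",
    "~": "remote peer is sending broadcasts",
    "*": "peer currently selected for synchronization",
}


def describe_peers(ntpdc_output):
    """Return (remote_address, mode_description) for each peer line."""
    peers = []
    for line in ntpdc_output.splitlines():
        line = line.strip()
        fields = line.split()
        # Peer lines have 8 columns; this also skips the "====" separator,
        # which starts with "=" but is a single field.
        if len(fields) >= 8 and line[0] in PEER_MODE:
            peers.append((fields[0][1:], PEER_MODE[line[0]]))
    return peers
```

A listing that contains only "=" entries therefore shows peers that are configured and polled, but none that ntpd has selected as its time source, which matches the case where the check finds no "*" line.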

zuoseven · ‎07-01-2019

Hi, did you fix this problem? I have the same issue.

Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server