Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

Explorer

Cloudera Community,

 

I have two different Cloudera Manager (CM) managed clusters that I upgraded from 5.3.3 (to 5.4.1 and 5.4.3, respectively). On both clusters I get random "Clock Offset Bad" messages throughout the day. I checked the time on all nodes and they are all in sync. I also checked the NTP configuration and it appears to be correct. I did not have these issues when running 5.3.3. How does the CM agent check for a bad clock offset? Did something change between 5.3.3 and 5.4.x?

 

I'm running:

 

Cluster 1

OS: Redhat Enterprise 6.6

CDH: 5.4.1

CM:  5.4.1

 

Cluster 2

OS: Redhat Enterprise 6.6

CDH: 5.4.3

CM: 5.4.3

 

Thanks.

17 REPLIES

New Contributor

We get the same problem as well; it's extremely annoying.

Master Guru

Hello. The NTP clock offset health check is executed by the agent running on each node in your cluster. The command executed is:

ntpdc -np

 
A timeout of 2 seconds is used, so if the NTP client does not return within 2 seconds, the health check will fail.
If there is a result, the agent script parses the result text and returns a metric that includes the clock offset. This metric is sent to the Host Monitor Management Service for processing.
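
As a rough illustration of the timeout behavior described above, here is a minimal Python sketch. It is not the agent's actual code; the function name `run_with_timeout` and its return convention are assumptions made for this example. It runs a command, gives up after the timeout, and returns nothing when the command times out, cannot be started, or exits with an error.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds=2.0):
    """Run cmd and return its stdout as text, or None on timeout or failure.

    Hypothetical helper for illustration; the real agent may differ.
    """
    try:
        completed = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return None  # command did not answer in time
    except OSError:
        return None  # command missing or not executable
    if completed.returncode != 0:
        return None
    return completed.stdout
```

On a node you could call `run_with_timeout(["ntpdc", "-np"])` a few times in a row. An occasional `None` would suggest that the NTP daemon is slow to answer, which, per the description above, would be enough to fail the check even when the clock itself is fine.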
 
You have 2 options here:  
 
If you are convinced there are no problems, you can turn off the Cloudera Manager Server Clock Offset Thresholds health check or adjust it as necessary in the Cloudera Manager management services.
 
Or, if you wish to troubleshoot, check the /var/log/cloudera-scm-agent/cloudera-scm-agent.log file for clues.
Search in that file for "ntpdc".  If there are any errors running the command, a stack trace will be provided.
 
The agent merely parses the ntpdc output, so assuming your output looks something like this:
 
ntpdc -np
     remote           local      st poll reach  delay   offset    disp
=======================================================================
*132.163.4.101   10.17.81.194     1 1024  377 0.02972  0.001681 0.13664
=198.55.111.5    10.17.81.194     2 1024  377 0.01395  0.002177 0.13667
=50.116.55.65    10.17.81.194     2 1024  377 0.07263  0.001220 0.12172
 
The script will look for a line that starts with an "*" character.  So, in our example:
 
*132.163.4.101   10.17.81.194     1 1024  377 0.02972  0.001681 0.13664
 
Then, it will get the 'offset' column.
This value is returned to the Host Monitor, which will pull the metric and filter it through your health check configuration to decide if it warrants an alert.
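
The parsing step can be sketched in Python as follows. This is only an approximation of the logic described above, not the agent's source; the function name `sync_peer_offset` and the assumption that the offset is always the seventh whitespace-separated field are mine.

```python
def sync_peer_offset(ntpdc_output):
    """Return the offset in seconds of the peer marked '*', or None if absent."""
    for line in ntpdc_output.splitlines():
        stripped = line.strip()
        if not stripped.startswith("*"):
            continue  # only the current sync peer is considered
        fields = stripped.split()
        # Columns: remote local st poll reach delay offset disp
        if len(fields) < 8:
            continue
        try:
            return float(fields[6])
        except ValueError:
            continue
    return None


# Sample output copied from the example in this post.
SAMPLE_WITH_SYNC_PEER = """\
     remote           local      st poll reach  delay   offset    disp
=======================================================================
*132.163.4.101   10.17.81.194     1 1024  377 0.02972  0.001681 0.13664
=198.55.111.5    10.17.81.194     2 1024  377 0.01395  0.002177 0.13667
=50.116.55.65    10.17.81.194     2 1024  377 0.07263  0.001220 0.12172
"""

print(sync_peer_offset(SAMPLE_WITH_SYNC_PEER))  # prints 0.001681
```

If no line starts with "*", this sketch returns `None`, meaning there is no offset to report for that run.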
 
Lastly, I'm not aware that anything has changed in the offset health check between CM 5.3 and 5.4, so I would recommend troubleshooting this to figure out why the clock is reported as offset. Timing is important in Hadoop, so it is worth a look.
 
Regards,
 
Ben

Explorer
I think you may have found my issue. The output from running "ntpdc -np" does not have an entry with an asterisk. Do you know what the asterisk means?
Thanks.
$ ntpdc -np
     remote           local      st poll reach  delay   offset    disp
=======================================================================
=192.168.1.44    192.168.1.28     4 1024  377 0.00110  0.009895 0.18616
=192.168.1.43    192.168.1.28     4 1024  377 0.00099  0.004189 0.18646


Explorer

Has anybody found a solution for this? I am running the VM on ESXi.


In my case, if I run 'ntpdc -np' after a bad alert, I get the following values:


     remote           local      st poll reach  delay   offset    disp
=======================================================================
=193.2.4.2       192.168.2.251    2   64  177 0.00102 -46.89972 0.25188
=193.2.120.3     192.168.2.251    2   64  177 0.00383 -46.90043 0.25189
=109.127.214.126 192.168.2.251    2   64  177 0.00117 -46.89955 0.25185





After 'service ntpd restart', the offsets from the hardware clock seem to be fixed:

     remote           local      st poll reach  delay   offset    disp
=======================================================================
=89.212.75.6     192.168.2.251   16   64    0 0.00000  0.000000 4.00000
*109.127.214.126 192.168.2.251    2   64    1 0.00117  0.000123 2.81735
=84.255.235.43   192.168.2.251   16   64    0 0.00000  0.000000 4.00000




But after a few minutes the offsets drift away again. Any ideas?
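
To compare listings like the two above, a small Python sketch can list every peer row with its mode marker and offset, so that "no sync peer" and "large offset" show up side by side. The function name `peer_offsets` is hypothetical, and the sketch assumes every peer row starts with one of the ntpdc mode characters (`*`, `+`, `-`, `=`, `^`, `~`) and has the eight columns shown in the header.

```python
MODE_MARKERS = "*+-=^~"  # leading characters used by the ntpdc peer listing


def peer_offsets(ntpdc_output):
    """Return a list of (marker, remote, offset_seconds), one per peer row."""
    peers = []
    for line in ntpdc_output.splitlines():
        fields = line.split()
        if len(fields) != 8 or fields[0][0] not in MODE_MARKERS:
            continue  # skips the header and the separator line
        try:
            offset = float(fields[6])
        except ValueError:
            continue
        peers.append((fields[0][0], fields[0][1:], offset))
    return peers


# First listing from this post, taken after a bad alert.
AFTER_ALERT = """\
     remote           local      st poll reach  delay   offset    disp
=======================================================================
=193.2.4.2       192.168.2.251    2   64  177 0.00102 -46.89972 0.25188
=193.2.120.3     192.168.2.251    2   64  177 0.00383 -46.90043 0.25189
=109.127.214.126 192.168.2.251    2   64  177 0.00117 -46.89955 0.25185
"""

peers = peer_offsets(AFTER_ALERT)
has_sync_peer = any(marker == "*" for marker, _, _ in peers)
worst = max(abs(offset) for _, _, offset in peers)
print(has_sync_peer, worst)  # prints False 46.90043
```

For the first listing this shows no sync peer and all three servers agreeing on an offset of roughly 47 seconds, which suggests the local clock is the one that moved rather than any single upstream server.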

Rising Star
The asterisk (*) means that the NTP daemon is in sync with that particular remote server (it is getting its time from this server).

Were you able to move past this issue? I'm facing a similar issue where the clocks intermittently go out of sync. I checked with my network team for any firewall changes, but there haven't been any.

Explorer
Unfortunately, the problem in my case still persists.

Explorer
Yes, this helped me to better understand the issue, but in the end we had a bad NTP server. I appreciate the help.



New Contributor

What exactly does the "=" mean, then? I am not seeing anything in the CM agent logs or syslog. From what I read at https://www.eecis.udel.edu/~mills/ntp/html/ntpdc.html:

"a = means the remote server is being polled in client mode".

So does that not mean we should look for those lines as well? My entire environment is alerting on this now, all of a sudden, and I have not yet found the cause. My environment is running Ubuntu 12.04 and some 14.04, and ntpdate works with the server I have configured.

 

 

Explorer

Hi, did you fix this problem? I have the same issue too.