Reply
New Contributor
Posts: 4
Registered: ‎02-12-2015

Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

Cloudera Community,

 

I have two different Cloudera Manager(CM) managed clusters that I upgraded from 5.3.3 to (5.4.1 and 5.4.3). On both clusters I get random "Clock Offset Bad" messages thoughout the day. I checked the times on all nodes and they are all in sync. I also checked the NTP configuration and it appears to be accurate. I did not have these issues when running 5.3.3. I was wondering how the CM agent checks for bad offset clock? Is there something that changed between 5.3.3 and 5.4.x?

 

I'm running:

 

Cluster 1

OS: Redhat Enterprise 6.6

CDH: 5.4.1

CM:  5.4.1

 

Cluster 2

OS: Redhat Enterprise 6.6

CDH: 5.4.3

CM: 5.4.3

 

Thanks.

New Contributor
Posts: 1
Registered: ‎08-06-2015

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

we get the same problem as well, its extremely annoying.  

Posts: 1,108
Topics: 1
Kudos: 285
Solutions: 134
Registered: ‎04-22-2014

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

Hello.  The clock NTP health check is executed by each agent running on nodes on your cluster.  The command executed is:

ntpdc -np

 
A timeout of 2 seconds is used, so if the ntp client does not return in 2 seconds, the health check will fail.
If there is a result, then the agent script will parse the result text and return a result metric that includes the clock offset.  this will be sent to the Host Monitor Management Service for processing.
 
You have 2 options here:  
 
If you are convinced there are no problems, you can turn off the Cloudera Manager Server Clock Offset Thresholds health check or adjust it as necessary in the Cloudera Manager management services.
 
Or, if you wish to troubleshoot, check the /var/log/cloudera-scm-agent/cloudera-scm-agent.log file for clues.
Search in that file for "ntpdc".  If there are any errors running the command, a stack trace will be provided.
 
The agent merely parses the ntpdc output, so assuming your output looks something like this:
 
ntpdc -np
     remote           local      st poll reach  delay   offset    disp
=======================================================================
*132.163.4.101   10.17.81.194     1 1024  377 0.02972  0.001681 0.13664
=198.55.111.5    10.17.81.194     2 1024  377 0.01395  0.002177 0.13667
=50.116.55.65    10.17.81.194     2 1024  377 0.07263  0.001220 0.12172
 
The script will look for a line that starts with an "*" character.  So, in our example:
 
*132.163.4.101   10.17.81.194     1 1024  377 0.02972  0.001681 0.13664
 
Then, it will get the 'offset' column.
This value is returned to the Host Monitor which, will pull the metric and filter it through your health check configuration to decide if it warrants an alert.
 
Lastly, I'm not aware that anything has changed in the offset health check between CM 5.3 and 5.4, so I would recommend troubleshooting this to try to figure out why clock is offset.  Timing is important in hadoop, so it is worth a look.
 
Regards,
 
Ben
New Contributor
Posts: 4
Registered: ‎02-12-2015

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

I think you may have found my issue. The output from running "ntpdc -np" does not have an entry with an asterisk. Do you know what the asterisk means?
Thanks.
$ ntpdc -np
remote local st poll reach delay offset disp
=======================================================================
=192.168.1.44 192.168.1.28 4 1024 377 0.00110 0.009895 0.18616
=192.168.1.43 192.168.1.28 4 1024 377 0.00099 0.004189 0.18646


New Contributor
Posts: 3
Registered: ‎10-21-2015

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

Has anybody found a solution for this? I am running the VM on ESXi.


In my case ff I run 'ntpdc -np' after a bad allert I get following values:


     remote           local      st poll reach  delay   offset    disp
===================================================
=193.2.4.2       192.168.2.251    2   64  177 0.00102 -46.89972 0.25188
=193.2.120.3     192.168.2.251    2   64  177 0.00383 -46.90043 0.25189
=109.127.214.126 192.168.2.251    2   64  177 0.00117 -46.89955 0.25185





After 'service ntpd restart' offsets with a harwdare clock seems to be fixed:

     remote           local      st poll reach  delay   offset    disp
=======================================================================
=89.212.75.6     192.168.2.251   16   64    0 0.00000  0.000000 4.00000
*109.127.214.126 192.168.2.251    2   64    1 0.00117  0.000123 2.81735
=84.255.235.43   192.168.2.251   16   64    0 0.00000  0.000000 4.00000




But after few minutes offsets drifts away. Any ideas?

Contributor
Posts: 34
Registered: ‎09-03-2015

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

The asterisk (*) basically means that the NTP Daemon is in sync with the particular Remote Server (getting it's time from this server)

Were you able to move past this issue? I'm facing a similar issue where the clocks randomly go out of sync on/off. I checked with my Network team for any firewall changes but there haven't been any.
New Contributor
Posts: 3
Registered: ‎10-21-2015

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

Unfortunately the problem in my case still persist.
New Contributor
Posts: 4
Registered: ‎02-12-2015

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

Yes, this helped me to better understand the issue but in the end we had a bad NTP server. I appreciate the help.



Contributor
Posts: 26
Registered: ‎05-12-2015

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

[ Edited ]

Hi,

 

I had same issue on one of the 10 node cluster built using VMware hypervisor, so all 10 VM's running on hypervisor.

 

I actually did not face clock offset issue when I built cloudera cluster on a physical machines without using hypervisor. I feel there is latency issue with time sync of 3 minutes while on hypervisor.

 

Please let us know if it is recommended to have hadoop cluster over hypervisor or not.

New Contributor
Posts: 1
Registered: ‎07-29-2014

Re: Cloudera 5.4.x cluster randomly reports "Clock Offset Bad" with working NTP Server

what is the "=" exactly mean then?  I am not seeing anything in th cm agent logs or syslog.  from what i read from https://www.eecis.udel.edu/~mills/ntp/html/ntpdc.html

 

"a = means the remote server is being polled in client mode".  So does not mean we should look for those lines as well?  My entire environment is alerting on this now, all of a sudden and I have not found any issue yet as to the cause.  My environment is running Ubuntu 12.04 and some 14.04 and ntpdate works with the configured server i have set.