New Contributor
Posts: 2
Registered: ‎01-30-2018

Random Kudu downtime on production cloudera cluster.

We are running Cloudera Manager:

Version: Cloudera Express 5.13.0 (#55 built by jenkins on 20171002-1719 git: bd657e597e6743c458ee2c9aabe808b7c972981c)

 

Our Kudu cluster randomly becomes unhealthy, with one of the three Kudu nodes going down.

The error it throws on startup:

F0111 15:53:15.641548 98757 tablet_bootstrap.cc:884] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.

We are running NTP on all our nodes, and they are properly synced. The nodes run Ubuntu 14.04. Kudu becomes unhealthy at random while running, with no configuration change or other intervention on our side. A restart sometimes works but fails most of the time. We have been trying to diagnose this from the logs and from online knowledge bases, with no luck so far.
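
For anyone triaging similar crashes from the logs, here is a minimal sketch that pulls the fatal lines out of a Kudu log. It assumes the standard glog line layout (severity letter, month and day, time, thread id, source file and line, then the message), which is what the error above follows. The names `GlogRecord`, `parse_glog_line` and `fatal_records` are made up for this example and are not part of Kudu.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# glog prefix: <severity><mmdd> <hh:mm:ss.uuuuuu> <thread> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+)\s+"
    r"(?P<source>[^:\s]+):(?P<line>\d+)\]\s(?P<message>.*)$"
)


class GlogRecord(NamedTuple):
    """Hypothetical container for one parsed log line."""
    severity: str
    month: int
    day: int
    time: str
    thread: int
    source: str
    line: int
    message: str


def parse_glog_line(text: str) -> Optional[GlogRecord]:
    """Return the parsed record, or None if the line is not in glog format."""
    match = GLOG_LINE.match(text.strip())
    if match is None:
        return None
    return GlogRecord(
        severity=match["severity"],
        month=int(match["month"]),
        day=int(match["day"]),
        time=match["time"],
        thread=int(match["thread"]),
        source=match["source"],
        line=int(match["line"]),
        message=match["message"],
    )


def fatal_records(lines: Iterable[str]) -> List[GlogRecord]:
    """Keep only the FATAL ('F') records, which are the ones that abort the process."""
    records = (parse_glog_line(line) for line in lines)
    return [r for r in records if r is not None and r.severity == "F"]
```

Feeding it the lines of a tablet server log file (for example `fatal_records(open(path))`, with `path` pointing at your own log) gives the source file and message of each fatal check, which is usually the quickest thing to search the Kudu JIRA for.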

 

Expert Contributor
Posts: 130
Registered: ‎07-17-2017

Re: Random Kudu downtime on production cloudera cluster.

Hi,
What are the minimum and maximum values returned by the date command across your servers?

Cloudera Employee
Posts: 65
Registered: ‎09-28-2015

Re: Random Kudu downtime on production cloudera cluster.

Hi,

It sounds likely that you are hitting this bug:
https://issues.apache.org/jira/browse/KUDU-2209

The bug fix for this is included in CDH 5.13.1 as well as 5.14.0, so I'd
recommend upgrading at your convenience.

-Todd
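
If you want to check a list of clusters against the releases mentioned above, a small sketch along these lines may help. It treats 5.13.1 as the first fixed release and assumes every later release also carries the fix; that assumption, and the helper names `parse_version` and `has_kudu_2209_fix`, are illustrative only and not taken from the release notes.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '5.13.0' into (5, 13, 0) for comparison."""
    return tuple(int(part) for part in version.strip().split("."))


# First release stated in this thread to include the KUDU-2209 fix.
FIRST_FIXED = (5, 13, 1)


def has_kudu_2209_fix(cdh_version: str) -> bool:
    """Assumption: anything at or after 5.13.1 includes the fix."""
    return parse_version(cdh_version) >= FIRST_FIXED
```

For the version in the original post, `has_kudu_2209_fix("5.13.0")` is false, while 5.13.1 and 5.14.0 both pass.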
New Contributor
Posts: 2
Registered: ‎01-30-2018

Re: Random Kudu downtime on production cloudera cluster.

Hi Todd,

 

I see; I will test the latest version. Do you know whether the following issue is related?

 

F0123 14:20:53.231120  2879 tablet_server_main.cc:80] Check failed: _s.ok() Bad status: Corruption: Failed to load FS layout: Could not process records in container /dw/kudu/tablet/data/data/f73861e4f85a4688bdf940c3a7420e51: Data length checksum does not match: Incorrect checksum in file /dw/kudu/tablet/data/data/f73861e4f85a4688bdf940c3a7420e51.metadata at offset 902508: Checksum does not match. Expected: 0. Actual: 1214729159

 

This has also been happening to us lately: the Kudu tablet server simply crashes without logging an ERROR or any other relevant message, and when we restart it, it fails with the above error.

Cloudera Employee
Posts: 65
Registered: ‎09-28-2015

Re: Random Kudu downtime on production cloudera cluster.

Hi,

That issue has been reported before as
https://issues.apache.org/jira/browse/KUDU-1989 but we haven't been able to
reproduce it.

Would it be possible to email the .metadata file to me at todd@cloudera.com?
I can take a look and see if we can get closer to a root cause.

Regarding the crashes, you may be able to find an error message if you look
in dmesg or in the stdout/stderr files in the cloudera SCM process
directory.

-Todd
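
As a rough illustration of the dmesg check suggested above, the sketch below filters kernel log lines for the usual signs of a process being killed or crashing (OOM killer activity, segfaults). The keyword list and the default `"kudu"` process filter are assumptions chosen for this example; adjust them to whatever your kernel actually prints.

```python
import re
from typing import Iterable, List

# Phrases the Linux kernel commonly logs when a process is killed or faults.
# This list is an assumption for illustration, not an exhaustive catalogue.
CRASH_HINTS = re.compile(
    r"out of memory|oom-killer|killed process|segfault|general protection",
    re.IGNORECASE,
)


def suspicious_kernel_lines(lines: Iterable[str], process_name: str = "kudu") -> List[str]:
    """Return kernel log lines that mention a crash hint and the given process.

    Pass process_name="" to keep every crash-related line regardless of process.
    """
    wanted = process_name.lower()
    hits = []
    for line in lines:
        if CRASH_HINTS.search(line) and wanted in line.lower():
            hits.append(line.rstrip("\n"))
    return hits


# Example use on a live host (not executed here):
#   import subprocess
#   out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   for hit in suspicious_kernel_lines(out.splitlines()):
#       print(hit)
```

The same function can be pointed at the stdout/stderr files in the process directory by passing the open file object, though those files will contain Kudu's own output rather than kernel messages, so an empty `process_name` is the more useful setting there.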