Random Kudu downtime on production cloudera cluster.

New Contributor

We are running Cloudera Manager:

Version: Cloudera Express 5.13.0 (#55 built by jenkins on 20171002-1719 git: bd657e597e6743c458ee2c9aabe808b7c972981c)

 

Our Kudu cluster randomly becomes unhealthy, with one of the three Kudu nodes going down.

The error it throws on startup:

F0111 15:53:15.641548 98757 tablet_bootstrap.cc:884] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.

We are running NTP on all our nodes, and they are properly synced. The nodes run Ubuntu 14.04. Kudu keeps becoming unhealthy at random while running, with no configuration change or anything else on our side. A restart sometimes works but fails most of the time. We have been trying to figure this out from the logs and online knowledge bases, but with no luck so far.
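For anyone triaging similar crashes, here is a minimal Python sketch that pulls fatal lines out of a Kudu log. It assumes the usual glog line layout (severity letter, month and day, time, thread ID, source file and line, then the message), which matches the line quoted above. The names `parse_glog_line` and `fatal_records` are hypothetical helpers, not part of Kudu or Cloudera Manager.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed glog layout: Lmmdd hh:mm:ss.uuuuuu threadid file:line] message
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


class GlogRecord(NamedTuple):
    severity: str  # I, W, E or F
    month: int
    day: int
    time: str
    thread: int
    source: str
    line: int
    message: str


def parse_glog_line(text: str) -> Optional[GlogRecord]:
    """Return the parsed record, or None if the line is not in glog format."""
    m = GLOG_LINE.match(text.rstrip("\n"))
    if m is None:
        return None
    return GlogRecord(
        severity=m.group("severity"),
        month=int(m.group("month")),
        day=int(m.group("day")),
        time=m.group("time"),
        thread=int(m.group("thread")),
        source=m.group("source"),
        line=int(m.group("line")),
        message=m.group("message"),
    )


def fatal_records(lines: Iterable[str]) -> Iterator[GlogRecord]:
    """Yield only the fatal (F) records from an iterable of log lines."""
    for raw in lines:
        record = parse_glog_line(raw)
        if record is not None and record.severity == "F":
            yield record
```

Passing an open log file to `fatal_records` gives the source file, line number, and message of each fatal check, which is usually what a bug report needs.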

 


Re: Random Kudu downtime on production cloudera cluster.

Expert Contributor

Hi,
What are the minimum and maximum values reported by the date command across your servers?

Re: Random Kudu downtime on production cloudera cluster.

Expert Contributor
Hi,

It sounds likely that you are hitting this bug:
https://issues.apache.org/jira/browse/KUDU-2209

The bug fix for this is included in CDH 5.13.1 as well as 5.14.0, so I'd
recommend upgrading at your convenience.

-Todd

Re: Random Kudu downtime on production cloudera cluster.

New Contributor

Hi Todd,

 

I see. We will test the latest version. Do you know whether the following issue is related?

 

F0123 14:20:53.231120  2879 tablet_server_main.cc:80] Check failed: _s.ok() Bad status: Corruption: Failed to load FS layout: Could not process records in container /dw/kudu/tablet/data/data/f73861e4f85a4688bdf940c3a7420e51: Data length checksum does not match: Incorrect checksum in file /dw/kudu/tablet/data/data/f73861e4f85a4688bdf940c3a7420e51.metadata at offset 902508: Checksum does not match. Expected: 0. Actual: 1214729159

 

This has also been happening to us lately: the Kudu tablet server simply crashes without logging an ERROR or any other relevant message, and when we restart it, it fails with the above error.

Re: Random Kudu downtime on production cloudera cluster.

Expert Contributor
Hi,

That issue has been reported before as
https://issues.apache.org/jira/browse/KUDU-1989 but we haven't been able to
reproduce it.

Would it be possible to email the .metadata file to me at todd@cloudera.com?
I can take a look and see if we can get closer to a root cause.

Regarding the crashes, you may be able to find an error message if you look
in dmesg or in the stdout/stderr files in the Cloudera SCM process
directory.

-Todd
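
As a rough illustration of the dmesg suggestion above, the following Python sketch filters kernel messages for common signs that a process was killed or crashed. The pattern list is an assumption about which kernel messages are worth looking at, not an exhaustive or Kudu-specific set, and `crash_hints` and `read_dmesg` are hypothetical helper names.

```python
import re
import subprocess
from typing import Iterable, List

# Assumed set of kernel message fragments that often accompany a silent
# process death (OOM killer activity, segmentation faults).
CRASH_PATTERNS = re.compile(
    r"out of memory|oom-killer|killed process|segfault",
    re.IGNORECASE,
)


def crash_hints(lines: Iterable[str]) -> List[str]:
    """Return the lines that match any of the crash-related patterns."""
    return [line.rstrip("\n") for line in lines if CRASH_PATTERNS.search(line)]


def read_dmesg() -> List[str]:
    """Run dmesg and return its output as a list of lines.

    May require elevated privileges on systems that restrict dmesg.
    """
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for hint in crash_hints(read_dmesg()):
        print(hint)
```

If the tablet server's process name shows up in any of the matching lines, that points to the kernel rather than Kudu itself as the source of the missing error message.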