We use CDH to manage our data with Kudu and Impala. Whenever we restart Kudu, startup takes a very long time, an hour or more. Running ksck during this time gives the following (summary at the end):
==================
Errors:
==================
table consistency check error: Corruption: 51 out of 74 table(s) are bad

FAILED
Runtime error: ksck discovered errors
All the other tables are either under-replicated or unavailable. Is this normal? Any help in making startup faster would be appreciated.
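One rough way to watch recovery during startup is to pull the bad-table count out of repeated ksck runs. This is only a minimal sketch: it assumes the summary line keeps the wording shown above, and `parse_bad_tables` is a hypothetical helper, not part of Kudu.

```python
import re

# Matches the ksck summary line quoted above,
# e.g. "51 out of 74 table(s) are bad".
_SUMMARY = re.compile(r"(\d+) out of (\d+) table\(s\) are bad")


def parse_bad_tables(ksck_output):
    """Return (bad, total) from ksck text, or None if no summary line is found."""
    match = _SUMMARY.search(ksck_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "table consistency check error: Corruption: 51 out of 74 table(s) are bad"
bad, total = parse_bad_tables(sample)
print(f"{bad}/{total} tables still unhealthy")
```

Feeding it the saved output of successive ksck runs should show the bad count falling as tablets finish bootstrapping.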
Hi Razee, this startup issue is much improved in CDH 5.15; see https://www.cloudera.com/documentation/enterprise/release-notes/topics/kudu_release_notes.html#relno...
From that page:
The strategy Kudu uses for automatically healing tablets which have lost a replica due to server or disk failures has been improved. The new re-replication strategy, or replica management scheme, first adds a replacement tablet replica before evicting the failed one. With the previous replica management scheme, the system first evicts the failed replica and then adds a replacement. The new replica management scheme allows for much faster recovery of tablets in scenarios where one tablet server goes down and then returns back shortly after 5 minutes or so. The new scheme also provides substantially better overall stability on clusters with frequent server failures. See KUDU-1097 for more information.
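To make the difference between the two schemes concrete, here is a toy model, not Kudu code. It assumes a replica is declared failed after 5 minutes of downtime, that copying a tablet to a new server takes a fixed time, and that under the new scheme the original replica is kept if its server returns before the replacement is ready. The function name, the scheme labels, and the numbers are all made up for illustration.

```python
def recovery_minutes(downtime, scheme, failed_after=5, copy_time=30):
    """Toy estimate of minutes until a tablet is fully replicated again.

    downtime:     minutes the tablet server stays down
    failed_after: minutes before its replica is treated as failed (assumed)
    copy_time:    minutes to copy the tablet to a new server (assumed)
    """
    if downtime <= failed_after:
        # Server returned before its replica was declared failed.
        return downtime
    if scheme == "evict_then_add":
        # Previous scheme: the failed replica is evicted first, so a full
        # copy is needed even if the server comes back moments later.
        return failed_after + copy_time
    if scheme == "add_then_evict":
        # New scheme: the failed replica stays in place while a replacement
        # is copied, so whichever is ready first restores the tablet.
        return min(downtime, failed_after + copy_time)
    raise ValueError(f"unknown scheme: {scheme}")


# A server that returns just after the 5-minute mark.
for scheme in ("evict_then_add", "add_then_evict"):
    print(scheme, recovery_minutes(6, scheme))
```

In this model the two schemes only differ when the server returns after being declared failed but before a replacement copy would finish, which is the restart scenario the release note describes.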