kudu tablet server failed to start

New Contributor

Hi,

 

I have a problem with Kudu on CDH 5.14.3. I have 3 masters and 3 tablet servers.

After startup, one of the 3 tablet servers goes down within a few seconds. I checked the Kudu logs but did not find any trace that gives more information.

 

The only entry in the FATAL and ERROR logs was:

 

Log file created at: 2019/12/12 16:59:30
Running on machine: mymachine.example.com
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F1212 16:59:30.142372 87271 compaction.cc:778] Check failed: pv_delete_redo != nullptr
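
For anyone who wants to scan a larger tserver log for lines like the one above, here is a minimal Python sketch. It relies only on the line format stated in the log header (`[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg`); the names `GlogEntry` and `parse_glog` are made up for this illustration and are not part of Kudu or glog.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Regex for the prefix documented in the log header:
#   [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<msg>.*)$"
)


class GlogEntry(NamedTuple):
    """One parsed log line (hypothetical helper type)."""
    level: str       # one of I, W, E, F
    timestamp: str   # "mm-dd hh:mm:ss.uuuuuu" (the format carries no year)
    thread_id: int
    source: str      # "file:line"
    message: str


def parse_glog(lines: Iterable[str], levels: str = "EF") -> Iterator[GlogEntry]:
    """Yield entries whose severity is in `levels`; skip non-matching lines."""
    for raw in lines:
        m = GLOG_LINE.match(raw.rstrip("\n"))
        if m is None or m.group("level") not in levels:
            continue
        yield GlogEntry(
            level=m.group("level"),
            timestamp=f"{m.group('month')}-{m.group('day')} {m.group('time')}",
            thread_id=int(m.group("thread")),
            source=f"{m.group('file')}:{m.group('line')}",
            message=m.group("msg"),
        )
```

Passing the lines of the log file to `parse_glog` yields only the ERROR and FATAL entries by default; header lines such as "Log file created at" are skipped because they do not match the format.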

 

Any ideas? Is it possible to resynchronize this tablet server with the other two?

 

Thanks


Re: kudu tablet server failed to start

Contributor
Hi,

It seems you have hit KUDU-2233 (https://issues.apache.org/jira/browse/KUDU-2233), a nasty bug in Kudu 1.6 that is reported here: https://docs.cloudera.com/documentation/enterprise/release-notes/topics/kudu_known_issues.html#tsb-2... (please refer to the complete stack trace in the Kudu tserver stderr log).

KUDU-2233 is fixed in Kudu 1.7.0, which is available from CDH 5.15.0. You may want to consider upgrading to CDH 5.15.0 or later. Refer to the release notes here: https://docs.cloudera.com/documentation/enterprise/release-notes/topics/kudu_release_notes.html#reln...

Regards,
PDas

Re: kudu tablet server failed to start

Contributor

If a CDH upgrade is not possible immediately, setting "-log_min_segments_to_retain=2" will only prevent further corruption; it does not repair a tablet that is already corrupted.

 

To do this, go to the Kudu configuration page in Cloudera Manager and add "-log_min_segments_to_retain=2" to the Kudu Service Advanced Configuration Snippet (Safety Valve) for gflagfile option.

Note: Once CDH has been upgraded to a version that contains the fix, remove the -log_min_segments_to_retain setting.
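
As a rough illustration of what ends up in the safety valve text, here is a small Python sketch that adds the flag to a gflagfile-style snippet and removes it again after the upgrade. It assumes the snippet is plain text with one flag per line and `#` for comments; the helper names `set_flag` and `remove_flag` are made up for this example. In a Cloudera Manager deployment you would still paste the resulting line into the configuration page rather than edit files by hand.

```python
from typing import Optional

FLAG_NAME = "log_min_segments_to_retain"


def _flag_name(line: str) -> Optional[str]:
    """Return the flag name on a snippet line, or None for blanks and comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Accept both "-flag=value" and "--flag=value" spellings.
    return stripped.lstrip("-").split("=", 1)[0]


def set_flag(snippet: str, name: str, value: object) -> str:
    """Return the snippet with exactly one "-name=value" line, appended last."""
    kept = [ln for ln in snippet.splitlines() if _flag_name(ln) != name]
    kept.append(f"-{name}={value}")
    return "\n".join(kept) + "\n"


def remove_flag(snippet: str, name: str) -> str:
    """Return the snippet without any line that sets `name`."""
    kept = [ln for ln in snippet.splitlines() if _flag_name(ln) != name]
    return "\n".join(kept) + "\n" if kept else ""
```

Calling `set_flag(current_text, FLAG_NAME, 2)` gives the text for the workaround, and `remove_flag(current_text, FLAG_NAME)` gives the text to restore once the cluster runs a fixed version.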

 

However, if the tablet server still fails to start after setting "-log_min_segments_to_retain=2", then the corrupt tablet on the affected tablet server has to be found and deleted in order to bring the tablet server up.