Tablet server shows up with two UUIDs after a restart where the WAL directory was lost

We ran into a problem on our production Kudu cluster. The hard disk holding the WAL directory failed on one tablet server. We installed a new disk and cleared the data directories according to the Kudu documentation: https://kudu.apache.org/docs/administration.html#rebuilding_kudu . After starting the failed tablet server, kudu ksck showed two tablet server entries for the same server, each with a different UUID. One of these entries had the status "WRONG_SERVER_UUID".


Why might this error occur? Is there any way to avoid it? Is there a way to solve the problem without restarting the master servers?

 

We also found the command "kudu tserver unregister" for removing the tablet server entry with the wrong UUID, but we could not find this step in the documentation.
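For illustration, here is a minimal sketch of how the "kudu tserver unregister" command mentioned above might be wrapped. The argument order (master addresses, then the stale UUID) is an assumption that should be checked against `kudu tserver unregister --help` for your Kudu version; the function name and the UUID placeholder are hypothetical, and the master addresses are the ones from the quickstart steps in this post.

```shell
# Assumed master addresses, taken from the quickstart commands in this post.
MASTERS="kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251"

# Hypothetical helper: remove a stale tablet server registration from the masters.
# $1 is the UUID that ksck reports with WRONG_SERVER_UUID.
# Assumption: the CLI takes <master_addresses> <uuid> in this order.
unregister_stale_tserver() {
  stale_uuid="$1"
  if [ -z "$stale_uuid" ]; then
    echo "usage: unregister_stale_tserver <stale-uuid>" >&2
    return 1
  fi
  kudu tserver unregister "$MASTERS" "$stale_uuid"
}

# Example (run where the kudu CLI is available, e.g. inside a master container):
#   unregister_stale_tserver "<uuid-shown-as-WRONG_SERVER_UUID>"
```

Running ksck again afterwards would show whether the stale entry is gone.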

 

Steps to reproduce a similar problem:

1. Install the Apache Kudu Quickstart.

Instructions: https://kudu.apache.org/docs/quickstart.html#_bring_up_the_cluster

Clone the Apache Kudu repository using Git and change to the kudu directory:

$ git clone https://github.com/apache/kudu

$ cd kudu

Set the KUDU_QUICKSTART_IP environment variable to your IP address:

$ export KUDU_QUICKSTART_IP=$(ifconfig | grep "inet " | grep -Fv 127.0.0.1 |  awk '{print $2}' | tail -1)

Bring up the Cluster:

$ docker-compose -f docker/quickstart.yml up -d

Check the cluster health:

$ docker exec -it $(docker ps -aqf "name=kudu-master-1") /bin/bash

$ kudu cluster ksck kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251

 

2. Remove the --fs_wal_dir directory from one of the tablet servers. After this, the tablet server container starts crashing.

$ docker exec -it $(docker ps -aqf "name=kudu-tserver-1") /bin/bash

$ rm -rf /var/lib/kudu/tserver/wals

 

3. Start the container again and delete the tablet server's --fs_metadata_dir and --fs_data_dirs directories (default location shown):

$ docker start docker-kudu-tserver-1-1

$ docker exec -it $(docker ps -aqf "name=kudu-tserver-1") /bin/bash

$ rm -rf /var/lib/kudu/tserver/

 

4. Restart the tablet server.

$ docker stop docker-kudu-tserver-1-1

$ docker start docker-kudu-tserver-1-1

 

5. Run kudu cluster ksck against the masters. The output lists two UUIDs for the same tablet server. One of them has the status "WRONG_SERVER_UUID".

$ docker exec -it $(docker ps -aqf "name=kudu-master-1") /bin/bash

$ kudu cluster ksck kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251
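To spot the duplicate entry without reading the whole report, the ksck output can be filtered for addresses that appear with more than one UUID. This is a rough sketch only: it assumes the tablet server summary rows look like "<uuid> | <host:port> | <status> ...", with a 32-character lowercase hex UUID in the first column. The function name is hypothetical.

```shell
# Hypothetical helper: print addresses listed with more than one UUID.
# Assumed row format on stdin: "<uuid> | <host:port> | <status> ..."
find_duplicate_tservers() {
  awk -F'|' '
    NF >= 3 {
      uuid = $1; addr = $2
      # Trim surrounding whitespace from both columns.
      gsub(/^[ \t]+|[ \t]+$/, "", uuid)
      gsub(/^[ \t]+|[ \t]+$/, "", addr)
      # Count only rows whose first column looks like a 32-character hex UUID,
      # which skips the table header and separator lines.
      if (uuid ~ /^[0-9a-f]+$/ && length(uuid) == 32) { count[addr]++ }
    }
    END { for (a in count) if (count[a] > 1) print a }
  '
}

# Example (inside the master container):
#   kudu cluster ksck kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251 | find_duplicate_tservers
```

Any address it prints is registered under two or more UUIDs, which matches the situation described in step 5.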

1 REPLY

This is because of the disk change. The tablet server (TS) originally had a different UUID, but after the WAL directory was replaced it generated a new one, which is why you are seeing WRONG_SERVER_UUID. A simple restart of this TS fixes the issue in most cases. If it does not, rebuild this TS from scratch by deleting its data and WAL directories; that will solve the issue.

Note: Only proceed with the rebuild if the tables in the cluster use a replication factor of 3 (RF=3); otherwise the rebuild will cause data loss.
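To see which UUID the rebuilt tablet server now carries, and so which of the two ksck entries is the stale one, the local UUID can be printed with "kudu fs dump uuid". The sketch below is an assumption-laden example: the default path /var/lib/kudu/tserver is inferred from the quickstart commands in the question, the function name is hypothetical, and a separate --fs_metadata_dir would need to be passed as well if one is configured. The tablet server may need to be stopped for the tool to open the directories.

```shell
# Hypothetical helper: print the UUID stored in the tablet server's local directories.
# $1: WAL directory root (assumed default from the quickstart layout).
# $2: data directories (defaults to the same location as the WAL directory).
print_local_tserver_uuid() {
  wal_dir="${1:-/var/lib/kudu/tserver}"
  data_dirs="${2:-$wal_dir}"
  kudu fs dump uuid --fs_wal_dir="$wal_dir" --fs_data_dirs="$data_dirs"
}

# Example (inside the tablet server container):
#   print_local_tserver_uuid
```

The ksck entry whose UUID differs from the printed value is the leftover registration from before the disk was replaced.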