KUDU master consensus conflicts


Hello,

I have a CDP 7.1.6 cluster managed by Cloudera Manager 7.3.1, with 3 masters and 3 workers.

I keep getting this error: Corruption: master consensus error: there are master consensus conflicts

This is the cluster ksck:

Master Summary
UUID | Address | Status
----------------------------------+--------------------------+---------
5620e4a103894151b7bdee5e436f37d8 | master-2.local | HEALTHY
9cea3b56cc9b4be4846a02c0d89be753 | master-1.local | HEALTHY
a98a1f26d0254293b6e17e9daf8f6ef8 | master-3.local | HEALTHY
All reported replicas are:
A = 9cea3b56cc9b4be4846a02c0d89be753
B = 5620e4a103894151b7bdee5e436f37d8
C = a98a1f26d0254293b6e17e9daf8f6ef8
The consensus matrix is:
Config source | Replicas | Current term | Config index | Committed?
---------------+--------------+--------------+--------------+------------
A | A B C | 10120 | -1 | Yes
B | A B* C | 10120 | -1 | Yes
C | A B* C | 10120 | -1 | Yes
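
A note on reading the matrix above: the asterisk marks the replica that each master believes is the leader. B and C both report B*, while A's row has no asterisk, so A does not currently recognize any leader. A minimal sketch for spotting such rows, assuming only the consensus matrix rows are piped in as plain text (find_leaderless is a hypothetical helper name, not part of the kudu CLI):

```shell
# Print every config source whose replica list marks no leader with '*'.
# Expects the pipe-separated consensus matrix rows on stdin.
# find_leaderless is a hypothetical helper, not a kudu command.
find_leaderless() {
  awk -F'|' '
    # Data rows have 5 pipe-separated fields; skip the header row.
    NF >= 5 && $1 !~ /Config source/ {
      src = $1
      gsub(/[ \t]/, "", src)          # trim the source label
      if ($2 !~ /\*/) print src       # no asterisk means no known leader
    }'
}
```

Pasting the three matrix rows into it should flag only A. Other ksck tables also use pipes, so feed it just the matrix section rather than the full ksck output.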

Node A does not seem to be voting. This is its log output:
W1111 11:12:00.526211 18688 leader_election.cc:334] T 00000000000000000000000000000000 P 9cea3b56cc9b4be4846a02c0d89be753 [CANDIDATE]: Term 10122 pre-election: RPC error from VoteRequest() call to peer 5620e4a103894151b7bdee5e436f37d8 (master-2:7051): Network error: Client connection negotiation failed: client connection to 10.157.136.55:7051: connect: Connection refused (error 111)
W1111 11:12:22.683107 18688 leader_election.cc:334] T 00000000000000000000000000000000 P 9cea3b56cc9b4be4846a02c0d89be753 [CANDIDATE]: Term 10122 pre-election: RPC error from VoteRequest() call to peer 5620e4a103894151b7bdee5e436f37d8 (master-2:7051): Timed out: RequestConsensusVote RPC to 10.157.136.55:7051 timed out after 7.916s (SENT)

There is connectivity:

# nc -z -v 10.157.136.55 7051
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.157.136.55:7051.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

The masters, and the whole cluster, have been restarted several times.

Any idea how to fix this? Thanks!

1 REPLY


Hi, based on the output, master A is the faulty one: once it is online, no election takes place. To resync it, delete its local copy of the master tablet with -clean_unsafe and copy it back from the leader master. Please follow the steps below:

A) Stop the problematic master.

B) sudo -u kudu kudu local_replica delete --fs_wal_dir=/wal/kudu/wal --fs_data_dirs=/wal/kudu/data 00000000000000000000000000000000 -clean_unsafe

C) sudo -u kudu kudu local_replica copy_from_remote --fs_wal_dir=/wal/kudu/wal --fs_data_dirs=/wal/kudu/data 00000000000000000000000000000000 <active_leader_fqdn>:7051

D) Stop the remaining two masters.

E) Start all three masters.
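
Steps B and C can be wrapped in a small script so the commands can be reviewed before anything is deleted. This is only a sketch under assumptions: the directories and the tablet ID are the ones from the commands above, the leader address is left as a placeholder to fill in, and run, resync_master and DRY_RUN are hypothetical names introduced here. With DRY_RUN=1 (the default) it only prints the commands.

```shell
#!/bin/sh
# Sketch of steps B and C. Run on the problematic master only, after stopping it.
# Paths and tablet ID are taken from the reply; adjust them to your layout.
WAL_DIR=${WAL_DIR:-/wal/kudu/wal}
DATA_DIRS=${DATA_DIRS:-/wal/kudu/data}
TABLET_ID=00000000000000000000000000000000    # master tablet ID, as in the logs
LEADER=${LEADER:-'<active_leader_fqdn>:7051'}  # placeholder: set the real leader
DRY_RUN=${DRY_RUN:-1}                          # 1 = print commands, do not run

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step B (delete the local replica), then step C (copy it from the leader).
# The copy is skipped if the delete fails.
resync_master() {
  run sudo -u kudu kudu local_replica delete \
    --fs_wal_dir="$WAL_DIR" --fs_data_dirs="$DATA_DIRS" \
    "$TABLET_ID" -clean_unsafe || return 1
  run sudo -u kudu kudu local_replica copy_from_remote \
    --fs_wal_dir="$WAL_DIR" --fs_data_dirs="$DATA_DIRS" \
    "$TABLET_ID" "$LEADER"
}
```

For example, `LEADER=master-2.local:7051 resync_master` prints both commands; set DRY_RUN=0 as well only once the output looks right and master A is stopped.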