
Unstable Kudu Master

Expert Contributor

Hello,

In my 3-master cluster, one Kudu master keeps starting and stopping. This is the log detail from Cloudera Manager:

Time Log Level Source Log Message
10:14:41.417 AM WARN cc:288
Found duplicates in --master_addresses: the unique set of addresses is Master1:7051, Master2:7051, Master3:7051
10:15:11.823 AM WARN cc:254
Call kudu.consensus.ConsensusService.RequestConsensusVote from 10.157.136.55:55402 (request call id 0) took 4542 ms (4.54 s). Client timeout 1775 ms (1.78 s)
10:15:11.823 AM WARN cc:254
Call kudu.consensus.ConsensusService.RequestConsensusVote from 10.157.136.37:59796 (request call id 0) took 30215 ms (30.2 s). Client timeout 9654 ms (9.65 s)
10:15:11.823 AM WARN cc:260
Trace:
1112 10:15:07.281146 (+ 0us) service_pool.cc:169] Inserting onto call queue
1112 10:15:07.281169 (+ 23us) service_pool.cc:228] Handling call
1112 10:15:11.823245 (+4542076us) inbound_call.cc:171] Queueing success response
Metrics: {"spinlock_wait_cycles":384}
10:15:11.823 AM WARN cc:260
Trace:
1112 10:14:41.607787 (+ 0us) service_pool.cc:169] Inserting onto call queue
1112 10:14:41.607839 (+ 52us) service_pool.cc:228] Handling call
1112 10:15:11.823242 (+30215403us) inbound_call.cc:171] Queueing success response
Metrics: {}
10:15:11.823 AM WARN cc:254
Call kudu.consensus.ConsensusService.RequestConsensusVote from 10.157.136.55:55402 (request call id 1) took 4536 ms (4.54 s). Client timeout 1955 ms (1.96 s)
10:15:11.823 AM WARN cc:260
Trace:
1112 10:15:07.286988 (+ 0us) service_pool.cc:169] Inserting onto call queue
1112 10:15:07.287025 (+ 37us) service_pool.cc:228] Handling call
1112 10:15:11.823244 (+4536219us) inbound_call.cc:171] Queueing success response
Metrics: {}

 

What does this mean?

Why is this so inconsistent?


1 ACCEPTED SOLUTION

Expert Contributor

Hello,

I found the fix for this case; maybe it will help anyone who has the same Kudu master consensus issue as me.

Master1 is not voting.

The consensus matrix is:

 Config source | Replicas | Current term | Config index | Committed?
---------------+----------+--------------+--------------+------------
 Master1 (A)   | A B C    | 12026        | -1           | Yes
 Master2 (B)   | A B C*   | 12026        | -1           | Yes
 Master3 (C)   | A B C*   | 12026        | -1           | Yes
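
For reference, a consensus matrix like the one above is printed by running ksck against all three masters; Master1/Master2/Master3 below are placeholders for your own master FQDNs:

sudo -u kudu kudu cluster ksck Master1:7051,Master2:7051,Master3:7051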

The workaround is:

A) Stop the problematic master, then run the command below on that host.
B) sudo -u kudu kudu local_replica delete --fs_wal_dir=/var/kudu/master --fs_data_dirs=/var/kudu/master 00000000000000000000000000000000 -clean_unsafe
C) Check which master is the current leader in the Kudu web UI (a CLI alternative is sketched after this list). In my case it showed:
a98a1f26d0254293b6e17e9daf8f6ef8 822fcc68eff448269c9200a8c4c2ecc8 LEADER 2022-11-22 07:18:21 GMT
rpc_addresses { host: "sdzw-hpas-35" port: 7051 } http_addresses { host: "sdzw-hpas-35" port: 8051 } software_version: "kudu 1.13.0.7.1.6.0-297 (rev 9323384dbd925202032a965e955979d6d2f6acb0)" https_enabled: false
D) Copy the master data from the leader, pointing --fs_wal_dir and --fs_data_dirs at the problematic master's configured directories:
sudo -u kudu kudu local_replica copy_from_remote --fs_wal_dir=/wal/kudu/wal --fs_data_dirs=/wal/kudu/data 00000000000000000000000000000000 <active_leader_fqdn>:7051
# sudo -u kudu /opt/cloudera/parcels/CDH-7.1.6-1.cdh7.1.6.p0.10506313/bin/../lib/kudu/bin/kudu local_replica copy_from_remote --fs_wal_dir=/var/kudu/master --fs_data_dirs=/var/kudu/master 00000000000000000000000000000000 sdzw-hpas-35.nrtsz.local:7051
E) Stop the remaining two masters.
F) Start all three masters.
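
As a sketch of the CLI alternative mentioned in step C: if your kudu CLI version includes the master list subcommand, it prints each master's UUID and RPC address (Master1/Master2/Master3 below are placeholders for your own master FQDNs):

sudo -u kudu kudu master list Master1:7051,Master2:7051,Master3:7051

After step F, re-running kudu cluster ksck against all three masters is a quick way to confirm the consensus matrix is consistent again.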

 

