Created on 04-19-2018 04:59 AM - edited 09-16-2022 06:07 AM
Hi,
I am using Kudu as a Flume sink to insert data into multiple tables. One of these tables suddenly stopped ingesting records today, right at 00 hours (I store data with an hour column, and the previous day's data matches the other source exactly). I checked: the range partition is present, the active leader tablet server is running, and the data format has not changed, but there seems to be an issue with leader election. Here are the logs for reference:
Apr 19, 10:40:31.046 AM INFO multi_column_writer.cc:89 Opened CFile writers for 53 column(s)
Apr 19, 10:40:31.363 AM INFO raft_consensus.cc:412 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
Apr 19, 10:40:31.363 AM INFO raft_consensus.cc:2315 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 5.152s
Apr 19, 10:40:31.363 AM INFO raft_consensus.cc:436 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election with config: opid_index: 6 OBSOLETE_local: false peers { permanent_uuid: "2fe8aae066c1491788f5044a8db31524" member_type: VOTER last_known_addr { host: "cicloud2.xxx.com" port: 7050 } } peers { permanent_uuid: "9d754b0276bf4a04a46e2fe0d8d414f6" member_type: VOTER last_known_addr { host: "data-node-3.xxx.com" port: 7050 } }
Apr 19, 10:40:31.364 AM INFO leader_election.cc:216 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Requesting pre-vote from peer 2fe8aae066c1491788f5044a8db31524
Apr 19, 10:40:31.365 AM WARN leader_election.cc:272 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: RPC error from VoteRequest() call to peer 2fe8aae066c1491788f5044a8db31524: Network error: Client connection negotiation failed: client connection to 192.168.1.XX:7050: connect: Connection refused (error 111)
Apr 19, 10:40:31.365 AM INFO leader_election.cc:243 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Election decided. Result: candidate lost.
Apr 19, 10:40:31.366 AM INFO raft_consensus.cc:2315 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 7.498s
Apr 19, 10:40:31.366 AM INFO raft_consensus.cc:2061 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Leader pre-election lost for term 6. Reason: could not achieve majority
Apr 19, 10:40:31.583 AM INFO raft_consensus.cc:1396 T 026299227afe42aab3a604e0b6b26dd9 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 6 FOLLOWER]: Handling vote request from an unknown peer 080aabc3c7c94e178e107f09b7959c15
Apr 19, 10:40:31.583 AM INFO raft_consensus.cc:1847 T 026299227afe42aab3a604e0b6b26dd9 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 6 FOLLOWER]: Leader pre-election vote request: Denying vote to candidate 080aabc3c7c94e178e107f09b7959c15 for term 7 because replica has last-logged OpId of term: 6 index: 10, which is greater than that of the candidate, which has last-logged OpId of term: 6 index: 9.
Apr 19, 10:40:31.952 AM INFO raft_consensus.cc:412 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
Apr 19, 10:40:31.952 AM INFO raft_consensus.cc:2315 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 7.362s
Apr 19, 10:40:31.953 AM INFO raft_consensus.cc:436 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election with config: opid_index: 7 OBSOLETE_local: false peers { permanent_uuid: "9d754b0276bf4a04a46e2fe0d8d414f6" member_type: VOTER last_known_addr { host: "data-node-3.xxx.com" port: 7050 } } peers { permanent_uuid: "2fe8aae066c1491788f5044a8db31524" member_type: VOTER last_known_addr { host: "cicloud2.xxx.com" port: 7050 } }
Apr 19, 10:40:31.953 AM INFO leader_election.cc:216 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Requesting pre-vote from peer 2fe8aae066c1491788f5044a8db31524
Apr 19, 10:40:31.954 AM WARN leader_election.cc:272 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: RPC error from VoteRequest() call to peer 2fe8aae066c1491788f5044a8db31524: Network error: Client connection negotiation failed: client connection to 192.168.1.XX:7050: connect: Connection refused (error 111)
Apr 19, 10:40:31.954 AM INFO leader_election.cc:243 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Election decided. Result: candidate lost.
Apr 19, 10:40:31.955 AM INFO raft_consensus.cc:2315 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 19.615s
Apr 19, 10:40:31.955 AM INFO raft_consensus.cc:2061 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Leader pre-election lost for term 6. Reason: could not achieve majority
Apr 19, 10:40:32.162 AM INFO raft_consensus.cc:412 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
Apr 19, 10:40:32.162 AM INFO raft_consensus.cc:2315 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 10.463s
Apr 19, 10:40:32.162 AM INFO raft_consensus.cc:436 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election with config: opid_index: 9 OBSOLETE_local: false peers { permanent_uuid: "9d754b0276bf4a04a46e2fe0d8d414f6" member_type: VOTER last_known_addr { host: "data-node-3.xxx.com" port: 7050 } } peers { permanent_uuid: "2fe8aae066c1491788f5044a8db31524" member_type: VOTER last_known_addr { host: "cicloud2.xxx.com" port: 7050 } }
Apr 19, 10:40:32.162 AM INFO leader_election.cc:216 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Requesting pre-vote from peer 2fe8aae066c1491788f5044a8db31524
Apr 19, 10:40:32.163 AM WARN leader_election.cc:272 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: RPC error from VoteRequest() call to peer 2fe8aae066c1491788f5044a8db31524: Network error: Client connection negotiation failed: client connection to 192.168.1.XX:7050: connect: Connection refused (error 111)
Apr 19, 10:40:32.163 AM INFO leader_election.cc:243 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Election decided. Result: candidate lost.
Apr 19, 10:40:32.164 AM INFO raft_consensus.cc:2315 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 16.071s
Apr 19, 10:40:32.164 AM INFO raft_consensus.cc:2061 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Leader pre-election lost for term 6. Reason: could not achieve majority
It seems to me that I have lost some replication, but that alone should not stop ingestion.
The Flume logs say:
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 185cf38d136f464697eed244dd6486ba for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E280000008, 0x0000001E800007E280000009)
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 0c0b2283756d40f2a5205d5ff2ab1e73 for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E280000009, 0x0000001E800007E28000000A)
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 6f32baa5b5b545d5814248266e696407 for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E28000000A, 0x0000001E800007E28000000B)
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 6a2d7a4aeb3e44e0bdf0b2a07f390899 for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E28000000B, 0x0000001E800007E28000000C)
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 6bdfbae1836b454d8bbf670d63b3d52f for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E28000000C, 0x0000001E800007E28000000D)
2018-04-19 11:24:18,415 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 51808b822e0d4f988c99cd66ec7919b1 for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001F800007E18000000A, 0x0000001F800007E18000000B)
Please suggest!
Regards,
Kundan
Created 04-19-2018 04:17 PM
It's hard to tell from just this log snippet whether anything really bad is happening, but one thing is clear: a tablet server appears to be down during that time (192.168.1.XX:7050: connect: Connection refused).
Is ksck returning all ok? Do you have a range partition that covers the time range you are writing into?
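For example, something along these lines (the master address is a placeholder, SHOW RANGE PARTITIONS needs a reasonably recent Impala, and the bounds assume an integer yyyymmdd range column, so adjust names and values to your schema):

# Check cluster and tablet health from a node that has the kudu CLI
kudu cluster ksck <kudu-master-host>

# From Impala, confirm there is a range partition covering the period you are writing into
impala-shell -q "SHOW RANGE PARTITIONS vertoz.tbl_bidder_request;"

# If the current month is missing, add it (example bounds only)
impala-shell -q "ALTER TABLE vertoz.tbl_bidder_request ADD RANGE PARTITION 20180401 <= VALUES < 20180501;"

If ksck reports unreachable tablet servers or under-replicated tablets, that points back at the node that is refusing connections in your log.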
Created 04-23-2018 02:26 AM
This table is range-partitioned by month, with a hash partition on the day column.
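For context, the layout looks roughly like this (a sketch only; the table name, column names, types and bucket count are illustrative, not the real DDL):

CREATE TABLE vertoz.tbl_bidder_request_sketch (
  day    INT,     -- e.g. 20180419; used for both the hash buckets and the monthly range
  hour   INT,
  req_id STRING,  -- placeholder for the real business columns
  price  DOUBLE,
  PRIMARY KEY (day, hour, req_id)
)
PARTITION BY
  HASH (day) PARTITIONS 4,
  RANGE (day) (
    PARTITION 20180401 <= VALUES < 20180501,  -- April 2018
    PARTITION 20180501 <= VALUES < 20180601   -- May 2018
  )
STORED AS KUDU;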
This seems to have happened because of load on the server at that specific time.
https://issues.apache.org/jira/browse/KUDU-1078
I had to create a new table and delete the previous one to get data insertion started again, but this has now happened with a different table. Here is the ksck output:
Tablet 5e5edc9ab5f84dbe8ef3f1e650057cd1 of table 'impala::vertoz.tbl_bid_winimp' is under-replicated: 1 replica(s) not RUNNING
  080aabc3c7c94e178e107f09b7959c15 (data-node-4.vrtzads.com:7050): RUNNING
  2fe8aae066c1491788f5044a8db31524 (cicloud2.ny1vrtzads.com:7050): bad state
    State:       FAILED
    Data state:  TABLET_DATA_READY
    Last status: Incomplete: Unable to load Consensus metadata: Could not read header for proto container file /home/kudu/tserver/consensus-meta/5e5edc9ab5f84dbe8ef3f1e650057cd1: File size not large enough to be valid: Proto container file /home/kudu/tserver/consensus-meta/5e5edc9ab5f84dbe8ef3f1e650057cd1: Tried to read 16 bytes at offset 0 but file size is only 0 bytes
  9d754b0276bf4a04a46e2fe0d8d414f6 (data-node-3.vrtzads.com:7050): RUNNING [LEADER]
2 replicas' active configs differ from the master's.
  All the peers reported by the master and tablet servers are:
  A = 080aabc3c7c94e178e107f09b7959c15
  B = 2fe8aae066c1491788f5044a8db31524
  C = 9d754b0276bf4a04a46e2fe0d8d414f6
The consensus matrix is:
 Config source |         Voters         | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A   B   C*             |              |              | Yes
 A             | A   B   C              | 5            | -1           | Yes
 B             | [config not available] |              |              |
 C             |     B   C              | 5            | 7            | No
Table impala::vertoz.tbl_bid_winimp has 17 under-replicated tablet(s)
WARNING: 2 out of 7 table(s) are not in a healthy state
==================
Errors:
==================
table consistency check error: Corruption: 2 table(s) are bad
FAILED
Runtime error: ksck discovered errors
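From the output above, the replica on cicloud2 is the one whose consensus-meta file is empty (0 bytes). One option I am considering (please correct me if this is the wrong approach) is to delete just that broken replica so Kudu re-replicates it from the healthy leader on data-node-3, roughly like this (the master address is a placeholder; the tserver address and tablet id are taken from the ksck output above):

# Re-check only the affected table
kudu cluster ksck <kudu-master-host> -tables=impala::vertoz.tbl_bid_winimp

# Tombstone the replica with the 0-byte consensus-meta so it gets re-copied from the leader
kudu remote_replica delete cicloud2.ny1vrtzads.com:7050 5e5edc9ab5f84dbe8ef3f1e650057cd1 "consensus-meta file is empty"

I would do this one tablet at a time and re-run ksck in between, rather than dropping and recreating the whole table again.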