Created on 04-19-2018 04:59 AM - edited 09-16-2022 06:07 AM
Hi,
I am using Kudu as a Flume sink to insert data into multiple tables. One of these tables suddenly stopped ingesting records today, right at 00 hours (I store data with an hour column, and the previous day's data matches the other source exactly). I checked: the range partition is present, the active leader tablet server is running, and the data format has not changed, but there seems to be an issue with leader election. Here are the logs for reference:
Apr 19, 10:40:31.046 AM INFO multi_column_writer.cc:89 Opened CFile writers for 53 column(s)
Apr 19, 10:40:31.363 AM INFO raft_consensus.cc:412 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
Apr 19, 10:40:31.363 AM INFO raft_consensus.cc:2315 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 5.152s
Apr 19, 10:40:31.363 AM INFO raft_consensus.cc:436 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election with config: opid_index: 6 OBSOLETE_local: false peers { permanent_uuid: "2fe8aae066c1491788f5044a8db31524" member_type: VOTER last_known_addr { host: "cicloud2.xxx.com" port: 7050 } } peers { permanent_uuid: "9d754b0276bf4a04a46e2fe0d8d414f6" member_type: VOTER last_known_addr { host: "data-node-3.xxx.com" port: 7050 } }
Apr 19, 10:40:31.364 AM INFO leader_election.cc:216 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Requesting pre-vote from peer 2fe8aae066c1491788f5044a8db31524
Apr 19, 10:40:31.365 AM WARN leader_election.cc:272 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: RPC error from VoteRequest() call to peer 2fe8aae066c1491788f5044a8db31524: Network error: Client connection negotiation failed: client connection to 192.168.1.XX:7050: connect: Connection refused (error 111)
Apr 19, 10:40:31.365 AM INFO leader_election.cc:243 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Election decided. Result: candidate lost.
Apr 19, 10:40:31.366 AM INFO raft_consensus.cc:2315 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 7.498s
Apr 19, 10:40:31.366 AM INFO raft_consensus.cc:2061 T 5960af8cab324245a741e2b064c4b452 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Leader pre-election lost for term 6. Reason: could not achieve majority
Apr 19, 10:40:31.583 AM INFO raft_consensus.cc:1396 T 026299227afe42aab3a604e0b6b26dd9 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 6 FOLLOWER]: Handling vote request from an unknown peer 080aabc3c7c94e178e107f09b7959c15
Apr 19, 10:40:31.583 AM INFO raft_consensus.cc:1847 T 026299227afe42aab3a604e0b6b26dd9 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 6 FOLLOWER]: Leader pre-election vote request: Denying vote to candidate 080aabc3c7c94e178e107f09b7959c15 for term 7 because replica has last-logged OpId of term: 6 index: 10, which is greater than that of the candidate, which has last-logged OpId of term: 6 index: 9.
Apr 19, 10:40:31.952 AM INFO raft_consensus.cc:412 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
Apr 19, 10:40:31.952 AM INFO raft_consensus.cc:2315 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 7.362s
Apr 19, 10:40:31.953 AM INFO raft_consensus.cc:436 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election with config: opid_index: 7 OBSOLETE_local: false peers { permanent_uuid: "9d754b0276bf4a04a46e2fe0d8d414f6" member_type: VOTER last_known_addr { host: "data-node-3.xxx.com" port: 7050 } } peers { permanent_uuid: "2fe8aae066c1491788f5044a8db31524" member_type: VOTER last_known_addr { host: "cicloud2.xxx.com" port: 7050 } }
Apr 19, 10:40:31.953 AM INFO leader_election.cc:216 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Requesting pre-vote from peer 2fe8aae066c1491788f5044a8db31524
Apr 19, 10:40:31.954 AM WARN leader_election.cc:272 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: RPC error from VoteRequest() call to peer 2fe8aae066c1491788f5044a8db31524: Network error: Client connection negotiation failed: client connection to 192.168.1.XX:7050: connect: Connection refused (error 111)
Apr 19, 10:40:31.954 AM INFO leader_election.cc:243 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Election decided. Result: candidate lost.
Apr 19, 10:40:31.955 AM INFO raft_consensus.cc:2315 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 19.615s
Apr 19, 10:40:31.955 AM INFO raft_consensus.cc:2061 T 5e5edc9ab5f84dbe8ef3f1e650057cd1 P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Leader pre-election lost for term 6. Reason: could not achieve majority
Apr 19, 10:40:32.162 AM INFO raft_consensus.cc:412 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
Apr 19, 10:40:32.162 AM INFO raft_consensus.cc:2315 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 10.463s
Apr 19, 10:40:32.162 AM INFO raft_consensus.cc:436 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Starting pre-election with config: opid_index: 9 OBSOLETE_local: false peers { permanent_uuid: "9d754b0276bf4a04a46e2fe0d8d414f6" member_type: VOTER last_known_addr { host: "data-node-3.xxx.com" port: 7050 } } peers { permanent_uuid: "2fe8aae066c1491788f5044a8db31524" member_type: VOTER last_known_addr { host: "cicloud2.xxx.com" port: 7050 } }
Apr 19, 10:40:32.162 AM INFO leader_election.cc:216 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Requesting pre-vote from peer 2fe8aae066c1491788f5044a8db31524
Apr 19, 10:40:32.163 AM WARN leader_election.cc:272 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: RPC error from VoteRequest() call to peer 2fe8aae066c1491788f5044a8db31524: Network error: Client connection negotiation failed: client connection to 192.168.1.XX:7050: connect: Connection refused (error 111)
Apr 19, 10:40:32.163 AM INFO leader_election.cc:243 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [CANDIDATE]: Term 6 pre-election: Election decided. Result: candidate lost.
Apr 19, 10:40:32.164 AM INFO raft_consensus.cc:2315 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Snoozing failure detection for election timeout plus an additional 16.071s
Apr 19, 10:40:32.164 AM INFO raft_consensus.cc:2061 T cf88f2e57cab42e19c98e41e27213bfd P 9d754b0276bf4a04a46e2fe0d8d414f6 [term 5 FOLLOWER]: Leader pre-election lost for term 6. Reason: could not achieve majority
It seems to me that I have lost some replication, but that alone should not stop ingestion.
The Flume logs say:
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 185cf38d136f464697eed244dd6486ba for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E280000008, 0x0000001E800007E280000009)
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 0c0b2283756d40f2a5205d5ff2ab1e73 for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E280000009, 0x0000001E800007E28000000A)
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 6f32baa5b5b545d5814248266e696407 for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E28000000A, 0x0000001E800007E28000000B)
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 6a2d7a4aeb3e44e0bdf0b2a07f390899 for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E28000000B, 0x0000001E800007E28000000C)
2018-04-19 11:24:18,414 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 6bdfbae1836b454d8bbf670d63b3d52f for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001E800007E28000000C, 0x0000001E800007E28000000D)
2018-04-19 11:24:18,415 INFO org.apache.kudu.client.AsyncKuduClient: Learned about tablet 51808b822e0d4f988c99cd66ec7919b1 for table 'impala::vertoz.tbl_bidder_request' with partition [0x0000001F800007E18000000A, 0x0000001F800007E18000000B)
Please suggest!
Regards,
Kundan
Created 04-19-2018 04:17 PM
It's hard to tell from just this log snippet whether anything really bad is happening, but one thing is clear: a tablet server appears to be down during that time (192.168.1.XX:7050: connect: Connection refused).
Is ksck returning all ok? Do you have a range partition that covers the time range you are writing into?
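For example, something along these lines (the master address is a placeholder, SHOW RANGE PARTITIONS needs a reasonably recent Impala, and the bounds assume an integer yyyymmdd range column, so adjust names and values to your schema):

# Check cluster and tablet health from a node that has the kudu CLI
kudu cluster ksck <kudu-master-host>

# From Impala, confirm there is a range partition covering the period you are writing into
impala-shell -q "SHOW RANGE PARTITIONS vertoz.tbl_bidder_request;"

# If the current month is missing, add it (example bounds only)
impala-shell -q "ALTER TABLE vertoz.tbl_bidder_request ADD RANGE PARTITION 20180401 <= VALUES < 20180501;"

If ksck reports unreachable tablet servers or under-replicated tablets, that points back at the node that is refusing connections in your log.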
Created 04-23-2018 02:26 AM
This table is range-partitioned by month, with a hash partition on the day column.
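For context, the layout looks roughly like this (a sketch only; the table name, column names, types and bucket count are illustrative, not the real DDL):

CREATE TABLE vertoz.tbl_bidder_request_sketch (
  day    INT,     -- e.g. 20180419; used for both the hash buckets and the monthly range
  hour   INT,
  req_id STRING,  -- placeholder for the real business columns
  price  DOUBLE,
  PRIMARY KEY (day, hour, req_id)
)
PARTITION BY
  HASH (day) PARTITIONS 4,
  RANGE (day) (
    PARTITION 20180401 <= VALUES < 20180501,  -- April 2018
    PARTITION 20180501 <= VALUES < 20180601   -- May 2018
  )
STORED AS KUDU;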
This seems to have happened because of load on the server at that specific time.
https://issues.apache.org/jira/browse/KUDU-1078
I had to create a new table and delete the previous one to get data insertion started again, but this has now happened with a different table. Here is the ksck output:
Tablet 5e5edc9ab5f84dbe8ef3f1e650057cd1 of table 'impala::vertoz.tbl_bid_winimp' is under-replicated: 1 replica(s) not RUNNING
  080aabc3c7c94e178e107f09b7959c15 (data-node-4.vrtzads.com:7050): RUNNING
  2fe8aae066c1491788f5044a8db31524 (cicloud2.ny1vrtzads.com:7050): bad state
    State:       FAILED
    Data state:  TABLET_DATA_READY
    Last status: Incomplete: Unable to load Consensus metadata: Could not read header for proto container file /home/kudu/tserver/consensus-meta/5e5edc9ab5f84dbe8ef3f1e650057cd1: File size not large enough to be valid: Proto container file /home/kudu/tserver/consensus-meta/5e5edc9ab5f84dbe8ef3f1e650057cd1: Tried to read 16 bytes at offset 0 but file size is only 0 bytes
  9d754b0276bf4a04a46e2fe0d8d414f6 (data-node-3.vrtzads.com:7050): RUNNING [LEADER]
2 replicas' active configs differ from the master's.
  All the peers reported by the master and tablet servers are:
  A = 080aabc3c7c94e178e107f09b7959c15
  B = 2fe8aae066c1491788f5044a8db31524
  C = 9d754b0276bf4a04a46e2fe0d8d414f6
The consensus matrix is:
 Config source |         Voters         | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A   B   C*             |              |              | Yes
 A             | A   B   C              | 5            | -1           | Yes
 B             | [config not available] |              |              |
 C             |     B   C              | 5            | 7            | No
Table impala::vertoz.tbl_bid_winimp has 17 under-replicated tablet(s)
WARNING: 2 out of 7 table(s) are not in a healthy state
==================
Errors:
==================
table consistency check error: Corruption: 2 table(s) are bad
FAILED
Runtime error: ksck discovered errors
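From the output above, the replica on cicloud2 is the one whose consensus-meta file is empty (0 bytes). One option I am considering (please correct me if this is the wrong approach) is to delete just that broken replica so Kudu re-replicates it from the healthy leader on data-node-3, roughly like this (the master address is a placeholder; the tserver address and tablet id are taken from the ksck output above):

# Re-check only the affected table
kudu cluster ksck <kudu-master-host> -tables=impala::vertoz.tbl_bid_winimp

# Tombstone the replica with the 0-byte consensus-meta so it gets re-copied from the leader
kudu remote_replica delete cicloud2.ny1vrtzads.com:7050 5e5edc9ab5f84dbe8ef3f1e650057cd1 "consensus-meta file is empty"

I would do this one tablet at a time and re-run ksck in between, rather than dropping and recreating the whole table again.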