We're facing instability with Kudu.
We run MapReduce jobs where mappers read from Kudu, process the data, and pass it to reducers, which then write back to Kudu.
Sometimes mappers fail with "Exception running child : java.io.IOException: Couldn't get scan data", caused by "<tablet_id> pretends to not know KuduScanner" (see mapper.txt in the link below). Sometimes this happens on multiple attempts in a row, which results in job failure.
The environment is:
3 masters, 15 tservers.
Here is a failure example, which happened at 2018-08-27 10:26:41. This time there was also a restart of one of the tservers.
At that time the Kudu tablet servers logged multiple requests rejected with backpressure, along with consensus loss (see the attached files from the 3 nodes where the replicas were placed). Logs from the other tablet servers were removed; the attached logs cover a few minutes before and after the failure.
node07 - https://expirebox.com/download/9be42eeb88a367639e207d0c148e6e09.html
node12 - https://expirebox.com/download/0e021bd7fd929b9bd585e4e995729994.html
node13 - https://expirebox.com/download/db31c5ac0305f18b6ef0e2171e2d034c.html
Kudu leader at that time - https://expirebox.com/download/f24cd185e2bb4889dbc18b87c70fc4c8.html
The known limits are respected, except for tablets per server: currently a few servers have ~5000 tablets each, the others fewer.
The servers are powerful enough and have reserved capacity; looking at metrics, there are no anomalies or peaks in CPU, RAM, disk I/O, or network.
Side note: from time to time "slow DNS" messages appear, where the "real" time may exceed the 5s limit while the "user" and "system" times are fine. Some time ago we tried resolving DNS locally, but without positive effect. Still, I don't expect this to be the root cause.
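If name resolution is on a slow path, one low-risk mitigation is to pre-populate /etc/hosts on every node so lookups never leave the machine. A minimal sketch, using the hostnames and addresses from this cluster's layout (entries shown are a subset; the full list would cover all nodes):

```shell
# Append static entries so Kudu nodes resolve each other locally and
# "real" DNS time cannot blow past the 5s warning threshold.
cat >> /etc/hosts <<'EOF'
10.250.250.11 nodemaster01
10.250.250.12 nodemaster02
10.250.250.13 node01
10.250.250.19 node07
10.250.250.24 node12
10.250.250.25 node13
EOF

# Verify that "files" is consulted before "dns":
grep '^hosts:' /etc/nsswitch.conf    # expect something like: hosts: files dns
```

This only helps if the slow lookups are forward/reverse lookups of cluster nodes; slow lookups of external names would need a local caching resolver instead.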
Any suggestions on how to tune the configuration are welcome as well.
IP - Hostname - Role - tserver ID
10.250.250.11 - nodemaster01 - Kudu Master001
10.250.250.12 - nodemaster02 - Kudu Master002
10.250.250.13 - node01 - Kudu Master003 + Kudu Tablet Server
10.250.250.14 - node02 - Kudu Tablet Server
...
10.250.250.19 - node07 - Kudu Tablet Server - e8a1d4cacb7f4f219fd250704004d258
...
10.250.250.24 - node12 - Kudu Tablet Server - 3a5ee46ab1284f4e9d4cdfe5d0b7f7fa
10.250.250.25 - node13 - Kudu Tablet Server - 9c9e4296811a47e4b1709d93772ae20b
...
2018-08-27 10:26:41,010 WARN [New I/O worker #177] org.apache.kudu.client.AsyncKuduScanner: 5e52666b9ecf4460808b387a7bfe67d6 pretends to not know KuduScanner(table=impala::table_name.archive, tablet=null, scannerId="5e93cd2a022a467bb9f1d7c5b45dee80", scanRequestTimeout=30000) org.apache.kudu.client.NonRecoverableException: Scanner not found
The exception above indicates that a scanner on tablet 5e52666b9ecf4460808b387a7bfe67d6 wasn't found. Most likely the scanner expired after being inactive for longer than its TTL, which is set server-wide by the flag --scanner_ttl_ms and defaults to 60s. It's not clear why the scanner was inactive for that long. Does your application sometimes take a long time between receiving batches? I'm not ruling out an issue on the Kudu side, though.
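If the client genuinely can pause between batches (long reducer work, GC pauses, etc.), one workaround is to raise the scanner TTL on the tablet servers. A sketch of the flag, with an illustrative value rather than a recommendation:

```shell
# In the kudu-tserver gflagfile (restart of the tserver required):
# keep idle scanners alive longer before they are garbage-collected
--scanner_ttl_ms=120000    # default is 60000 (60s)
```

The trade-off is that abandoned scanners hold resources (and row locks on their tablets) for longer, so this is best combined with fixing whatever makes the client go quiet for a minute.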
Looking at the log for e8a1d4cacb7f4f219fd250704004d258 (node 07), that tablet is struggling with consensus instability:
W0827 10:27:11.563266 18970 leader_election.cc:287] T 5e52666b9ecf4460808b387a7bfe67d6 P e8a1d4cacb7f4f219fd250704004d258 [CANDIDATE]: Term 1968 pre-election: RPC error from VoteRequest() call to peer 3a5ee46ab1284f4e9d4cdfe5d0b7f7fa: Remote error: Service unavailable: RequestConsensusVote request on kudu.consensus.ConsensusService from 10.250.250.19:37042 dropped due to backpressure. The service queue is full; it has 50 items.
W0827 10:27:11.564280 18970 leader_election.cc:287] T 5e52666b9ecf4460808b387a7bfe67d6 P e8a1d4cacb7f4f219fd250704004d258 [CANDIDATE]: Term 1968 pre-election: RPC error from VoteRequest() call to peer 9c9e4296811a47e4b1709d93772ae20b: Network error: Client connection negotiation failed: client connection to 10.250.250.25:7050: connect: Connection refused (error 111)
Could you provide the output of ksck run on the cluster?
sudo -u kudu kudu cluster ksck <master addresses>
Also, 5000 replicas on one tablet server is far beyond our guidelines and could lead to issues like this.
We've reduced the number of tablets to 4-4.5k per tserver and populated /etc/hosts; the frequency has dropped significantly (previously it sometimes happened for 3 map attempts in a row and failed the job, now it happens rarely and is handled by the second attempt).
The application writes asynchronously, but it shouldn't wait that long. Though I guess it may still be interrupted at the OS level.
I haven't seen the scanner ID in those tservers' logs. However, there were cases some time ago when the "scanner not found" error appeared right after scanner creation, and about 60 seconds later a scanner timeout message appeared on one of the tservers.
Regarding cluster check:
During the backpressure issue the tservers may become unavailable to Kudu (including to "cluster ksck") and consensus is lost. After some time Kudu returns to its normal operational state.
Are there any recommendations to reduce backpressure? Is it worth increasing the service queue from 50 items to something larger?
Or are there any recommendations for tuning Kudu for a larger number of tablets?
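For context, the queue mentioned in the warning appears to be controlled by tserver gflags. A sketch of what we'd consider trying; flag names are from the Kudu configuration reference, but the values are illustrative and the defaults should be verified against the running version:

```shell
# kudu-tserver gflagfile (restart required)
# the "service queue is full; it has 50 items" limit from the log above:
--rpc_service_queue_length=100    # default 50
# more worker threads drain the queue faster:
--rpc_num_service_threads=30      # default 20
```

A longer queue mostly trades rejections for latency, though; if consensus traffic from too many replicas is the real load, the queue will just fill up later.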
Unfortunately you are already past the number of recommended / supported tablets per server, see https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_limitations.html#scaling_limits
In general it will require engineering work to push past that limit and we don't have anyone working on it at the moment.
Yes, I understand that.
Unfortunately, the use case dictates the conditions under which we're hitting the limits:
1) a small number of large servers -> not so many tablets available
2) several dozen systems * a hundred tables each * 3-50 tablets per table * replication factor -> quite a large number of tablets required
Could parameter tuning improve the situation with the backpressure, e.g. changing the defaults?
Frankly it sounds like you should revisit your capacity planning.
You can try bumping up the Raft consensus timeouts and upgrading to the latest version of Kudu, but it may not help that much.
Reviewing the cluster size is in progress, but it takes a long time, as usually happens with on-premises hardware :-)
However, the tablet count will most likely be a bottleneck in the future too, so any related performance improvements are worthwhile.
I see there are improvements optimizing deletion in 1.7.1, so yes, that might be worth considering too.
Raft consensus timeouts probably may help avoid avalanches, but that sounds more like fighting the consequences than the root cause.
1. /etc/hosts is in place again since the issue appeared. It is the first point checked per nsswitch.conf.
2. THP were disabled initially according to Cloudera recommendations.
3. Tablets were rebalanced. Thanks for the link - time to replace custom script.
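For reference, a sketch of the built-in rebalancer that could replace the custom script (available from Kudu 1.8 onward; --report_only gives a dry run):

```shell
# Dry run: report the current replica skew without moving anything
sudo -u kudu kudu cluster rebalance <master addresses> --report_only

# Actual rebalance
sudo -u kudu kudu cluster rebalance <master addresses>
```
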
Basically, reducing the number of tablets (though still above the recommendations), rebalancing, and populating /etc/hosts were the first action points, and they reduced the occurrence significantly. But "slow DNS lookup" and "Couldn't get scan data" still appear from time to time.
@Andreyeff Another thing you can try doing is increasing the raft heartbeat interval from 500ms to 1500ms or even 3000ms, see https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_raft_heartbeat_interval_ms
This will affect your recovery time by a few seconds if a leader fails, since by default elections don't happen for 3 missed heartbeat periods (controlled by https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_leader_failure_max_missed_hea... )
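The trade-off above is easy to quantify: the failure-detection window is roughly the heartbeat interval multiplied by the missed-heartbeat count (3 by default). A quick sketch:

```shell
# failover window ≈ raft_heartbeat_interval_ms * leader_failure_max_missed_heartbeat_periods
for hb_ms in 500 1500 3000; do
  echo "heartbeat=${hb_ms}ms -> election triggered after ~$(( hb_ms * 3 ))ms of silence"
done
```

So a 3000ms heartbeat means a failed leader's tablets stay leaderless for roughly 9 seconds before an election even starts, which is the cost paid for fewer spurious elections under load.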
Some additional points, just in case: