New Contributor
Posts: 2
Registered: ‎07-24-2017

Kudu cluster under heavy insertion doesn't perform as expected

I have a 12-machine cluster: 2 machines are used as writers and Kudu masters, and 10 machines are used as tablet servers.

Every machine has:

11 × 1TB disks, 250GB of memory, and 40 CPU cores.

 

Writers:

The code is written with the Java Kudu client v1.3.0, and I use KuduClient to connect to the cluster.

The writers run with a 32GB heap (-Xmx32G).
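
For reference, the client itself is created once per writer process and shared by all threads, roughly like this (the master addresses are placeholders):

KuduClient client = new KuduClient.KuduClientBuilder("<master1>:7051,<master2>:7051").build();
KuduTable table = client.openTable(tableName);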

Configuration, code, etc:

The primary key is constructed from 7 fields out of ~40.

All key fields are range partitioned, and 2 of the key columns are hash partitioned:

Definition of the fields and creation of the table:
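
For context, the columns list used below is built with ColumnSchemaBuilder, roughly like this (only srcip and dstip are real column names from my schema; the other names and the types are illustrative):

List<ColumnSchema> columns = new ArrayList<>();
// Key columns first, in key order (7 in total; names other than srcip/dstip and the types are assumed).
columns.add(new ColumnSchema.ColumnSchemaBuilder("srcip", Type.STRING).key(true).build());
columns.add(new ColumnSchema.ColumnSchemaBuilder("dstip", Type.STRING).key(true).build());
columns.add(new ColumnSchema.ColumnSchemaBuilder("event_time", Type.INT64).key(true).build());
// ... 4 more key columns ...
// Non-key columns (the rest of the ~40 fields).
columns.add(new ColumnSchema.ColumnSchemaBuilder("bytes_sent", Type.INT64).nullable(true).build());
// ... remaining columns ...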

// Range-partition on every primary key column.
List<String> rangeKeys = new ArrayList<>(primaryKeys);

List<String> hashKeys = new ArrayList<>();
hashKeys.add("srcip");
hashKeys.add("dstip");

Schema schema = new Schema(columns);

CreateTableOptions tableOptions = new CreateTableOptions()
        .setRangePartitionColumns(rangeKeys)
        .addHashPartitions(hashKeys, 110);
client.createTable(tableName, schema, tableOptions);

Every writer has 30 threads. There is only one client instantiation, and every thread creates a session and sets the flush mode:

session = client.newSession();
session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

Since the writers are constantly writing, the session is never closed.
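
Since the flush happens in the background, row errors are reported asynchronously; for reference, a minimal sketch of draining them from the session (the logging here is just illustrative):

if (session.countPendingErrors() > 0) {
    RowErrorsAndOverflowStatus pendingErrors = session.getPendingErrors();
    for (RowError rowError : pendingErrors.getRowErrors()) {
        // Log and/or retry the failed operation; plain stderr here for illustration.
        System.err.println("Row error: " + rowError);
    }
    if (pendingErrors.isOverflowed()) {
        System.err.println("Error buffer overflowed, some row errors were discarded");
    }
}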

Inserting into the table:

Insert insert = table.newInsert();
PartialRow row = insert.getRow();
// Fill the row.
session.apply(insert);
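
The row fill itself is just the PartialRow setters, e.g. (the values and the column names other than srcip/dstip are illustrative):

row.addString("srcip", "10.0.0.1");
row.addString("dstip", "10.0.0.2");
row.addLong("event_time", System.currentTimeMillis());
// ... and so on for the remaining ~40 columns ...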

 

Tablet servers:

Data is stored on the 11 disks:

/disk[1-11]/kudu/data

maintenance_manager_num_threads=8

memory_limit_hard_bytes=32GB

memory_limit_soft_percentage=60

block_cache_capacity_mb=32GB

default_num_replicas=3
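
In flag-file form this is roughly the following (a sketch; note that memory_limit_hard_bytes is given in bytes and block_cache_capacity_mb in megabytes, so the 32GB values above translate as shown):

# Tablet server flag file (sketch, data-dir list abbreviated)
--fs_data_dirs=/disk1/kudu/data,/disk2/kudu/data,...,/disk11/kudu/data
--maintenance_manager_num_threads=8
# 32GB expressed in bytes
--memory_limit_hard_bytes=34359738368
--memory_limit_soft_percentage=60
# 32GB expressed in megabytes
--block_cache_capacity_mb=32768
# replication factor (this flag is read by the masters)
--default_num_replicas=3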

 

Each tablet server has 32-35 tablets (checked via the management page), some as leader and some as follower.
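
(That count is consistent with the partitioning above: since no range splits are added, the table has 110 tablets from the hash buckets, so 110 × 3 replicas = 330 tablet replicas spread over 10 tablet servers ≈ 33 per server.)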

Tablet size after some time of ingestion is 400-700MB (checked on several hosts).

 

 

The Issues:

1. The insertion rate is too low: at the beginning of the ingestion it is ~40K rows per second for the entire cluster.

2. The insertion rate decreases over time, and the load is not equally spread over the machines.

Characteristic graphs of the tablet servers only:

[Five screenshots of tablet server metrics graphs (Screen Shot 2017-07-24 at 12.26.59 through 12.28.50)]

 

Errors from the tablet server logs:

These errors appear on both the high- and low-utilization machines:

W0724 04:31:51.197305 39346 tablet_service.cc:753] Rejecting Write request: Soft memory limit exceeded (at 100.00% of capacity) [suppressed 929 similar messages]
W0724 04:31:51.199170 39392 raft_consensus.cc:1186] Rejecting consensus request: Soft memory limit exceeded (at 100.00% of capacity) [suppressed 62 similar messages]
I0724 04:31:51.228680 39450 maintenance_manager.cc:372] P e102744a286340b083b0fa45cebe5658: we have exceeded our soft memory limit (current capacity is 100.00%).  However, there are no ops currently runnable which would free memory.
W0724 04:31:51.274428 39339 consensus_peers.cc:357] T b2083caebfc64ace8e9394d57c93929b P e102744a286340b083b0fa45cebe5658 -> Peer 884486678fd04e2b9d2d75e97c1fa33e (<Machine4>:7050): Couldn't send request to peer 884486678fd04e2b9d2d75e97c1fa33e for tablet b2083caebfc64ace8e9394d57c93929b. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 100.01% of capacity). Retrying in the next heartbeat period. Already tried 98430 times.
W0724 04:31:51.351434 39338 consensus_peers.cc:357] T 07bb7f42e5604d25860b953c295e07ca P e102744a286340b083b0fa45cebe5658 -> Peer e109e575612a456bbf029f136a654c1b (<Machine9>:7050): Couldn't send request to peer e109e575612a456bbf029f136a654c1b for tablet 07bb7f42e5604d25860b953c295e07ca. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 100.00% of capacity). Retrying in the next heartbeat period. Already tried 98906 times.
W0724 04:31:51.398777 39339 consensus_peers.cc:357] T 3034310c9b834b61a98f0a8cfb726d83 P e102744a286340b083b0fa45cebe5658 -> Peer 884486678fd04e2b9d2d75e97c1fa33e (<Machine4>:7050): Couldn't send request to peer 884486678fd04e2b9d2d75e97c1fa33e for tablet 3034310c9b834b61a98f0a8cfb726d83. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 100.01% of capacity). Retrying in the next heartbeat period. Already tried 110022 times.
W0724 04:31:51.411522 39339 consensus_peers.cc:357] T cf19c2f0bc1a4d88b79e8cb9cec3275b P e102744a286340b083b0fa45cebe5658 -> Peer 4fcfb3d7897c43eea33b7cab7c5ff22c (<Machine8>:7050): Couldn't send request to peer 4fcfb3d7897c43eea33b7cab7c5ff22c for tablet cf19c2f0bc1a4d88b79e8cb9cec3275b. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 100.00% of capacity). Retrying in the next heartbeat period. Already tried 106467 times.
W0724 04:31:51.460232 39339 consensus_peers.cc:357] T f1cef98dfaab46b885919bdfce8012b1 P e102744a286340b083b0fa45cebe5658 -> Peer 884486678fd04e2b9d2d75e97c1fa33e (<Machine4>:7050): Couldn't send request to peer 884486678fd04e2b9d2d75e97c1fa33e for tablet f1cef98dfaab46b885919bdfce8012b1. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 100.00% of capacity). Retrying in the next heartbeat period. Already tried 105370 times.
I0724 04:31:51.480481 39450 maintenance_manager.cc:372] P e102744a286340b083b0fa45cebe5658: we have exceeded our soft memory limit (current capacity is 100.00%).  However, there are no ops currently runnable which would free memory.
W0724 04:31:51.582916 39339 consensus_peers.cc:357] T 38a8b8e4e4f842eb9b0d987594fe8819 P e102744a286340b083b0fa45cebe5658 -> Peer d1a1602d7910445eb8394b0be813aa71 (<Machine9>:7050): Couldn't send request to peer

 

Any ideas on what other changes need to be made, what else should be tested, or what the issue might be?

New Contributor
Posts: 2
Registered: ‎07-24-2017

Re: Kudu cluster under heavy insertion doesn't perform as expected

Update:

I increased the memory of the tablet servers to 200GB.

 

The write rate remains the same; for some reason it can't get over the magic number of 20,000 rows per second.

 

CPU usage and write rate remained the same.

 

New Contributor
Posts: 2
Registered: ‎01-11-2018

Re: Kudu cluster under heavy insertion doesn't perform as expected

Hello @AntonPuz. Did you find the reason for this issue? I have a similar situation. I could achieve different insertion rates by changing the number of rows sent per request:

session.setMutationBufferSpace(batchSize);
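
For reference, a sketch of how that knob fits into the session setup (the flush mode here matches the one in the original post, and the interval value is just an example):

KuduSession session = client.newSession();
session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
// Maximum number of operations buffered in the session between background flushes.
session.setMutationBufferSpace(batchSize);
// How often the background flush fires, in milliseconds (1000 is just an example value).
session.setFlushInterval(1000);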

 

 However, I'm not able to find the bottleneck of the system.
