Kudu soft memory error

Expert Contributor

Hi,

I am doing an upsert into a Kudu table and getting this error:

Kudu error(s) reported, first error: Timed out: Failed to write batch of 227 ops to tablet 8b19e4a0362e4b82941e54d33ac9c5a2 after 1 attempt(s): Failed to write to server: b2ead65ab0164f5b8db24d700a2c474a (wewcw0hd3dn02.example.com:7050): Write RPC to 10.11.100.85:7050 timed out after 179.977s (SENT)

 

When I checked the tablet server log, I found:

 

W0623 11:30:58.060768 13885 consensus_peers.cc:357] T 4c8c4a5bcdc24f9a87ec5800818ca937 P b2ead65ab0164f5b8db24d700a2c474a -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 4c8c4a5bcdc24f9a87ec5800818ca937. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 98.77% of capacity). Retrying in the next heartbeat period. Already tried 32 times.
W0623 11:30:58.318686 13885 consensus_peers.cc:357] T 6bb3bdb188b44048bcddd70db21158dc P b2ead65ab0164f5b8db24d700a2c474a -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 6bb3bdb188b44048bcddd70db21158dc. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 96.85% of capacity). Retrying in the next heartbeat period. Already tried 40 times.

 

Initially I thought it was because of the batch size. I used a batch size of 10000 and also doubled the memory hard limit in Kudu to 2GB, but the same error still occurs.
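
The warnings quoted in this post can be tallied per peer to see which tablet servers are pushing back and how close to the limit they report being. The following is a minimal sketch, not a Kudu tool: the function name and the regular expression are illustrative assumptions, written against the `consensus_peers.cc` message format shown above.

```python
import re
from collections import defaultdict

# Pulls the peer host and the reported percentage out of a
# "Soft memory limit exceeded" warning (format taken from the log above).
PATTERN = re.compile(
    r"-> Peer [0-9a-f]+ \((?P<host>[^)]+)\):"
    r".*Soft memory limit exceeded \(at (?P<pct>[\d.]+)% of capacity\)"
)


def summarize_memory_pushback(lines):
    """Return {peer host: (warning count, highest percentage seen)}."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a memory pushback warning
        entry = stats[match.group("host")]
        entry[0] += 1
        entry[1] = max(entry[1], float(match.group("pct")))
    return {host: (count, peak) for host, (count, peak) in stats.items()}


# The two warnings from the post, reproduced verbatim.
sample = [
    "W0623 11:30:58.060768 13885 consensus_peers.cc:357] "
    "T 4c8c4a5bcdc24f9a87ec5800818ca937 P b2ead65ab0164f5b8db24d700a2c474a "
    "-> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): "
    "Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet "
    "4c8c4a5bcdc24f9a87ec5800818ca937. Status: Remote error: Service unavailable: "
    "Soft memory limit exceeded (at 98.77% of capacity). "
    "Retrying in the next heartbeat period. Already tried 32 times.",
    "W0623 11:30:58.318686 13885 consensus_peers.cc:357] "
    "T 6bb3bdb188b44048bcddd70db21158dc P b2ead65ab0164f5b8db24d700a2c474a "
    "-> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): "
    "Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet "
    "6bb3bdb188b44048bcddd70db21158dc. Status: Remote error: Service unavailable: "
    "Soft memory limit exceeded (at 96.85% of capacity). "
    "Retrying in the next heartbeat period. Already tried 40 times.",
]

if __name__ == "__main__":
    for host, (count, peak) in summarize_memory_pushback(sample).items():
        print(f"{host}: {count} warnings, peak {peak}% of capacity")
```

In practice the same function could be fed the lines of a tablet server log file instead of the two-line sample.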

1 ACCEPTED SOLUTION

Expert Contributor

Hi,

 

2GB is still on the small side, especially if you have a lot of tablets. Here's the recommendation: http://kudu.apache.org/docs/known_issues.html#_server_management

 

Also, which version of Kudu? How many data dirs? How many maintenance manager threads?

 

Thanks,

 

J-D
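
For a sense of scale on the 2GB figure, here is a rough sketch of how a hard limit relates to the point where write throttling can begin. It assumes the tablet server's `--memory_limit_soft_percentage` flag is set to 80; that value is an assumption for illustration only, so check the default for your Kudu version. The linear ramp between the soft and hard limits is a simplification of the idea that the further usage exceeds the soft limit, the more likely a write is rejected. It is not Kudu's actual code.

```python
GIB = 1024 ** 3


def soft_limit_bytes(hard_limit_bytes, soft_percentage):
    """Bytes of memory use at which throttling may begin."""
    return hard_limit_bytes * soft_percentage // 100


def throttle_fraction(used_pct, soft_percentage):
    """Position of current usage between the soft limit (0.0) and the
    hard limit (1.0). A simplified stand-in for rejection likelihood."""
    if used_pct <= soft_percentage:
        return 0.0
    if used_pct >= 100:
        return 1.0
    return (used_pct - soft_percentage) / (100 - soft_percentage)


if __name__ == "__main__":
    soft_pct = 80  # assumed value, see the note above
    for hard_gib in (2, 10):
        soft = soft_limit_bytes(hard_gib * GIB, soft_pct)
        print(f"hard limit {hard_gib} GiB -> soft limit {soft / GIB:.2f} GiB")
    # Percentages reported in the tablet server warnings in this thread.
    for pct in (98.77, 96.85):
        print(f"{pct}% of capacity -> fraction {throttle_fraction(pct, soft_pct):.2f}")
```

Under these assumptions, a server reporting more than 95% of capacity is nearly at its hard limit, which is consistent with writes timing out rather than merely slowing down.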

Expert Contributor

 

OK, I have increased the memory to 10GB. The Kudu version is 1.3.0-1.cdh5.11.0.p0.12.

There is one data dir per tablet server, and 4 tablet servers in total.

 

 

748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 0bcada0b3ff54885a5db5d234f51bc10. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 53 times.
W0623 12:13:45.419793 22784 consensus_peers.cc:357] T 4a9cc150758744a0a2fd477c5dcb7ff3 P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 4a9cc150758744a0a2fd477c5dcb7ff3. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 43 times.
W0623 12:13:45.453810 22784 consensus_peers.cc:357] T 4fdce491787a4be3b4e0cef44107b191 P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer b2ead65ab0164f5b8db24d700a2c474a (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer b2ead65ab0164f5b8db24d700a2c474a for tablet 4fdce491787a4be3b4e0cef44107b191. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 29 times.
W0623 12:13:45.498914 22784 consensus_peers.cc:357] T 4c8c4a5bcdc24f9a87ec5800818ca937 P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 4c8c4a5bcdc24f9a87ec5800818ca937. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 263 times.

 

Expert Contributor
But all tablet servers are up and running.

Expert Contributor

[Attached screenshot: Capture3.PNG]

Expert Contributor

OK, with 1 disk you wouldn't want more than one maintenance manager thread.

 

The log lines you showed can mean that the tablet server is still starting up. Depending on the amount of data and the number of tablets, this can take minutes. You can also run "kudu cluster ksck <master_addresses>" to see what's up with all the tablets.

Expert Contributor

Still the same error:

 

W0623 12:29:44.577033 4551 consensus_peers.cc:357] T 3b475f2c54774d50b88a5e38249d758b P 6e748f9ff1944b5490f23631bcd351da -> Peer 63b4ef88fb84431ea93a79304c3b9bb8 (wewcw0hd3dn01.example.com:7050): Couldn't send request to peer 63b4ef88fb84431ea93a79304c3b9bb8 for tablet 3b475f2c54774d50b88a5e38249d758b. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 87.59% of capacity). Retrying in the next heartbeat period. Already tried 5 times.

 

W0623 12:32:16.081887 22784 consensus_peers.cc:357] T 729d5f7c4678430fa7a99f3fbb76e9a7 P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 729d5f7c4678430fa7a99f3fbb76e9a7. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 91.41% of capacity). Retrying in the next heartbeat period. Already tried 3 times.

 

But only these two servers are causing this error:

wewcw0hd3dn01.example.com:7050

wewcw0hd3dn04.example.com:7050

 

The four tablet servers are:

 

wewcw0hd3dn01.example.com:7050

wewcw0hd3dn02.example.com:7050

wewcw0hd3dn03.example.com:7050

wewcw0hd3dn04.example.com:7050

 

[root@wewcw0hd3dn03 ~]# kudu cluster ksck wewcw0hd3dn03.example.com
Connected to the Master
Fetched info from all 4 Tablet Servers
Table impala::compliance.stage_kudu_cp_country_flag is HEALTHY (16 tablet(s) checked)

Table impala::compliance.stage_kudu_cp_address is HEALTHY (16 tablet(s) checked)

Table impala::compliance.stage_kudu_consumer_profile_entities is HEALTHY (16 tablet(s) checked)

Table impala::compliance.stage_kudu_cp_account is HEALTHY (16 tablet(s) checked)

Table impala::compliance.stage_kudu_cp_phone is HEALTHY (16 tablet(s) checked)

Table impala::compliance.stage_kudu_consumer_profile is HEALTHY (16 tablet(s) checked)

Table impala::compliance.stage_kudu_transaction_master is HEALTHY (32 tablet(s) checked)

Table impala::compliance.stage_kudu_cp_id is HEALTHY (32 tablet(s) checked)

Table impala::compliance.my_first_table is HEALTHY (16 tablet(s) checked)

The metadata for 9 table(s) is HEALTHY
OK
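
Since the earlier advice was that memory needs grow with the number of tablets, it can help to total the tablets from this ksck report. The sketch below parses the output pasted above; the function name and regular expression are illustrative assumptions based only on the line format visible in this post. The report does not show the replication factor, so the per-server replica count is deliberately not computed.

```python
import re

# ksck table lines copied from the post above.
KSCK_OUTPUT = """\
Table impala::compliance.stage_kudu_cp_country_flag is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_address is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_consumer_profile_entities is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_account is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_phone is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_consumer_profile is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_transaction_master is HEALTHY (32 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_id is HEALTHY (32 tablet(s) checked)
Table impala::compliance.my_first_table is HEALTHY (16 tablet(s) checked)
"""

TABLE_LINE = re.compile(
    r"^Table (\S+) is (\w+) \((\d+) tablet\(s\) checked\)$", re.MULTILINE
)


def tablets_per_table(ksck_text):
    """Return {table name: tablet count} from ksck summary lines."""
    return {name: int(count) for name, _status, count in TABLE_LINE.findall(ksck_text)}


if __name__ == "__main__":
    counts = tablets_per_table(KSCK_OUTPUT)
    for name, count in sorted(counts.items(), key=lambda item: -item[1]):
        print(f"{count:4d}  {name}")
    print(f"{sum(counts.values()):4d}  total tablets across {len(counts)} tables")
```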

Expert Contributor

Ok so it seems the things you previously pasted were resolved when your Kudu cluster finished booting up.

 

When you say "same error", you mean on the client side you're still getting timeouts? I'm asking because what you're showing after that are warnings, not errors, and it's perfectly normal to get memory pushback when servers are too busy.

 

Now this can mean one of two things: you're trying to insert faster than Kudu can ingest based on the resources (disks/RAM/CPU) it's given, or there's something wrong with flushing and it's too slow. Having a much bigger log snippet from wewcw0hd3dn01 or wewcw0hd3dn04 would help.
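
If the first case applies (the client is writing faster than the cluster can absorb), one common mitigation on the client side is to retry a timed-out batch with exponential backoff so the servers get time to flush. The following is a generic sketch, not Kudu client code: `flush_batch` is a hypothetical callable standing in for whatever submits one batch, and `TimeoutError` stands in for whatever exception your client raises on a write timeout.

```python
import random
import time


def write_with_backoff(flush_batch, max_attempts=5, base_delay_s=1.0,
                       max_delay_s=60.0, sleep=time.sleep):
    """Call flush_batch(), retrying on TimeoutError with exponential backoff.

    flush_batch is a hypothetical zero-argument callable that submits one
    batch of writes and raises TimeoutError if the write times out.
    """
    for attempt in range(max_attempts):
        try:
            return flush_batch()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Double the delay each attempt, cap it, and add jitter so
            # multiple writers do not all retry at the same moment.
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_flush():
        # Simulated writer: times out twice, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise TimeoutError("simulated write timeout")
        return "ok"

    print(write_with_backoff(flaky_flush, base_delay_s=0.01))
```

Backoff only helps with temporary overload. If flushing itself is too slow, the retries will keep timing out and the server logs remain the place to look.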

Expert Contributor
Increasing the memory worked; I didn't get the timeout.
And can we ignore that soft memory warning?

avatar
Hi, is increasing the memory limit the only solution for this kind of error? Thanks