Created on 06-23-2017 08:42 AM - edited 09-16-2022 04:48 AM
Hi,
I am doing an upsert into a Kudu table and getting this error:
Kudu error(s) reported, first error: Timed out: Failed to write batch of 227 ops to tablet 8b19e4a0362e4b82941e54d33ac9c5a2 after 1 attempt(s): Failed to write to server: b2ead65ab0164f5b8db24d700a2c474a (wewcw0hd3dn02.example.com:7050): Write RPC to 10.11.100.85:7050 timed out after 179.977s (SENT)
When I checked the tablet server log, I found:
W0623 11:30:58.060768 13885 consensus_peers.cc:357] T 4c8c4a5bcdc24f9a87ec5800818ca937 P b2ead65ab0164f5b8db24d700a2c474a -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 4c8c4a5bcdc24f9a87ec5800818ca937. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 98.77% of capacity). Retrying in the next heartbeat period. Already tried 32 times.
W0623 11:30:58.318686 13885 consensus_peers.cc:357] T 6bb3bdb188b44048bcddd70db21158dc P b2ead65ab0164f5b8db24d700a2c474a -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 6bb3bdb188b44048bcddd70db21158dc. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 96.85% of capacity). Retrying in the next heartbeat period. Already tried 40 times.
Initially I thought it was because of the batch size. I used a batch size of 10000 and also doubled the memory hard limit in Kudu to 2GB, but the same error still occurs.
Created 06-23-2017 08:51 AM
Hi,
2GB is still on the small side, especially if you have a lot of tablets. Here's the recommendation: http://kudu.apache.org/docs/known_issues.html#_server_management
Also, which version of Kudu? How many data dirs? How many maintenance manager threads?
Thanks,
J-D
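For context on the memory figures in this thread: the "Soft memory limit exceeded" warnings relate to a soft threshold that Kudu derives as a percentage of the hard memory limit. A minimal sketch of that arithmetic, assuming a soft percentage of 80 (believed to be the default of the `--memory_limit_soft_percentage` tablet server flag; verify against your own configuration). The function name is illustrative only.

```python
GIB = 1024 ** 3


def soft_limit_bytes(hard_limit_bytes: int, soft_percentage: int = 80) -> int:
    """Return the soft memory threshold for a given hard limit.

    soft_percentage=80 is an assumption about the configured value,
    not something stated in this thread.
    """
    return hard_limit_bytes * soft_percentage // 100


# Hard limits mentioned in the thread: 2GB at first, 10GB later.
for hard_gib in (2, 10):
    soft = soft_limit_bytes(hard_gib * GIB)
    print(f"hard limit {hard_gib} GiB -> soft limit {soft / GIB:.1f} GiB")
```

Under that assumption, a 2GB hard limit leaves only about 1.6GB before writes start being pushed back, which is consistent with the advice that 2GB is on the small side.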
Created 06-23-2017 09:17 AM
OK, I have increased the memory to 10GB. The Kudu version is 1.3.0-1.cdh5.11.0.p0.12.
There is one data dir per tablet server, and 4 tablet servers in total.
748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 0bcada0b3ff54885a5db5d234f51bc10. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 53 times.
W0623 12:13:45.419793 22784 consensus_peers.cc:357] T 4a9cc150758744a0a2fd477c5dcb7ff3 P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 4a9cc150758744a0a2fd477c5dcb7ff3. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 43 times.
W0623 12:13:45.453810 22784 consensus_peers.cc:357] T 4fdce491787a4be3b4e0cef44107b191 P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer b2ead65ab0164f5b8db24d700a2c474a (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer b2ead65ab0164f5b8db24d700a2c474a for tablet 4fdce491787a4be3b4e0cef44107b191. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 29 times.
W0623 12:13:45.498914 22784 consensus_peers.cc:357] T 4c8c4a5bcdc24f9a87ec5800818ca937 P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 4c8c4a5bcdc24f9a87ec5800818ca937. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: NOT_STARTED. Retrying in the next heartbeat period. Already tried 263 times.
Created 06-23-2017 09:22 AM
OK, with 1 disk you wouldn't want more than 1 maintenance manager thread.
The log lines you showed can mean that the tablet server is still starting up. Depending on the amount of data and the number of tablets, it can take minutes. You can also run "kudu cluster ksck <master_addresses>" to see what's up with all the tablets.
Created 06-23-2017 09:34 AM
Still the same error:
W0623 12:29:44.577033 4551 consensus_peers.cc:357] T 3b475f2c54774d50b88a5e38249d758b P 6e748f9ff1944b5490f23631bcd351da -> Peer 63b4ef88fb84431ea93a79304c3b9bb8 (wewcw0hd3dn01.example.com:7050): Couldn't send request to peer 63b4ef88fb84431ea93a79304c3b9bb8 for tablet 3b475f2c54774d50b88a5e38249d758b. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 87.59% of capacity). Retrying in the next heartbeat period. Already tried 5 times.
W0623 12:32:16.081887 22784 consensus_peers.cc:357] T 729d5f7c4678430fa7a99f3fbb76e9a7 P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer 6e748f9ff1944b5490f23631bcd351da (wewcw0hd3dn04.example.com:7050): Couldn't send request to peer 6e748f9ff1944b5490f23631bcd351da for tablet 729d5f7c4678430fa7a99f3fbb76e9a7. Status: Remote error: Service unavailable: Soft memory limit exceeded (at 91.41% of capacity). Retrying in the next heartbeat period. Already tried 3 times.
But only these two servers are causing this error:
wewcw0hd3dn01.example.com:7050
wewcw0hd3dn04.example.com:7050
The four tablet servers are:
wewcw0hd3dn01.example.com:7050
wewcw0hd3dn02.example.com:7050
wewcw0hd3dn03.example.com:7050
wewcw0hd3dn04.example.com:7050
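One way to confirm which peers are pushing back is to tally the "Soft memory limit exceeded" warnings by peer address. The sketch below does that for the two warning lines quoted above; the function name and the regular expression are illustrative, and it assumes every line follows the `consensus_peers.cc` format shown in this thread.

```python
import re
from collections import Counter

# Matches the "-> Peer <uuid> (<host:port>)" part of a consensus_peers warning.
PEER_RE = re.compile(r"-> Peer (?P<uuid>[0-9a-f]{32}) \((?P<addr>[^)]+)\)")
PRESSURE_MARKER = "Soft memory limit exceeded"


def count_memory_pushback(lines):
    """Count memory-pushback warnings per remote peer address."""
    counts = Counter()
    for line in lines:
        if PRESSURE_MARKER not in line:
            continue
        match = PEER_RE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


# The two warnings pasted in the post above.
sample = [
    "W0623 12:29:44.577033 4551 consensus_peers.cc:357] T 3b475f2c54774d50b88a5e38249d758b "
    "P 6e748f9ff1944b5490f23631bcd351da -> Peer 63b4ef88fb84431ea93a79304c3b9bb8 "
    "(wewcw0hd3dn01.example.com:7050): Couldn't send request to peer "
    "63b4ef88fb84431ea93a79304c3b9bb8 for tablet 3b475f2c54774d50b88a5e38249d758b. "
    "Status: Remote error: Service unavailable: Soft memory limit exceeded "
    "(at 87.59% of capacity). Retrying in the next heartbeat period. Already tried 5 times.",
    "W0623 12:32:16.081887 22784 consensus_peers.cc:357] T 729d5f7c4678430fa7a99f3fbb76e9a7 "
    "P 63b4ef88fb84431ea93a79304c3b9bb8 -> Peer 6e748f9ff1944b5490f23631bcd351da "
    "(wewcw0hd3dn04.example.com:7050): Couldn't send request to peer "
    "6e748f9ff1944b5490f23631bcd351da for tablet 729d5f7c4678430fa7a99f3fbb76e9a7. "
    "Status: Remote error: Service unavailable: Soft memory limit exceeded "
    "(at 91.41% of capacity). Retrying in the next heartbeat period. Already tried 3 times.",
]

for addr, n in sorted(count_memory_pushback(sample).items()):
    print(addr, n)
```

Running the same function over a full tablet server log (for example, `count_memory_pushback(open(path))` with your own log path) would show whether the pressure really is limited to those two hosts.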
[root@wewcw0hd3dn03 ~]# kudu cluster ksck wewcw0hd3dn03.example.com
Connected to the Master
Fetched info from all 4 Tablet Servers
Table impala::compliance.stage_kudu_cp_country_flag is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_address is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_consumer_profile_entities is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_account is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_phone is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_consumer_profile is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_transaction_master is HEALTHY (32 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_id is HEALTHY (32 tablet(s) checked)
Table impala::compliance.my_first_table is HEALTHY (16 tablet(s) checked)
The metadata for 9 table(s) is HEALTHY
OK
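Since the memory recommendation depends on how many tablets each server hosts, it can help to total the tablet counts from the ksck output above. This is a rough sketch: it only recognizes the "Table ... is ... (N tablet(s) checked)" line shape shown here, and the replication factor of 3 and the even spread across servers are assumptions, not facts from this thread.

```python
import re

TABLE_RE = re.compile(
    r"^Table (?P<name>\S+) is (?P<status>\S+) \((?P<tablets>\d+) tablet\(s\) checked\)$"
)


def tablets_per_table(ksck_output: str) -> dict:
    """Map table name -> tablet count for lines in the format shown above."""
    result = {}
    for line in ksck_output.splitlines():
        match = TABLE_RE.match(line.strip())
        if match:
            result[match.group("name")] = int(match.group("tablets"))
    return result


def replicas_per_server(total_tablets: int, replication_factor: int, servers: int) -> int:
    """Average replicas per server, assuming an even spread (an assumption)."""
    return total_tablets * replication_factor // servers


KSCK_OUTPUT = """\
Table impala::compliance.stage_kudu_cp_country_flag is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_address is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_consumer_profile_entities is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_account is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_phone is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_consumer_profile is HEALTHY (16 tablet(s) checked)
Table impala::compliance.stage_kudu_transaction_master is HEALTHY (32 tablet(s) checked)
Table impala::compliance.stage_kudu_cp_id is HEALTHY (32 tablet(s) checked)
Table impala::compliance.my_first_table is HEALTHY (16 tablet(s) checked)
"""

tablets = tablets_per_table(KSCK_OUTPUT)
total = sum(tablets.values())
print("tables:", len(tablets), "tablets:", total)
# Replication factor 3 is assumed here; check the actual table settings.
print("approx. replicas per server:", replicas_per_server(total, 3, 4))
```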
Created 06-23-2017 09:42 AM
Ok so it seems the things you previously pasted were resolved when your Kudu cluster finished booting up.
When you say "same error", you mean on the client side you're still getting timeouts? I'm asking because what you're showing after that are warnings, not errors, and it's perfectly normal to get memory pushback when servers are too busy.
This can mean one of two things: either you're trying to insert faster than Kudu can ingest with the resources (disks/RAM/CPU) it has been given, or something is wrong with flushing and it's too slow. Having a much bigger log snippet from wewcw0hd3dn01 or wewcw0hd3dn04 would help.
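If the first case applies (writing faster than the cluster can ingest), one common client-side mitigation is to retry rejected batches with exponential backoff so the servers get time to flush. The sketch below is generic and not tied to any Kudu client API: `write_batch` and `TransientWriteError` are hypothetical stand-ins for whatever your client uses to send a batch and to signal a timeout or "service unavailable" response.

```python
import random
import time


class TransientWriteError(Exception):
    """Hypothetical error type for timeouts or memory pushback from the server."""


def write_with_backoff(write_batch, batch, max_attempts=5,
                       base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call write_batch(batch), retrying transient failures with backoff.

    The delay doubles after each failed attempt, is capped at max_delay,
    and is jittered so that many writers do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_batch(batch)
        except TransientWriteError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))


# Usage with a fake writer that fails twice before succeeding.
calls = {"n": 0}


def flaky_writer(batch):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientWriteError("simulated pushback")
    return len(batch)


delays = []
written = write_with_backoff(flaky_writer, list(range(227)), sleep=delays.append)
print("rows written:", written, "retries:", len(delays))
```

Backoff only smooths over bursts. If the servers stay above the soft limit, smaller batches or more memory and disks are still needed.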