I am trying to figure out why all my 3 tablet servers run out of memory, but it's hard to do.
Configuration: 3 tablet servers, each has memory_limit_hard_bytes set to 8GB. All Kudu operations are performed via Impala JDBC.
Symptoms: INSERT operations fail with errors like:
Service unavailable: Soft memory limit exceeded (at 101.27% of capacity
Tablet web interface (http://<kudu-tablet-server>:8050/mem-trackers) shows that all the memory is consumed:
Another table on this web page displays details about memory consumption:
|Id||Parent||Limit||Current Consumption||Peak consumption|
However, when I try to sum up all entries in the "Current Consumption" column, I get much lower value: 2.6 GB. My question: who ate the remaining (8-2.6) = 5.4 GB?
How I sum up all the entries?
- parse the HTML code of the web page and extract the table with detailed memory consumption
- save this table a CSV file
- read it in spark-shell
- convert all memory consumption values to bytes (from GB, MB, KB)
- finally, calculate the sum for the "Current Consumption" and "Peak consumption" columns
The memory is being used somewhere other than in places where the usage is tracked by the memtrackers. The first possibility that comes to mind is that it is being used by the block manager. Try running
kudu fs check -fs_wal_dir=<wal dir> -fs_data_dirs=<data dirs>
and posting the output here. You may need to run the command as the kudu user (usually 'kudu').
Another helpful thing would be to gather heap samples. See https://kudu.apache.org/docs/troubleshooting.html#heap_sampling. If you attach the resulting SVG file it would be very helpful in locating the memory usage.
Now the tablet server is out of memory:
Here is the heap sample from the failing node: https://sendeyo.com/up/d/d26504e2ef
Here is the output of the "kudu fs check" command:
$ kudu fs check -fs_wal_dir=/data/kudu -fs_data_dirs=/data/kudu I0319 10:23:40.103991 157327 fs_manager.cc:260] Metadata directory not provided I0319 10:23:40.104055 157327 fs_manager.cc:263] Using existing metadata directory in first data directory W0319 10:23:40.104472 157327 data_dirs.cc:615] IO error: Could not lock /data/kudu/data/block_manager_instance: Could not lock /data/kudu/data/block_manager_instance: lock /data/kudu/data/block_manager_instance: Resource temporarily unavailable (error 11) W0319 10:23:40.104492 157327 data_dirs.cc:616] Proceeding without lock I0319 10:23:40.109122 157327 fs_manager.cc:397] Time spent opening directory manager: real 0.005s user 0.000s sys 0.000s I0319 10:23:40.109161 157327 env_posix.cc:1634] Not raising this process' open files per process limit of 65535; it is already as high as it can go I0319 10:23:40.109180 157327 file_cache.cc:470] Constructed file cache lbm with capacity 26214 I0319 10:23:50.317314 157332 log_block_manager.cc:2367] Opened 584 log block containers in /data/kudu/data I0319 10:24:00.467881 157332 log_block_manager.cc:2367] Opened 700 log block containers in /data/kudu/data I0319 10:24:10.991781 157332 log_block_manager.cc:2367] Opened 766 log block containers in /data/kudu/data I0319 10:24:20.999274 157332 log_block_manager.cc:2367] Opened 882 log block containers in /data/kudu/data I0319 10:24:25.138837 157332 log_block_manager.cc:2437] Read-only block manager, skipping repair I0319 10:24:25.146647 157327 fs_manager.cc:417] Time spent opening block manager: real 45.037s user 0.000s sys 0.000s I0319 10:24:25.146875 157327 fs_manager.cc:428] Opened local filesystem: /data/kudu uuid: "d8b982ee9d6348a29153e4540c6c425a" format_stamp: "Formatted at 2017-12-20 17:21:05 on ossvert4" Block manager report -------------------- 1 data directories: /data/kudu/data Total live blocks: 1933969 Total live bytes: 61130157258 Total live bytes (after alignment): 66793267200 Total number of LBM containers: 906 (354 full) Total missing blocks: 0 Total orphaned blocks: 514 (0 repaired) Total orphaned block bytes: 609251411 (0 repaired) Total full LBM containers with extra space: 37 (0 repaired) Total full LBM container extra space in bytes: 2317418496 (0 repaired) Total incomplete LBM containers: 0 (0 repaired) Total LBM partial records: 0 (0 repaired)
The heap sample contains a surprising entry: "kudu HdrHistogram Init 4502.0 (54.4%)". Maybe I miss something, but it looks like an utility to collect some usage metrics. Why it consumed more than 50% of the memory allocated to the tablet server?