I want to configure Impala to get as much performance as possible for executing analytics queries on Kudu. I may use 70-80% of my cluster resources. I looked at the advanced flags in both Kudu and Impala. Some of them didn't make sense to me, and I couldn't find many resources on the internet that describe them. Can anybody suggest an optimal configuration to achieve this?
I have 15 datanodes, each with 16 cores, 128 GB RAM and 10x1 TB hard disks. I also have 3 separate servers for master nodes and other services (each with 16 cores and 256 GB RAM). I would appreciate any suggestions.
We generally try to make the default Impala configuration as good as possible to minimise tuning - there aren't really any --go_fast=true flags you can enable.
Usually the main setup decisions are about how to allocate memory between services. Impala often likes lots of memory, particularly if you're running complex queries on lots of data with many joins. If it doesn't have enough memory it may end up spilling data to disk and running more slowly (or with queries failing with "out of memory" in some cases). We have some docs about how to configure this with Cloudera Manager: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_howto_rm.html
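As a rough sketch of what that allocation looks like in practice, the Impala daemon's process memory can be capped with its -mem_limit startup flag, leaving the rest of each node's RAM for the Kudu tablet server and other services. The value below is purely illustrative, not a recommendation for your cluster:

```
# Illustrative impalad startup flag (value is an assumption, not a recommendation):
# cap Impala's process memory at ~60% of the node's RAM, leaving room
# for the Kudu tablet server, other services, and the OS page cache
-mem_limit=60%
```

How much to give Impala versus Kudu depends on whether your workload is more query-heavy or ingest-heavy, so it's worth measuring before settling on a split.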
The main things you can do to improve perf are to set up your data and query workloads right. There are some tips here, but a lot of them are specific to HDFS: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_cookbook.html
Someone else may be able to comment in more detail about Kudu.
Thanks for answering, Tim. I am not really expecting a silver-bullet flag like that. Can you please explain the following flags and their effects on Impala performance?
Impala 2.9 has several Impala-Kudu performance improvements.
IMPALA-4859 - Push down IS NULL / IS NOT NULL to Kudu
IMPALA-3742 - INSERTs into Kudu tables should partition and sort
IMPALA-5156 - Drop VLOG level passed into Kudu client - "In some simple concurrency testing, Todd found that reducing the vlog level resulted in an increase in throughput from ~17 qps to 60qps."
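To benefit from an improvement like IMPALA-3742, the Kudu table itself needs a sensible partitioning scheme, since Impala partitions and sorts the inserted rows to match the table's layout. A hypothetical example of creating a hash-partitioned Kudu table from Impala (the table and column names are made up for illustration):

```sql
-- Hypothetical example: a Kudu table hash-partitioned across tablets.
-- With IMPALA-3742, INSERTs into a table like this are partitioned and
-- sorted before being sent to Kudu, reducing client-side buffering.
CREATE TABLE events (
  event_id BIGINT,
  event_time TIMESTAMP,
  payload STRING,
  PRIMARY KEY (event_id)
)
PARTITION BY HASH (event_id) PARTITIONS 16
STORED AS KUDU;
```

The number of hash partitions here is arbitrary; it's typically chosen based on the number of tablet servers and expected data volume.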
Make sure you have a large enough MEM_LIMIT and limit the number of joins in your queries. Good luck 🙂
I hope my response didn't come across as facetious. There are a lot of database products on the market that *do* ship with suboptimal configurations or require a lot of tuning. With Impala we do try to avoid that, by designing features so that they're not overly sensitive to tuning parameters and by choosing default values that give good performance.
My main advice for tuning Impala is just to make sure that it has enough memory to execute all of the queries in your workload in memory. And run "compute stats" on your tables to help make sure that you get good execution plans.
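Both of those can be done from impala-shell; the table name below is hypothetical:

```sql
-- Gather table and column statistics so the planner can choose good
-- join orders and join strategies (table name is hypothetical):
COMPUTE STATS my_kudu_table;

-- Optionally cap per-query memory for the current session:
SET MEM_LIMIT=10gb;
```

Re-running COMPUTE STATS after large data changes keeps the plans accurate.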
I wouldn't recommend changing any of those flags - they're mostly just safety valves for rare cases where the defaults cause unanticipated problems. The only one that directly relates to Kudu is --kudu_mutation_buffer_size, which controls the amount of memory used in the Kudu client for buffering inserts/updates. --kudu_sink_mem_required should be updated in sync with --kudu_mutation_buffer_size so that it stays at 2x that value.
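If you did raise the buffer size anyway, the two flags would be changed together in the impalad startup options. The values below are illustrative only - as noted above, the defaults are usually fine:

```
# Illustrative only - the defaults are usually fine.
# Memory the Kudu client may use to buffer inserts/updates:
-kudu_mutation_buffer_size=20971520   # 20 MB
# Should be kept in sync at 2x kudu_mutation_buffer_size:
-kudu_sink_mem_required=41943040      # 40 MB
```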