When using impala 2.10, I have observed a concurency issues on queries disks bound (=reading a lot of data and doing few computation). Response time for those queries increase linearly (2 queries are 2 time slower, 10 queries are 10 times slower) and changing the number of scanning threads do not change the performances. Of course, the limitation is well below the cluster I/O capacity.
So I have two questions:
How do we know the maximum number of scanning threads per query, as the documentation mention a maximum without precising it:
Is there a global limit or a global pool of scanning threads? So that the more queries in flights, the less scanning threads they can use?
@JulienMaria I think the docs you are reffering to explains it quite well. Impala by default uses as many scanning threads as many cores (in HT virtual cores) it has on each Impala Daemon node. So for example if you have 16 core servers, every server will use 16 scanners to get the data for every query. So in case you will run 2 queries at the same time, it will use 32 scanners.
But there are some cases where you dont want to allow your query to use all the cores for scanning, so you lower this limit. You cannot increase it for higher value, just decrease.
At least this is my understanding of the documentation..
@Tomas79yeah that's roughly correct. The scanner threads are limited per-scan to the number of cores and globally to 3x the number of cores. The number of scanner threads created depends on whether memory is available and whether there's enough scan ranges (i.e. input files) in order to parallelise the scan.
Are you saying that the queries are neither CPU-bound or IO bound? In most cases we see Impala hit one or the other limit - either it ends up being limited by CPU - since scanning can be quite CPU-intensive or it's limited by I/O.