Created on 12-06-2017 10:15 PM - edited 09-16-2022 05:36 AM
We have been testing LLAP concurrency for a couple months now. A little info on our test cluster below.
We can get results back pretty quickly (under 2 seconds) with fewer than 12 concurrent queries. However, once concurrency reaches 18 and above, we start seeing longer runtimes and a lot of killed task attempts (see image), which I believe contribute to the longer runtime. Is this enough information to figure out which configuration changes would avoid those killed tasks?
HDP version 2.6
Thanks.
Created 12-12-2017 01:11 AM
Yeah, I suspect this is where the cluster is reaching capacity. Killed task attempts are probably of two types: task attempts rejected because an LLAP daemon is full and won't accept new work, and killed opportunistic non-finishable tasks (preemption). The latter happen because Hive starts some tasks (especially reducers) before all of their inputs are ready, so they can download inputs from upstream tasks that have already finished while waiting for the remaining upstream tasks. When parallel queries want to run a task that can run and finish immediately, it preempts non-finishable tasks (otherwise a task that is potentially doing nothing, just waiting for something else to finish, could take resources away from tasks that are ready). It is normal for the amount of preemption to increase with a high volume of concurrent queries. The only way to check whether there are any other (potentially problematic) kills is to look at the logs.
If cache locality is not as important for these queries, you can try reducing hive.llap.task.scheduler.locality.delay, which may lead to faster task scheduling (-1 means an infinite delay; the minimum otherwise is 0).
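For example, a minimal sketch (on HDP 2.6 this would typically go into the hive-interactive-site configuration used by the LLAP HiveServer2; the 0ms value is only an illustration and should be tuned for your workload):

<property>
  <name>hive.llap.task.scheduler.locality.delay</name>
  <!-- 0ms = hand the task to any free executor immediately; -1 = wait indefinitely for a node with local/cached data -->
  <value>0ms</value>
</property>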
However, once the cluster is at capacity, it's not realistic to expect sub-linear scaling; at that point, individual query runtime improvements would also improve the aggregate runtime.
Created 12-08-2017 07:56 PM
If 6 queries take 4 seconds, scaling linearly would mean 50 queries should take ~33 seconds if the cluster were at capacity. I'm assuming there's also some slack; however, unless the queries are literally 2-3 tasks each, that probably means the cluster reaches capacity at some point and tasks from different queries start queueing up (even if they do run faster due to cache, JIT, etc.).

Are the 54 executors per machine, or in total? By 50% capacity, do you mean 4 LLAP daemons each using 50% of a node, or 2 LLAP daemons each using a full node?

In the absence of other information, 13 seconds (which is much better than linear scaling) actually looks pretty good. One would need to look at query details, and especially at cluster utilization with different numbers of queries, to determine whether you need parallel-query improvements or (more likely) single-query improvements, because the tasks are simply waiting for other tasks while the cluster is at capacity.
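(Spelling out the linear estimate: 4 s × 50 / 6 ≈ 33 s, so the observed 13 s is well under it.)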
Also, 4 concurrent queries is just a reasonable default for most people starting out; parallel query capacity actually requires some resources (the AMs have to be kept running, otherwise AM startup would affect performance a lot), so we chose a value on the low end.
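In case it helps, a sketch of where that knob lives (assuming the usual HDP 2.6 / Ambari LLAP setup, where the number of concurrent query AMs is controlled by hive.server2.tez.sessions.per.default.queue in the interactive HiveServer2 configuration; the value of 18 below is only an illustration):

<property>
  <name>hive.server2.tez.sessions.per.default.queue</name>
  <!-- one Tez AM is kept running per concurrent query slot, so each extra slot costs an AM container up front -->
  <value>18</value>
</property>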
As for memory usage per query, the counters printed at the end of the query and shown in the Tez UI report the cache/IO memory used; however, Java memory use by query execution is not tracked well right now.
Created 12-11-2017 11:35 PM
Thanks for answering my original questions, but I have to revise them because of a coding issue; the new result is actually better. I list the original questions below for reference.