Created on 12-06-2017 10:15 PM - edited 09-16-2022 05:36 AM
We have been testing LLAP concurrency for a couple months now. A little info on our test cluster below.
We can get results back pretty quickly (under 2 seconds) with fewer than 12 concurrent queries. However, once concurrency reaches 18 and above, we start seeing longer runtimes and a lot of killed task attempts (see image), which I believe contribute to the longer runtime. Is this enough information to figure out which configuration changes would avoid those killed tasks?
HDP version 2.6
Thanks.
Created 12-12-2017 01:11 AM
Yeah, I suspect this is where the cluster is reaching capacity. Killed task attempts are probably of two types: task attempts rejected because an LLAP daemon is full and won't accept new work, and killed opportunistic non-finishable tasks (preemption). The latter happen because Hive starts some tasks (especially reducers) before all of their inputs are ready, so they can download inputs from upstream tasks that have already finished while waiting for the remaining upstream tasks. When parallel queries want to run a task that can run and finish immediately, it preempts non-finishable tasks (otherwise a task that is potentially doing nothing, just waiting for something else to finish, could take resources away from tasks that are ready). It is normal for the amount of preemption to increase with a high volume of concurrent queries. The only way to check whether there are any other (potentially problematic) kills is to look at the logs.
If cache locality is not as important for these queries, you can try reducing hive.llap.task.scheduler.locality.delay, which may lead to faster task scheduling (-1 means an infinite delay; the minimum otherwise is 0).
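For example, a minimal sketch (on HDP 2.6 this would typically go into the hive-interactive-site configuration used by the LLAP HiveServer2; the 0ms value is only an illustration and should be tuned for your workload):

<property>
  <name>hive.llap.task.scheduler.locality.delay</name>
  <!-- 0ms = hand the task to any free executor immediately; -1 = wait indefinitely for a node with local/cached data -->
  <value>0ms</value>
</property>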
However, once the cluster is at capacity, it's not realistic to expect sub-linear scaling; at that point, individual query runtime improvements would also improve the aggregate runtime.
Created 12-08-2017 07:56 PM
If 6 queries take 4 seconds, scaling linearly would mean 50 queries should take ~33 seconds if the cluster were at capacity. I'm assuming there's also some slack; however, unless the queries are literally 2-3 tasks each, that probably means the cluster reaches capacity at some point and tasks from different queries start queueing up (even if they do run faster due to cache, JIT, etc.).

Are the 54 executors per machine, or in total? By 50% capacity, do you mean 4 LLAP daemons each using 50% of a node, or 2 LLAP daemons each using a full node?

In the absence of other information, 13 seconds (which is much better than linear scaling) actually looks pretty good. One would need to look at query details, and especially at cluster utilization with different numbers of queries, to determine whether you need parallel-query improvements or (more likely) single-query improvements, because the tasks are simply waiting for other tasks while the cluster is at capacity.
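(Spelling out the linear estimate: 4 s × 50 / 6 ≈ 33 s, so the observed 13 s is well under it.)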
Also, 4 concurrent queries is just a reasonable default for most people starting out; parallel query capacity actually requires some resources (the AMs have to be kept running, otherwise AM startup would affect performance a lot), so we chose a value on the low end.
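In case it helps, a sketch of where that knob lives (assuming the usual HDP 2.6 / Ambari LLAP setup, where the number of concurrent query AMs is controlled by hive.server2.tez.sessions.per.default.queue in the interactive HiveServer2 configuration; the value of 18 below is only an illustration):

<property>
  <name>hive.server2.tez.sessions.per.default.queue</name>
  <!-- one Tez AM is kept running per concurrent query slot, so each extra slot costs an AM container up front -->
  <value>18</value>
</property>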
As for memory usage per query, the counters printed at the end of the query and shown in the Tez UI report the cache/IO memory used; however, Java memory use by query execution is not tracked well right now.
Created 12-11-2017 11:35 PM
Thanks for answering my original questions, but I have to revise them because of a coding issue; the new result is actually better. I list the original questions below for reference.