LLAP Concurrent Queries - Killed Task Attempts

Explorer

I'm investigating running concurrent queries in LLAP (HDP 2.6.1).

When running more than one query, I see non-zero "Killed Task Attempts" on the DAG page in Ambari (Tez View -> DAG ID link). In some cases the number of "Killed Task Attempts" runs into the hundreds. The logs show messages of the form:

2017-12-05T12:13:42,003  INFO [IPC Server handler 2 on 33976 (1512469369970_0013_2_00_000022_1)] impl.TaskRunnerCallable: Kill task requested for id=attempt_1512469369970_0013_2_00_000022_1, taskRunnerSetup=false
2017-12-05T12:13:42,003  INFO [IPC Server handler 2 on 33976 (1512469369970_0013_2_00_000022_1)] impl.ContainerRunnerImpl: SubmissionState for attempt_1512469369970_0013_2_00_000022_1 : REJECTED
2017-12-05T12:13:42,009  INFO [IPC Server handler 4 on 33976 (1512469369970_0013_2_00_000004_1)] impl.TaskExecutorService: wait queue full, size=10. numSlotsAvailable=0, runningFragmentCount=3. attempt_1512469369970_0013_2_00_000004_1 not added

In some cases the SubmissionState is EVICTED_OTHER rather than REJECTED.
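To put numbers on how often each SubmissionState appears, the daemon log can be tallied with a short script. The following is a minimal sketch that assumes the log lines have exactly the shape of the excerpt above; the function name `summarize` and the regular expressions are illustrative and are not part of Hive or LLAP.

```python
import re
from collections import Counter

# Patterns modelled on the three log lines quoted above (an assumption:
# other Hive versions may word these messages differently).
SUBMISSION_RE = re.compile(r"SubmissionState for (attempt_\S+) : (\w+)")
WAIT_QUEUE_RE = re.compile(
    r"wait queue full, size=(\d+)\. numSlotsAvailable=(\d+), "
    r"runningFragmentCount=(\d+)\. (attempt_\S+) not added"
)

def summarize(lines):
    """Count SubmissionState values and 'wait queue full' events."""
    states = Counter()
    queue_full = 0
    for line in lines:
        match = SUBMISSION_RE.search(line)
        if match:
            states[match.group(2)] += 1
        if WAIT_QUEUE_RE.search(line):
            queue_full += 1
    return states, queue_full

sample = [
    "2017-12-05T12:13:42,003  INFO [IPC Server handler 2 on 33976 "
    "(1512469369970_0013_2_00_000022_1)] impl.TaskRunnerCallable: Kill task "
    "requested for id=attempt_1512469369970_0013_2_00_000022_1, taskRunnerSetup=false",
    "2017-12-05T12:13:42,003  INFO [IPC Server handler 2 on 33976 "
    "(1512469369970_0013_2_00_000022_1)] impl.ContainerRunnerImpl: SubmissionState "
    "for attempt_1512469369970_0013_2_00_000022_1 : REJECTED",
    "2017-12-05T12:13:42,009  INFO [IPC Server handler 4 on 33976 "
    "(1512469369970_0013_2_00_000004_1)] impl.TaskExecutorService: wait queue full, "
    "size=10. numSlotsAvailable=0, runningFragmentCount=3. "
    "attempt_1512469369970_0013_2_00_000004_1 not added",
]

states, queue_full = summarize(sample)
print(dict(states), queue_full)
```

Running it over a full daemon log (for example, `summarize(open(path))` with a path of your own) would show whether REJECTED or EVICTED_OTHER dominates and how often the wait queue was full.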

The query in isolation executed in 78s. When two executions of the same query run concurrently the execution times were 125s and 129s. As the number of concurrent executions is increased the mean elapsed time continues to rise. There is variation in elapsed time between runs whenever there is more than 1 query execution at a time. There is also a correlation between the elapsed time and the number of "Killed Task Attempts". The following is from executing the query 3 times concurrently:

Execution # Elapsed Time/s Killed Task Attempts
----------- -------------- --------------------
1                       90                    1
2                      159                  110
3                      190                  182
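The correlation mentioned above can be checked directly against the three rows of the table. This is a minimal sketch using a hand-written Pearson coefficient; the helper name `pearson` is illustrative, and three data points are far too few to draw a firm conclusion.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Values copied from the table above (3 concurrent executions).
elapsed_s = [90, 159, 190]
killed_attempts = [1, 110, 182]

r = pearson(elapsed_s, killed_attempts)
print(f"r = {r:.3f}")
```

For these three runs the coefficient comes out close to 1, which is consistent with the observation but does not say whether the killed attempts cause the longer run time or are just a side effect of contention.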

I'd be very grateful for guidance on what is happening and why.

Am I hitting https://issues.apache.org/jira/browse/HIVE-15779?

Thanks,

Martin

Re: LLAP Concurrent Queries - Killed Task Attempts

Expert Contributor

What are your relevant Hive config parameters such as # concurrent queries? How much memory do you have per node and how many nodes do you have? How much of that memory is allocated to LLAP?

There are various things that could be the issue so we will have to go through and check them.

Re: LLAP Concurrent Queries - Killed Task Attempts

Explorer

Hi anarasimham,

Thanks for your response. To answer your questions:

  • # concurrent queries: 3
  • Memory per node: 62G
  • Number of nodes: 3
  • Memory allocated to LLAP: "Memory per Daemon" is 18432; llap0 shows 56320MB allocated in YARN ResourceManager UI

Please let me know if there is anything else you would like to know.

Thanks,

Martin

Re: LLAP Concurrent Queries - Killed Task Attempts

Contributor

Hi, this version should already have the fix for this issue (HIVE-15779).

As of 2.6.x, there is no workload management or balancing functionality in LLAP, so a lot of cancellations are usually reported for parallel queries. Some of them are a reporting issue (rejected tasks are expected when the AM is probing LLAPs for empty space; this is planned to be fixed in 3.0). Still, if I understand correctly, going from 78s to 129s means that running 2 queries in parallel is faster than running them sequentially. Assuming a single query already fully utilizes the cluster, that is a good result: both queries get roughly half the allocation, so each effectively runs on half the cluster and takes longer. Is this the case?
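The sequential-versus-parallel comparison can be written out as a small calculation. This is a minimal sketch using only the timings quoted in the question; the function name `compare` is illustrative, and it assumes a sequential run would take the single-query time multiplied by the number of queries.

```python
def compare(single_query_s, concurrent_times_s):
    """Return (sequential total, concurrent makespan, speedup of concurrent run)."""
    sequential_total = single_query_s * len(concurrent_times_s)
    makespan = max(concurrent_times_s)  # the batch finishes with the slowest query
    return sequential_total, makespan, sequential_total / makespan

# Two concurrent executions: 125s and 129s, against 78s in isolation.
seq2, span2, speedup2 = compare(78, [125, 129])

# Three concurrent executions: 90s, 159s and 190s.
seq3, span3, speedup3 = compare(78, [90, 159, 190])

print(seq2, span2, round(speedup2, 2))
print(seq3, span3, round(speedup3, 2))
```

In both cases the concurrent batch finishes sooner than the estimated sequential total (129s against 156s, and 190s against 234s), even though each individual query takes longer than it does in isolation.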

In 3.0, we are implementing workload management for LLAP for better resource sharing, and the ability to specify policies for queries (sharing the cluster fairly, or in some proportions, etc.)

Re: LLAP Concurrent Queries - Killed Task Attempts

Explorer

Thank you for your response. Indeed, 129s is less than 2 x 78s.

I've looked into cluster utilisation and your point about the cluster being fully utilised when executing the single query is well made.

Re: LLAP Concurrent Queries - Killed Task Attempts

Explorer

Hi @Sergey Shelukhin, can you provide details on the Hive-15579 patch already being in 2.6.1? I don't see it listed in the Release Notes anywhere and we are having issues with LLAP and literally thousands of Killed Task Attempts for complex queries, though in most cases they do eventually finish successfully. That makes me wonder if we're seeing this bug because the tasks get killed and then restarted and run successfully. I appreciate your thoughts. Thanks!

Re: LLAP Concurrent Queries - Killed Task Attempts

Explorer

I mean Hive-15779 - typo above. Thanks!

Re: LLAP Concurrent Queries - Killed Task Attempts

New Contributor

We are observing the same issue when we run large queries or too many queries. @Sergey Shelukhin, has it been fixed in 2.6.4? The query ultimately fails with a vertex error. We have been unable to find the reason so far.

Re: LLAP Concurrent Queries - Killed Task Attempts

New Contributor

Also, regarding "wait queue full, size=10": is the queue size configurable? We've looked at all our Hive and Tez configs and haven't found any parameter that controls this.