Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

AGGREGATE of query is to long

SOLVED Go to solution

AGGREGATE of query is to long

Contributor

Hi,

I am running impala 2.5 on cdh 5.7.3.

I am currently bechmarking a simple query :

 select count(*),`session_id` from flat_table group by `session_id` limit 10;

Here is the results of 'summary' :

+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                                  |
+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+
| 04:EXCHANGE  | 1      | 13.63us  | 13.63us  | 10      | 10         | 0 B       | -1 B          | UNPARTITIONED                           |
| 03:AGGREGATE | 6      | 1.11s    | 1.15s    | 60      | 247.06M    | 171.09 MB | 128.00 MB     | FINALIZE                                |
| 02:EXCHANGE  | 6      | 86.76ms  | 92.08ms  | 12.94M  | 247.06M    | 0 B       | 0 B           | HASH(session_id)                |
| 01:AGGREGATE | 6      | 4.07s    | 6.14s    | 12.94M  | 247.06M    | 525.03 MB | 128.00 MB     | STREAMING                               |
| 00:SCAN HDFS | 6      | 337.83ms | 494.40ms | 268.67M | 247.06M    | 145.36 MB | 88.00 MB      | flat_table |
+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+

We can easily see that most of the time is going into the aggrerate part. And I have a lot of query that have the same botleneck. 

I have control over hardware and  impala configuration. The table is  parquet table, cached in hdfs and with incremental stats for each partition.

Am I missing something or is this expected performances for a query like this?

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions

Re: AGGREGATE of query is to long

Master Collaborator

No I don't think you're missing any obvious optimisation. Yes we only use a single core per aggregation per Impala daemon. This is obviously not ideal so we have a big push right now to do full parallelization of every operator.

5 REPLIES 5

Re: AGGREGATE of query is to long

Master Collaborator

It's aggregating 10 million rows per core per second which is within expectations - the main factor affecting performance

 

We are currently working on multi-threaded joins and aggregation, which would increase the level of parallelism available in this case. There were also some improvements to the aggregation in Impala 2.6 (https://issues.cloudera.org/browse/IMPALA-3286) that might improve throughput a bit (I'd guess somewhere between 10% to 80% speedup depending on the input data).

Re: AGGREGATE of query is to long

Contributor

Hi,
I will update to 2.6 over the week end and post the results.
I have 32 cores per hosts available to impala daemon.
If you say that 10 million record are being process in parallel, I guess you imply that only one core is used by host (268M rows/6hosts/4 sec = ~11million).
Is it expected to have only 1 core use per Node ? Did I miss something in the configuration?
Or is it because of the multi-threaded aggregation improvement that you are working on ?
I just want to make sure I didn't miss any obvious optimization.

And just to tell you the column is of type "string".

thanks

Re: AGGREGATE of query is to long

Master Collaborator

No I don't think you're missing any obvious optimisation. Yes we only use a single core per aggregation per Impala daemon. This is obviously not ideal so we have a big push right now to do full parallelization of every operator.

Re: AGGREGATE of query is to long

Contributor
Hi,
I upgraded impala to 2.6. The query aggregation improved by about 15%.
I there a open ticket or an expected release date/version for the "full parallelization" ?

thanks

Re: AGGREGATE of query is to long

Master Collaborator

Thanks for the data point :). 

 

We're tracking the parallelisation work here: https://issues.cloudera.org/browse/IMPALA-3902 . It's probably going to get enabled in phases - we may have parallelisation for aggregations before joins for example.