<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: AGGREGATE of query is to long in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45496#M41392</link>
    <description>&lt;P&gt;No, I don't think you're missing any obvious optimisation. Yes, we only use a single core per aggregation per Impala daemon. This is obviously not ideal, so we have a big push right now to do full parallelization of every operator.&lt;/P&gt;</description>
    <pubDate>Fri, 23 Sep 2016 17:41:00 GMT</pubDate>
    <dc:creator>Tim Armstrong</dc:creator>
    <dc:date>2016-09-23T17:41:00Z</dc:date>
    <item>
      <title>AGGREGATE of query is to long</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45408#M41389</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am running Impala 2.5 on CDH 5.7.3.&lt;/P&gt;&lt;P&gt;I am currently benchmarking a simple query:&lt;/P&gt;&lt;PRE&gt;select count(*), `session_id` from flat_table group by `session_id` limit 10;&lt;/PRE&gt;&lt;P&gt;Here are the results of 'summary':&lt;/P&gt;&lt;PRE&gt;+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                                  |
+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+
| 04:EXCHANGE  | 1      | 13.63us  | 13.63us  | 10      | 10         | 0 B       | -1 B          | UNPARTITIONED                           |
| 03:AGGREGATE | 6      | 1.11s    | 1.15s    | 60      | 247.06M    | 171.09 MB | 128.00 MB     | FINALIZE                                |
| 02:EXCHANGE  | 6      | 86.76ms  | 92.08ms  | 12.94M  | 247.06M    | 0 B       | 0 B           | HASH(session_id)                        |
| 01:AGGREGATE | 6      | 4.07s    | 6.14s    | 12.94M  | 247.06M    | 525.03 MB | 128.00 MB     | STREAMING                               |
| 00:SCAN HDFS | 6      | 337.83ms | 494.40ms | 268.67M | 247.06M    | 145.36 MB | 88.00 MB      | flat_table                              |
+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+&lt;/PRE&gt;&lt;P&gt;We can easily see that most of the time goes into the aggregate part, and I have a lot of queries with the same bottleneck.&lt;/P&gt;&lt;P&gt;I have control over the hardware and the Impala configuration. The table is a Parquet table, cached in HDFS, with incremental stats for each partition.&lt;/P&gt;&lt;P&gt;Am I missing something, or is this the expected performance for a query like this?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:40:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45408#M41389</guid>
      <dc:creator>maurin</dc:creator>
      <dc:date>2022-09-16T10:40:33Z</dc:date>
    </item>
    <item>
      <title>Re: AGGREGATE of query is to long</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45443#M41390</link>
      <description>&lt;P&gt;It's aggregating 10 million rows per core per second, which is within expectations - the main factor affecting performance&lt;/P&gt;&lt;P&gt;We are currently working on multi-threaded joins and aggregation, which would increase the level of parallelism available in this case. There were also some improvements to aggregation in Impala 2.6 (&lt;A href="https://issues.cloudera.org/browse/IMPALA-3286" target="_blank"&gt;https://issues.cloudera.org/browse/IMPALA-3286&lt;/A&gt;) that might improve throughput a bit (I'd guess somewhere between a 10% and 80% speedup, depending on the input data).&lt;/P&gt;</description>
      <pubDate>Thu, 22 Sep 2016 19:46:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45443#M41390</guid>
      <dc:creator>Tim Armstrong</dc:creator>
      <dc:date>2016-09-22T19:46:46Z</dc:date>
    </item>
    <item>
      <title>Re: AGGREGATE of query is to long</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45459#M41391</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;I will update to 2.6 over the weekend and post the results.&lt;BR /&gt;I have 32 cores per host available to the Impala daemon.&lt;BR /&gt;If 10 million rows per second are being processed, I guess that implies only one core is used per host (268M rows / 6 hosts / 4 sec = ~11 million rows per second).&lt;BR /&gt;Is it expected to have only one core used per node? Did I miss something in the configuration?&lt;BR /&gt;Or is it because of the multi-threaded aggregation improvement that you are working on?&lt;BR /&gt;I just want to make sure I didn't miss any obvious optimization.&lt;/P&gt;&lt;P&gt;For reference, the column is of type "string".&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 23 Sep 2016 03:26:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45459#M41391</guid>
      <dc:creator>maurin</dc:creator>
      <dc:date>2016-09-23T03:26:55Z</dc:date>
    </item>
    <item>
      <title>Re: AGGREGATE of query is to long</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45496#M41392</link>
      <description>&lt;P&gt;No, I don't think you're missing any obvious optimisation. Yes, we only use a single core per aggregation per Impala daemon. This is obviously not ideal, so we have a big push right now to do full parallelization of every operator.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Sep 2016 17:41:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45496#M41392</guid>
      <dc:creator>Tim Armstrong</dc:creator>
      <dc:date>2016-09-23T17:41:00Z</dc:date>
    </item>
    <item>
      <title>Re: AGGREGATE of query is to long</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45611#M41393</link>
      <description>Hi,&lt;BR /&gt;I upgraded Impala to 2.6. The query aggregation improved by about 15%.&lt;BR /&gt;Is there an open ticket or an expected release date/version for the "full parallelization"?&lt;BR /&gt;&lt;BR /&gt;Thanks</description>
      <pubDate>Mon, 26 Sep 2016 19:52:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45611#M41393</guid>
      <dc:creator>maurin</dc:creator>
      <dc:date>2016-09-26T19:52:43Z</dc:date>
    </item>
    <item>
      <title>Re: AGGREGATE of query is to long</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45645#M41394</link>
      <description>&lt;P&gt;Thanks for the data point :).&lt;/P&gt;&lt;P&gt;We're tracking the parallelisation work here: &lt;A href="https://issues.cloudera.org/browse/IMPALA-3902" target="_blank"&gt;https://issues.cloudera.org/browse/IMPALA-3902&lt;/A&gt;. It's probably going to get enabled in phases - we may have parallelisation for aggregations before joins, for example.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Sep 2016 16:50:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/AGGREGATE-of-query-is-to-long/m-p/45645#M41394</guid>
      <dc:creator>Tim Armstrong</dc:creator>
      <dc:date>2016-09-27T16:50:33Z</dc:date>
    </item>
  </channel>
</rss>

