Member since: 07-29-2015
Posts: 535
Kudos Received: 141
Solutions: 103
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 7586 | 12-18-2020 01:46 PM |
 | 4971 | 12-16-2020 12:11 PM |
 | 3785 | 12-07-2020 01:47 PM |
 | 2471 | 12-07-2020 09:21 AM |
 | 1613 | 10-14-2020 11:15 AM |
09-27-2016
09:50 AM
1 Kudo
Thanks for the data point :). We're tracking the parallelisation work here: https://issues.cloudera.org/browse/IMPALA-3902. It's probably going to be enabled in phases - we may have parallelisation for aggregations before joins, for example.
09-23-2016
10:41 AM
No, I don't think you're missing any obvious optimisation. Yes, we only use a single core per aggregation per Impala daemon. This is obviously not ideal, so we have a big push right now to do full parallelisation of every operator.
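For reference, that work later surfaced as the MT_DOP query option. A minimal sketch, assuming a release where MT_DOP covers your aggregation (the table name here is hypothetical):

-- MT_DOP requests multi-threaded execution within each daemon
-- (added in later Impala releases; operator coverage varies by version).
SET MT_DOP=4;
SELECT category, COUNT(*) AS cnt
FROM sales_events
GROUP BY category;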
09-22-2016
12:46 PM
It's aggregating 10 million rows per core per second, which is within expectations - the main factor limiting performance here is that the aggregation runs on a single core per daemon. We are currently working on multi-threaded joins and aggregation, which would increase the level of parallelism available in this case. There were also some improvements to the aggregation in Impala 2.6 (https://issues.cloudera.org/browse/IMPALA-3286) that might improve throughput a bit (I'd guess somewhere between a 10% and 80% speedup, depending on the input data).
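As a back-of-envelope sketch of that expectation (the row and node counts below are hypothetical, not from this thread): with one aggregation core per daemon, expected aggregation time is roughly rows / (nodes x 10M rows/sec).

-- Hypothetical: 2 billion input rows across 10 daemons,
-- each aggregating on one core at ~10M rows/sec.
SELECT 2000000000 / (10 * 10000000) AS est_agg_seconds;  -- ~20 seconds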
09-13-2016
02:35 PM
Eric's suggestion is the general solution to this problem - without stats, Impala chooses a bad join order, and there are a lot of duplicates on the right side of the join. One workaround is to add a straight_join hint, which lets you control the order in which the tables are joined. I believe in your case just adding straight_join will flip the sides of the join, which will almost certainly help you:

SELECT STRAIGHT_JOIN `dim_experiment`.`experiment_name` AS `experiment_name`
FROM `gwynniebee_bi`.`fact_recommendation_events` `fact_recommendatio`
LEFT OUTER JOIN `gwynniebee_bi`.`dim_experiment` `dim_experiment`
  ON (`fact_recommendatio`.`experiment_key` = `dim_experiment`.`experiment_key`)
GROUP BY 1
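And the general fix - computing stats so the planner can pick a good join order on its own - would look like this for the tables above:

COMPUTE STATS gwynniebee_bi.fact_recommendation_events;
COMPUTE STATS gwynniebee_bi.dim_experiment;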
08-15-2016
09:54 AM
This is a known issue that we're actively working on: https://issues.cloudera.org/browse/IMPALA-2567. Your analysis is accurate. Part of the problem is the number of connections, and the other part is the number of threads per connection. You may be able to change some operating system config settings to increase the limits here (depending on which limit you're hitting). To reduce the number of TCP connections required, you would need to either reduce the number of fragments or reduce the number of nodes executing the query. You could reduce the number of fragments by breaking the query up into smaller queries, e.g. creating temporary tables with the results of some of the subqueries. You could also try executing the query on a single node by setting num_nodes=1, if the data size is small enough that this makes sense. I suspect your query is too large for that to work, but it's hard to tell (that's a huge query plan!)
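A sketch of both workarounds, with hypothetical table and column names (not from the original question):

-- Workaround 1: materialise a subquery so the remaining queries
-- have fewer plan fragments.
CREATE TABLE tmp_subquery_result AS
SELECT k, COUNT(*) AS cnt
FROM big_table
GROUP BY k;

-- Workaround 2: run the (small enough) query on a single node.
SET NUM_NODES=1;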
08-05-2016
05:35 PM
I'm not the most knowledgeable person about this part of the code, but what you're saying is correct. One of the likely causes of long wait times is the receiver consuming data more slowly than the sender is sending it.
06-30-2016
08:53 AM
1 Kudo
I think we already have an open issue for this that is being actively worked on: https://issues.cloudera.org/browse/IMPALA-3210. I.e. we don't support it yet, but it's in the pipeline.
06-28-2016
07:53 AM
The only way to do this with zero work would be to use a view: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_view.html Otherwise you do have to run the queries as part of your data pipeline, as you mentioned.
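A minimal sketch of the view approach (all names hypothetical):

CREATE VIEW daily_totals AS
SELECT event_date, SUM(amount) AS total
FROM raw_events
GROUP BY event_date;

Anyone querying daily_totals then gets the transformation applied on the fly, with no extra pipeline step.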
06-24-2016
12:25 PM
That probably makes sense if the bottleneck is evaluating the WHERE clause. If those extra rows would be filtered out in the join anyway, the gain is limited, since you end up filtering them either during the scan or when evaluating the simple join condition. Our scans are multithreaded too, so if the join is the bottleneck, making the scans do more work sometimes doesn't slow down the query overall.
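To illustrate (hypothetical tables and predicate), filtering in a subquery moves the work into the multithreaded scan rather than the join:

SELECT f.k, d.name
FROM (SELECT k FROM fact_table WHERE amount > 100) f  -- filter evaluated during the scan
JOIN dim_table d ON f.k = d.k;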
06-24-2016
11:50 AM
The main difference seems to be execution skew. In the second profile the max time for the join is over 3 minutes, compared to much lower in the first profile. The average time isn't very different between the profiles. Probably the partitioning resulted in the data being distributed differently between the nodes, and for some reason that one node is slower. It doesn't look like it's necessarily processing more data, but maybe the node is more heavily loaded, or the data is somehow different. Is the join condition something complicated? It's only processing a few thousand rows per second through the join, which is very low.