Member since: 07-29-2015
Posts: 535
Kudos Received: 141
Solutions: 103
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 7586 | 12-18-2020 01:46 PM |
 | 4971 | 12-16-2020 12:11 PM |
 | 3785 | 12-07-2020 01:47 PM |
 | 2471 | 12-07-2020 09:21 AM |
 | 1613 | 10-14-2020 11:15 AM |
09-27-2016
09:50 AM
1 Kudo
Thanks for the data point :). We're tracking the parallelisation work here: https://issues.cloudera.org/browse/IMPALA-3902. It's probably going to be enabled in phases - we may have parallelisation for aggregations before joins, for example.
09-23-2016
10:41 AM
No, I don't think you're missing any obvious optimisation. Yes, we only use a single core per aggregation per Impala daemon. This is obviously not ideal, so we have a big push right now to do full parallelisation of every operator.
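For reference, that work later surfaced as the MT_DOP query option. A minimal sketch, assuming a release where MT_DOP covers your aggregation (the table name here is hypothetical):

-- MT_DOP requests multi-threaded execution within each daemon
-- (added in later Impala releases; operator coverage varies by version).
SET MT_DOP=4;
SELECT category, COUNT(*) AS cnt
FROM sales_events
GROUP BY category;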
09-22-2016
12:46 PM
It's aggregating 10 million rows per core per second, which is within expectations - the main factor limiting performance here is that the aggregation runs on a single core per daemon. We are currently working on multi-threaded joins and aggregation, which would increase the level of parallelism available in this case. There were also some improvements to the aggregation in Impala 2.6 (https://issues.cloudera.org/browse/IMPALA-3286) that might improve throughput a bit (I'd guess somewhere between a 10% and 80% speedup, depending on the input data).
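As a back-of-envelope sketch of that expectation (the row and node counts below are hypothetical, not from this thread): with one aggregation core per daemon, expected aggregation time is roughly rows / (nodes x 10M rows/sec).

-- Hypothetical: 2 billion input rows across 10 daemons,
-- each aggregating on one core at ~10M rows/sec.
SELECT 2000000000 / (10 * 10000000) AS est_agg_seconds;  -- ~20 seconds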
09-13-2016
02:35 PM
Eric's suggestion is the general solution to this problem - without stats, Impala chooses a bad join order, and there are a lot of duplicates on the right side of the join. One workaround is to add a straight_join hint, which lets you control the order in which the tables are joined. I believe in your case just adding straight_join will flip the sides of the join, which will almost certainly help you:

SELECT STRAIGHT_JOIN `dim_experiment`.`experiment_name` AS `experiment_name`
FROM `gwynniebee_bi`.`fact_recommendation_events` `fact_recommendatio`
LEFT OUTER JOIN `gwynniebee_bi`.`dim_experiment` `dim_experiment`
  ON (`fact_recommendatio`.`experiment_key` = `dim_experiment`.`experiment_key`)
GROUP BY 1
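And the general fix - computing stats so the planner can pick a good join order on its own - would look like this for the tables above:

COMPUTE STATS gwynniebee_bi.fact_recommendation_events;
COMPUTE STATS gwynniebee_bi.dim_experiment;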
08-15-2016
09:54 AM
This is a known issue that we're actively working on: https://issues.cloudera.org/browse/IMPALA-2567. Your analysis is accurate. Part of the problem is the number of connections, and the other part is the number of threads per connection. You may be able to change some operating system config settings to increase the limits here (depending on which limit you're hitting). To reduce the number of TCP connections required, you would need to either reduce the number of fragments or reduce the number of nodes executing the query. You could reduce the number of fragments by breaking the query up into smaller queries, e.g. creating temporary tables with the results of some of the subqueries. You could also try executing the query on a single node by setting num_nodes=1, if the data size is small enough that this makes sense. I suspect your query is too large for that to work, but it's hard to tell (that's a huge query plan!)
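A sketch of both workarounds, with hypothetical table and column names (not from the original question):

-- Workaround 1: materialise a subquery so the remaining queries
-- have fewer plan fragments.
CREATE TABLE tmp_subquery_result AS
SELECT k, COUNT(*) AS cnt
FROM big_table
GROUP BY k;

-- Workaround 2: run the (small enough) query on a single node.
SET NUM_NODES=1;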
08-05-2016
05:35 PM
I'm not the most knowledgeable person about this part of the code, but what you're saying is correct. One of the likely causes of long wait times is the receiver consuming data more slowly than the sender is sending it.
06-30-2016
08:53 AM
1 Kudo
I think we already have an open issue for this that is being actively worked on: https://issues.cloudera.org/browse/IMPALA-3210. I.e. we don't support it yet, but it's in the pipeline.
06-28-2016
07:53 AM
The only way to do this with zero work would be to use a view: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_view.html Otherwise you do have to run the queries as part of your data pipeline, as you mentioned.
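A minimal sketch of the view approach (all names hypothetical):

CREATE VIEW daily_totals AS
SELECT event_date, SUM(amount) AS total
FROM raw_events
GROUP BY event_date;

Anyone querying daily_totals then gets the transformation applied on the fly, with no extra pipeline step.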
06-24-2016
12:25 PM
That probably makes sense if the bottleneck is evaluating the WHERE clause. If those extra rows would be filtered out in the join anyway, the gain is limited, since you end up filtering them either during the scan or when evaluating the simple join condition. Our scans are multithreaded too, so if the join is the bottleneck, making the scans do more work sometimes doesn't slow down the query overall.
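To illustrate (hypothetical tables and predicate), filtering in a subquery moves the work into the multithreaded scan rather than the join:

SELECT f.k, d.name
FROM (SELECT k FROM fact_table WHERE amount > 100) f  -- filter evaluated during the scan
JOIN dim_table d ON f.k = d.k;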
06-24-2016
11:50 AM
The main difference seems to be execution skew. In the second profile the max time for the join is over 3 minutes, compared to much lower in the first profile. The average time isn't very different between the profiles. Probably the partitioning resulted in the data being distributed differently between the nodes, and for some reason that one node is slower. It doesn't look like it's necessarily processing more data, but maybe the node is more heavily loaded, or the data is somehow different. Is the join condition something complicated? It's only processing a few thousand rows per second through the join, which is very low.