Member since: 07-29-2015
Posts: 535
Kudos Received: 140
Solutions: 102
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1993 | 12-18-2020 01:46 PM
 | 1355 | 12-16-2020 12:11 PM
 | 838 | 12-07-2020 01:47 PM
 | 779 | 12-07-2020 09:21 AM
 | 455 | 10-14-2020 11:15 AM
12-05-2016
11:05 AM
It looks like your catalog service may be having problems. It would be worth looking in the catalogd logs for clues.
11-23-2016
03:32 PM
1 Kudo
We had an issue filed for this a while back: https://issues.cloudera.org/browse/IMPALA-3293 . It seems fairly reasonable, but I think it will depend on how much demand there is for it (or if someone contributes a patch for it).
11-23-2016
09:18 AM
1 Kudo
You're absolutely right - we use 10% as the default estimate for selectivity for scan predicates when we don't have a better estimate. One case where we have a better estimate is when the predicate is something like id = 100. In that case we can estimate that the selectivity is 1 / (num distinct values). There's also some logic to handle combining the estimates when there are multiple conditions. If you're curious, the code is here: https://github.com/apache/incubator-impala/blob/4db330e69a2dbb4a23f46e34b484da0d6b9ef29b/fe/src/main/java/org/apache/impala/planner/PlanNode.java#L518
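To make that concrete, here's a rough sketch of how the estimates work out. The table, row count, and column stats below are hypothetical, and the exact logic for combining multiple predicates depends on the Impala version:
-- hypothetical table t with 1,000,000 rows and NDV(id) = 1,000 in the column stats
-- equality predicate: selectivity is estimated as 1 / 1,000, so roughly 1,000 rows out of the scan
SELECT * FROM t WHERE id = 100;
-- a predicate with no NDV-based estimate falls back to the 10% default: roughly 100,000 rows
SELECT * FROM t WHERE name LIKE '%abc%';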
11-18-2016
06:03 PM
We added support for --ldap_password_cmd in Impala 2.5, which I think addresses this problem. See https://issues.cloudera.org/browse/IMPALA-1934 and https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_shell_options.html
11-18-2016
06:01 PM
This would typically happen if the catalog daemon was restarted.
11-16-2016
03:13 PM
If you're using impala-shell, you can use the "summary;" command. Otherwise it's accessible through the Impala debug web pages (typically http://the-impala-server:25000)
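For example, inside an impala-shell session (the table name here is hypothetical), run the query you want to inspect and then type summary; to print the per-operator breakdown of that last query:
SELECT count(*) FROM web_logs;
summary;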
11-08-2016
05:24 PM
Please do open a JIRA - it's always good to have some context on the problem from users. It looks like the scanners in that profile are just idle (based on the user and system time) - so my guess is that the slowdown is something further up in the plan.
11-07-2016
02:34 PM
We could definitely improve some of the diagnostics there. My guess is that one node is either overloaded or has some kind of hardware issue - might be worth looking at the health and CPU/memory usage of different nodes to see if one stands out.
11-04-2016
05:25 PM
1 Kudo
One thing to keep in mind when interpreting the profiles is that a series of joins will typically be pipelined to avoid materialising results. This means that the whole pipeline runs at the speed of the slowest part of the pipeline. So the limiting factor could be the client (if you're returning a lot of results), the scan at the bottom of the plan, or any of the joins in the pipeline. TotalNetworkSendTime may be somewhat misleading, since if the sender is running faster than the receiver, a backpressure mechanism kicks in that blocks the sender until the receiver has caught up. What I'd recommend initially is comparing the query summaries of the fast and slow queries to see where the difference in time is. If you're running in impala-shell you can get the summary of the last query by typing "summary;"
10-13-2016
03:20 PM
Good to hear! Please feel free to mark it as solved to make it easier for others to find.
10-03-2016
02:25 PM
Some examples of the calculations and numbers would be helpful. We use a C++ double as the underlying type, so we have the same precision. There are a lot of subtleties with floating point numbers where calculations that are mathematically equivalent with real numbers can give different results with floating point numbers. E.g. floating point addition is not associative, so it's not guaranteed that (a + b) + c == a + (b + c), and reordering a sum can change the result. On x86 there's also some additional weirdness where intermediate results of calculations are represented with 80 bits if they're kept in floating-point registers but reduced in precision to 64 bits if they're written to memory: https://en.wikipedia.org/wiki/Extended_precision. At the C++ or SQL level you have very little control over which precision is used. Fixed-precision decimal will give you more predictable results if your application isn't tolerant of rounding errors.
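A small illustration in Impala SQL (a sketch; the exact digits displayed may vary by version and client):
-- 0.1 and 0.2 have no exact double representation, so this returns roughly 0.30000000000000004 rather than 0.3
SELECT cast(0.1 AS DOUBLE) + cast(0.2 AS DOUBLE);
-- at this magnitude a double cannot represent the +1 at all, so this returns 0 rather than 1
SELECT cast(1e16 AS DOUBLE) + 1 - cast(1e16 AS DOUBLE);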
09-27-2016
09:50 AM
1 Kudo
Thanks for the data point :). We're tracking the parallelisation work here: https://issues.cloudera.org/browse/IMPALA-3902 . It's probably going to get enabled in phases - we may have parallelisation for aggregations before joins for example.
09-23-2016
10:41 AM
No, I don't think you're missing any obvious optimisation. Yes, we only use a single core per aggregation per Impala daemon. This is obviously not ideal, so we have a big push right now to do full parallelisation of every operator.
09-22-2016
12:46 PM
It's aggregating 10 million rows per core per second, which is within expectations - the main factor affecting performance is that the aggregation runs on a single core per Impala daemon. We are currently working on multi-threaded joins and aggregation, which would increase the level of parallelism available in this case. There were also some improvements to the aggregation in Impala 2.6 (https://issues.cloudera.org/browse/IMPALA-3286) that might improve throughput a bit (I'd guess somewhere between a 10% and 80% speedup depending on the input data).
09-13-2016
02:35 PM
Eric's suggestion is the general solution to this problem - without stats Impala is choosing a bad join order and there are a lot of duplicates on the right side of the join. One workaround is to add a straight_join hint, which lets you control the order in which the tables are joined. I believe in your case just adding straight_join will flip the sides of the join, which will almost certainly help you.
SELECT `dim_experiment`.`experiment_name` AS `experiment_name`
FROM `gwynniebee_bi`.`fact_recommendation_events` `fact_recommendatio`
LEFT OUTER JOIN `gwynniebee_bi`.`dim_experiment` `dim_experiment` ON (`fact_recommendatio`.`experiment_key` = `dim_experiment`.`experiment_key`)
GROUP BY 1
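For reference, a sketch of the same query with the hint added (using Impala's SELECT STRAIGHT_JOIN syntax; it's worth verifying the resulting plan with EXPLAIN):
SELECT STRAIGHT_JOIN `dim_experiment`.`experiment_name` AS `experiment_name`
FROM `gwynniebee_bi`.`fact_recommendation_events` `fact_recommendatio`
LEFT OUTER JOIN `gwynniebee_bi`.`dim_experiment` `dim_experiment` ON (`fact_recommendatio`.`experiment_key` = `dim_experiment`.`experiment_key`)
GROUP BY 1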
08-15-2016
09:54 AM
This is a known issue that we're actively working on: https://issues.cloudera.org/browse/IMPALA-2567 Your analysis is accurate. Part of the problem is the number of connections and the other part is the number of threads per connection. You may be able to change some operating system config settings to increase limits here (depending on which limit you're hitting). In order to reduce the number of TCP connections required you would either need to reduce the number of fragments or reduce the number of nodes executing the query. You could reduce the number of fragments by breaking up the query into smaller queries, e.g. by creating temporary tables with the results of some of the subqueries. You could also try executing the query on a single node by setting num_nodes=1 if the data size is small enough that this makes sense. I suspect your query is too large for that to work, but it's hard to tell (that's a huge query plan!)
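A rough sketch of both workarounds (the table names and query here are hypothetical, not from your plan):
-- break the query up: materialise part of it into an intermediate table first
CREATE TABLE tmp_daily_totals STORED AS PARQUET AS
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;

SELECT c.name, t.total_amount
FROM tmp_daily_totals t
JOIN customers c ON t.customer_id = c.customer_id;

-- or, if the data is small enough, run the whole query on a single node
SET num_nodes=1;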
08-05-2016
05:35 PM
I'm not the most knowledgeable person about this part of the code, but what you're saying is correct. One of the likely causes of long wait times is if the receiver is consuming data slower than the sender is sending it.
06-30-2016
08:53 AM
1 Kudo
I think we already have an open issue for this that is being actively worked on: https://issues.cloudera.org/browse/IMPALA-3210 . I.e. we don't support it yet, but it's in the pipeline.
06-28-2016
07:53 AM
The only way to do this with zero work would be to use a view. See http://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_view.html Otherwise you do have to run the queries as part of your data pipeline, as you mentioned.
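A minimal sketch of the view approach (hypothetical table and column names):
CREATE VIEW recent_orders AS
SELECT order_id, customer_id, order_total
FROM orders
WHERE order_date >= '2016-01-01';

-- queries against the view are rewritten against the base table at query time
SELECT count(*) FROM recent_orders;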
06-24-2016
12:25 PM
That probably makes sense if the bottleneck is evaluating the where clause. If those extra rows are filtered out in the join, then the gain is limited, since you should filter out the extra rows during the scan or when evaluating the simple join condition. Our scans are multithreaded too, so sometimes if the join is the bottleneck, making the scans do more work doesn't slow down the query overall.
06-24-2016
11:50 AM
The main difference seems to be execution skew. In the second profile the max time for the join is over 3 minutes, compared to much lower in the first profile. The average time isn't very different between the profiles. Probably the partitioning resulted in the data being distributed differently between the nodes, and for some reason that one node is slower. It doesn't look like it's necessarily processing more data, but maybe the node is more heavily loaded, or the data is somehow different. Is the join condition something complicated? It's only processing a few thousand rows per second through the join, which is very low.
06-23-2016
06:42 PM
I don't think it's on the immediate roadmap; our focus recently has been on various other things (performance, Amazon EC2 support, etc.)
06-23-2016
01:05 PM
That feature has not made it in unfortunately. The documentation at http://www.cloudera.com/documentation.html is the source of truth about what features are or are not present.
06-17-2016
04:04 PM
Ok, that's interesting. The nonsense-looking symbol at the top is probably JITted code from your query, probably an expression or something like that:
36.50% perf-18476.map [.] 0x00007f3c1d634b82
The other symbols like GetDoubleVal() may be what is calling this expensive function. It looks like it's possibly ProbeTime in the profile that's the culprit. Can you share the SQL for your query at all? I'm guessing that there's some expression in your query that's expensive to evaluate, e.g. joining on some complex expression, or doing some kind of expensive computation.
06-17-2016
11:23 AM
Maybe run 'perf top' to see where it's spending the time? I'd expect the scan to run on one core and the join and insert to run on a different core.
06-17-2016
09:19 AM
There's something strange going on here: the profile reports that the scan took around 12 seconds of CPU time, but 17 minutes of wall-clock time. So for whatever reason the scan is spending most of its time swapped out and unable to execute.
- MaterializeTupleTime(*): 17m20s
- ScannerThreadsSysTime: 74.049ms
- ScannerThreadsUserTime: 12s312ms
Is the system under heavy load or is it swapping to disk?
05-23-2016
08:44 AM
Yes, that's right. It's enabled for some cases by default (broadcast joins) in Impala 2.5. To enable it for a wider category of joins you can set the query option runtime_filter_mode=global. This setting will become the default in Impala 2.7 because of the performance benefits.
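For example, from impala-shell (a minimal sketch; the option applies to subsequent queries in the session):
SET runtime_filter_mode=global;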
05-02-2016
09:21 PM
We often use TPC-H and TPC-DS; they're pretty standard for analytical databases. There's a TPC-DS kit for Impala here: https://github.com/cloudera/impala-tpcds-kit
05-02-2016
09:19 PM
There's no direct way to find out from the profile unfortunately. If you have a live system you can look at the /threadz page on the impala debug web page (port 25000 on each Impala daemon by default) to see how many hdfs-scan-node threads are running.
04-29-2016
10:59 AM
Impala limits the number of threads executing the query plan by design. Impala dynamically increases the number of scanner threads provided there are CPU and memory resources available - in this case it seems like there weren't CPU resources available. If the machine is already busy, adding more threads can actually decrease query throughput.