Member since 07-29-2015
535 Posts
140 Kudos Received
103 Solutions
05-06-2019
05:10 PM
It'd be helpful to post your Impala version too. It seems unlikely that either SQL engine would return incorrect results on a straightforward query like this. I'd suggest looking at a subset of the data and breaking down the query until you can see where the difference lies, e.g.
SELECT *
FROM tag s
INNER JOIN has_tags ht ON S.TagNo = HT.TagNo AND S.CategoryCode = HT.CategoryCode
WHERE ht.categorycode = 'SYS'
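One way to break the query down further, as a rough sketch: run each step in both engines and compare the results. The table and column names come from the query above; the 'T001' value is a made-up placeholder for a tag number that you know exists in your data.

```sql
-- Step 1: compare the row counts of each input on its own.
SELECT COUNT(*) FROM tag WHERE categorycode = 'SYS';
SELECT COUNT(*) FROM has_tags WHERE categorycode = 'SYS';

-- Step 2: compare the cardinality of the join itself.
SELECT COUNT(*)
FROM tag s
INNER JOIN has_tags ht
  ON s.tagno = ht.tagno AND s.categorycode = ht.categorycode
WHERE ht.categorycode = 'SYS';

-- Step 3: narrow down to a single key and compare the actual rows.
-- 'T001' is a hypothetical value; substitute one from your data.
SELECT *
FROM tag s
INNER JOIN has_tags ht
  ON s.tagno = ht.tagno AND s.categorycode = ht.categorycode
WHERE ht.categorycode = 'SYS'
  AND s.tagno = 'T001';
```

The first step whose results differ between the engines tells you which table or predicate to look at more closely.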
04-26-2019
08:58 AM
1 Kudo
We have "Java Heap Size of Impala Daemon in Bytes" in CM 6.1+ and 5.16+. Before that you had to use the "Impala Daemon Environment Advanced Configuration Snippet" safety valve. Here's an example from an internal cluster with dedicated coordinators set up.
04-19-2019
02:30 PM
1 Kudo
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_explain_plan.html#explain_plan is our high-level doc. I would recommend starting with the summary to understand where time is spent, then using the profile to drill down into individual nodes. WorkloadXM can help a lot to automate the analysis process and understand bottlenecks.
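As a minimal sketch of that workflow in impala-shell (the table name and filter are placeholders, not from the original question):

```sql
-- Show the plan without running the query.
EXPLAIN SELECT COUNT(*) FROM my_db.my_table WHERE year = 2019;

-- Run the query itself.
SELECT COUNT(*) FROM my_db.my_table WHERE year = 2019;

-- impala-shell commands that report on the most recent query in the session:
-- SUMMARY gives per-operator timings and row counts,
-- PROFILE gives the full detailed profile.
SUMMARY;
PROFILE;
```

The summary is usually short enough to read at a glance, so it's a reasonable first stop before opening the much longer profile.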
04-18-2019
10:09 AM
2 Kudos
If you are mainly accessing the table using Impala, I'd recommend Impala's COMPUTE STATS for the best Impala performance. There are some subtle differences in the stats collected (whether they're partition-level or table-level). The engines can interoperate, but Impala can generally generate better plans with the full set of stats from "COMPUTE STATS".
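For reference, a minimal sketch of collecting and then checking stats from Impala (my_db.my_table is a placeholder name):

```sql
-- Collect table and column stats.
COMPUTE STATS my_db.my_table;

-- Check what was collected. A #Rows value of -1 means the row count is unknown.
SHOW TABLE STATS my_db.my_table;
SHOW COLUMN STATS my_db.my_table;
```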
04-17-2019
06:00 PM
1 Kudo
In its default configuration, metadata is cached until an "INVALIDATE METADATA" command evicts the table from the cache or until the catalog is restarted. In 5.16 and 6.1+ there are some non-default options that will evict metadata after a particular timeout. At some point these will become the defaults. Table stats are collected and stored in the Hive metastore when you run a "COMPUTE STATS" command. They are then just part of the table metadata.
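A small sketch of what that looks like in practice, assuming a placeholder table called my_db.my_table:

```sql
-- Evict the cached metadata for one table. It is reloaded the next time
-- the table is referenced.
INVALIDATE METADATA my_db.my_table;

-- The first statement that touches the table after the eviction triggers
-- the reload. The stats come back with the rest of the table metadata.
SHOW TABLE STATS my_db.my_table;
```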
04-17-2019
11:04 AM
1 Kudo
I think https://www.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_yarn largely answers your question. There's no supported YARN/Impala integration - they each manage their own resources separately. LLAMA was an integration point but it was deprecated and removed from Impala several years ago.
04-16-2019
04:30 PM
3 Kudos
Impala caches all table metadata, so planning is generally faster once the table has been referenced by a previous query. You can see the "Planner Timeline" in the Impala query profile to get a time breakdown of planning, including metadata loading.
04-08-2019
12:22 PM
If you want exact precision to a number of decimal digits, I'd recommend using the DECIMAL data type. Floating point can't exactly represent most decimal numbers. If you're returning a floating point type from a query, then you don't have any real control over the display because it's basically a client-side formatting decision. E.g. your Java code that uses the JDBC driver could take the value and format it however it wants. You *might* be able to get the desired behaviour in impala-shell by using the round() function but that depends on some undocumented behaviour. I'd recommend looking at DECIMAL.
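To make the difference concrete, here's a rough sketch. The table and column names (prices_example, measurements_example, reading) are invented for illustration.

```sql
-- Floating point: the values are binary approximations of 0.1 and 0.2,
-- and how the sum is printed is up to the client.
SELECT CAST(0.1 AS DOUBLE) + CAST(0.2 AS DOUBLE) AS double_sum;

-- DECIMAL: the values are exact to the declared scale.
SELECT CAST(0.1 AS DECIMAL(10,2)) + CAST(0.2 AS DECIMAL(10,2)) AS decimal_sum;

-- Hypothetical table that stores the value with a fixed scale up front.
CREATE TABLE prices_example (item STRING, price DECIMAL(12,2));

-- Hypothetical existing DOUBLE column, cast to a fixed scale at query time.
SELECT CAST(reading AS DECIMAL(12,2)) AS reading_2dp
FROM measurements_example;
```

With a DECIMAL result the number of digits after the decimal point is part of the type, so it doesn't depend on how the client chooses to format a floating point value.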
04-05-2019
08:54 AM
If you have more than a handful of users it becomes difficult to manage the large number of pools. Resource limits are also of limited use - you can limit the total consumption per user, but you can't guarantee that any group of users gets memory.
04-05-2019
08:53 AM
1 Kudo
You can apply memory limits at two levels. The first is the Impala daemon level, which limits the total memory consumption of the process (in part so that it doesn't exceed the physical memory available, but also so that it leaves memory available for other services running on the host). You can (and should) also apply memory limits at the query level via the MEM_LIMIT query option (the one we were talking about). That controls how much of the process memory limit a single query can get. E.g. if you're using admission control you can configure query memory limits that get applied to all queries in a resource pool.
It would be weird if running a query caused the Impala daemon memory limit to change, and I'm not sure what you would even expect to happen if you ran two queries at the same time.
I don't know if this helps, but I gave a talk recently that summarised some of the concepts here. There are slides linked from here - https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/73000
By the way, only allocating 1GB to each Impala daemon is a bad idea for a production deployment - that's simply not enough to run a lot of more complex queries on larger data sets, particularly if you are running multiple concurrent queries. We have some sizing guidelines - https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html#concept_usf_qln_3bb
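For the query-level limit, a minimal sketch in impala-shell. The 2g value and the table name are arbitrary placeholders, not recommendations.

```sql
-- Applies to queries submitted later in this session only.
-- It does not change the Impala daemon's process memory limit.
SET MEM_LIMIT=2g;

-- This query can now use at most the limit set above on each daemon.
SELECT COUNT(*) FROM my_db.my_table;
```

Two sessions can each set their own MEM_LIMIT and run at the same time; both are still bounded by the daemon-level limit, which is configured on the service rather than per query.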