About Tim Armstrong

Tim Armstrong · ‎05-06-2019

It'd be helpful to post your Impala version too. It seems unlikely that either SQL engine would return incorrect results on a straightforward query like this. I'd suggest looking at a subset of the data and breaking down the query until you can see where the difference lies, e.g. select * FROM tag s INNER JOIN has_tags ht on S.TagNo = HT.TagNo and S.CategoryCode = HT.CategoryCode WHERE ht.categorycode = 'SYS'
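
One way to break the query down is to count rows at each stage and compare the counts between the two engines. The sketch below is only an illustration: it uses Python's built-in SQLite as a stand-in engine, and the sample rows are hypothetical. Only the table and column names come from the query above.

```python
import sqlite3

# Hypothetical sample rows standing in for a small subset of the real data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (TagNo INTEGER, CategoryCode TEXT);
    CREATE TABLE has_tags (TagNo INTEGER, CategoryCode TEXT);
    INSERT INTO tag VALUES (1, 'SYS'), (2, 'SYS'), (3, 'USR');
    INSERT INTO has_tags VALUES (1, 'SYS'), (2, 'USR'), (3, 'USR'), (4, 'SYS');
""")

# Each step adds one piece of the original query, so the first step whose
# count differs between engines points at the clause to investigate.
steps = {
    "rows in tag": "SELECT COUNT(*) FROM tag",
    "rows in has_tags": "SELECT COUNT(*) FROM has_tags",
    "has_tags filtered to SYS":
        "SELECT COUNT(*) FROM has_tags WHERE CategoryCode = 'SYS'",
    "join on TagNo only":
        "SELECT COUNT(*) FROM tag s INNER JOIN has_tags ht ON s.TagNo = ht.TagNo",
    "join on both keys":
        "SELECT COUNT(*) FROM tag s INNER JOIN has_tags ht "
        "ON s.TagNo = ht.TagNo AND s.CategoryCode = ht.CategoryCode",
    "join on both keys, SYS only":
        "SELECT COUNT(*) FROM tag s INNER JOIN has_tags ht "
        "ON s.TagNo = ht.TagNo AND s.CategoryCode = ht.CategoryCode "
        "WHERE ht.CategoryCode = 'SYS'",
}


def run_steps(connection, queries):
    """Run each counting query and return {step label: row count}."""
    return {label: connection.execute(sql).fetchone()[0]
            for label, sql in queries.items()}


counts = run_steps(conn, steps)
for label, n in counts.items():
    print(f"{label}: {n}")
```

Running the same counting queries in each engine and comparing the numbers side by side should narrow the discrepancy down to a single table, filter, or join condition.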

Tim Armstrong · ‎04-26-2019

We have "Java Heap Size of Impala Daemon in Bytes" in CM 6.1+ and 5.16+. Before that you had to use the "Impala Daemon Environment Advanced Configuration Snippet" safety valve. Here's an example from an internal cluster with dedicated coordinators set up.

Tim Armstrong · ‎04-19-2019

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_explain_plan.html#explain_plan is our high-level doc. I would recommend starting with the summary to understand where time is spent, then using the profile to drill down into individual nodes. WorkloadXM can do a lot to automate the analysis process and help you understand bottlenecks.

Tim Armstrong · ‎04-18-2019

If you are mainly accessing the table using Impala, I'd recommend Impala's "COMPUTE STATS" for the best Impala performance. There are some subtle differences in the stats collected (whether they're partition or table-level). The engines can interoperate, but Impala can generally generate better plans with the full set of stats from "COMPUTE STATS".

Tim Armstrong · ‎04-17-2019

In its default configuration, metadata is cached until an "INVALIDATE METADATA" command evicts the table from the cache or until the catalog is restarted. In 5.16 and 6.1+ there are some non-default options that will evict metadata after a particular timeout. At some point these will become the defaults. Table stats are collected and stored in the Hive metastore when you run a "COMPUTE STATS" command. They are then just part of the table metadata.

Tim Armstrong · ‎04-17-2019

I think https://www.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_yarn largely answers your question. There's no supported YARN/Impala integration - they each manage their own resources separately. LLAMA was an integration point but it was deprecated and removed from Impala several years ago.

Tim Armstrong · ‎04-16-2019

Impala caches all table metadata, so planning is generally faster once the table has been referenced by a previous query. You can see the "Planner Timeline" in the Impala query profile to get a time breakdown of planning, including metadata loading.

Tim Armstrong · ‎04-08-2019

If you want exact precision to a number of decimal digits, I'd recommend using the DECIMAL data type. Floating point can't exactly represent decimal numbers. If you're returning a floating point type from a query, then you don't have any real control over the display because it's basically a client-side formatting decision. E.g. your Java code that uses the JDBC driver could take the value and format it however it wants. You *might* be able to get the desired behaviour in impala-shell by using the round() function but that depends on some undocumented behaviour. I'd recommend looking at DECIMAL.
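
The same two points can be seen outside Impala. The sketch below uses Python's standard `decimal` module as a stand-in for a DECIMAL column; it is an illustration of the general floating point behaviour, not of Impala's implementation.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly,
# so the sum is not exactly equal to the literal 0.3.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)

# Decimal arithmetic keeps exact decimal digits.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))

# How a float is displayed is a client-side choice: the same stored value
# can be rendered with as few or as many digits as the client asks for.
value = 0.1
short_form = f"{value:.2f}"
long_form = f"{value:.20f}"
print(short_form)
print(long_form)
```

The last two lines print the same stored value in two different ways, which is why the query itself cannot control how a FLOAT or DOUBLE result is displayed.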

Tim Armstrong · ‎04-05-2019

If you have more than a handful of users it becomes difficult to manage the large number of pools. Resource limits are also of limited use - you can limit the total consumption per user, but you can't guarantee that any group of users gets memory.

Tim Armstrong · ‎04-05-2019

You can apply memory limits at two levels. The first is the Impala daemon level, which limits the total memory consumption of the process (in part so that it doesn't exceed the physical memory available, but also so that it leaves memory available for other services running on the host). You can (and should) also apply memory limits at the query level via the MEM_LIMIT query option (the one we were talking about). That controls how much of the process memory limit a single query can get. E.g. if you're using admission control you can configure query memory limits that get applied to all queries in a resource pool. It would be weird if running a query caused the Impala daemon memory limit to change, and I'm not sure what you would even expect to happen if you ran two queries at the same time. I don't know if this helps, but I gave a talk recently that summarised some of the concepts here. There are slides linked from here - https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/73000 By the way, only allocating 1GB to each Impala daemon is a bad idea for a production deployment - that's simply not enough to run a lot of more complex queries on larger data sets, particularly if you are running multiple concurrent queries. We have some sizing guidelines - https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html#concept_usf_qln_3bb
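
To make the relationship between the two levels concrete, here is a back-of-envelope sketch. It is not how Impala's admission control actually makes decisions, and the function name and the 2 GiB and 64 GiB figures are hypothetical. It only divides a daemon's process memory limit by a per-query limit to show how many queries could each use their full limit at once.

```python
GIB = 1024 ** 3


def queries_that_fit(process_mem_limit: int, query_mem_limit: int) -> int:
    """Upper bound on queries that could each use their full per-query
    limit on one daemon without exceeding the process memory limit."""
    if process_mem_limit < 0 or query_mem_limit <= 0:
        raise ValueError("limits must be positive")
    return process_mem_limit // query_mem_limit


# Hypothetical per-query limit of 2 GiB against two daemon sizes.
query_limit = 2 * GIB
results = {}
for process_gib in (1, 64):
    results[process_gib] = queries_that_fit(process_gib * GIB, query_limit)
    print(f"{process_gib} GiB daemon: "
          f"{results[process_gib]} queries at 2 GiB each")
```

Under this simplified model a 1 GiB daemon cannot give even one query a 2 GiB limit, which is the intuition behind the warning about 1GB daemons above.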

Last Visited	‎02-11-2021 06:07 PM

Member Since	‎07-29-2015 04:07 PM
Posts	535
Kudos received	141

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: Same query, same data, different results betwe...

Re: impala java heap config

Re: How to understand / analyse Impala Query Text ...

Re: COMPUTE Stats or Analyze table

Re: Does IMPALA cached the query statistics?

Re: Does Impala uses fair scheduler? and YARN for ...

Re: Does IMPALA cached the query statistics?

Re: IMPALA float vs double fraction part usage

Re: Why Impala Admission Control root.[username] i...

Re: Impala mem_limit query option is not working