Member since: 07-29-2015
Posts: 535
Kudos Received: 141
Solutions: 103
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 6645 | 12-18-2020 01:46 PM |
 | 4341 | 12-16-2020 12:11 PM |
 | 3120 | 12-07-2020 01:47 PM |
 | 2142 | 12-07-2020 09:21 AM |
 | 1383 | 10-14-2020 11:15 AM |
07-24-2020
01:21 PM
The row counts reflect the state of the partition or table the last time its stats were updated by COMPUTE STATS in Impala (or ANALYZE TABLE in Hive), or the last time the stats were set manually via an ALTER TABLE. (There are also other cases where stats are updated, e.g. they can be automatically gathered by Hive, but those are the common examples.) One scenario where this could happen is if a partition was dropped after the last COMPUTE STATS was run. The stats can generally be out of sync with the number of rows in the underlying table - we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate. If you want accurate counts, you can run queries like: select count(*) from table; select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08";
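As a sketch of the workflow (my_db.my_table is a placeholder name), you can compare the stored stats against a live count:

```sql
-- Refresh the stats Impala stores for the table.
COMPUTE STATS my_db.my_table;

-- #Rows in this output comes from the stored stats,
-- so it can drift if partitions are dropped or data changes.
SHOW TABLE STATS my_db.my_table;

-- A real scan always returns the true row count.
SELECT COUNT(*) FROM my_db.my_table;
```

If the two counts disagree, re-running COMPUTE STATS brings the stored stats back in sync.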
07-21-2020
09:36 AM
@hsri it seems like this would merit some more investigation - this was added as a nicety a little while back but it may not be working as expected. If you can reproduce this with a simple query, could you file a bug on Apache Impala? https://cwiki.apache.org/confluence/display/IMPALA/Contributing+to+Impala
07-20-2020
05:49 PM
I really would suggest looking at whether the particular features you want are in CDH 6.3.3. We do backport a lot of features. E.g., the GPU scheduling features for YARN from Hadoop 3.1 were included in CDH 6.2: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_620_new_features.html#hadoop_new_620. If the question is whether you can run a non-CDH version of Hadoop and still be running CDH, then the answer is no. Or if non-CDH releases of Hadoop are supported by Cloudera - also no. We only release and support CDH versions that have been fully integrated and tested against the other CDH components. If the question is whether there is a way to take an Apache Hadoop release and deploy it in a Cloudera Manager cluster, then no - it's not packaged in the right way.
07-20-2020
12:33 PM
I think you're misunderstanding what CDH is. Hadoop in CDH is not a straight repackaging of an upstream Apache Hadoop release - it is based on an Apache Hadoop release but with a lot of enhancements, security fixes, and bug fixes based on our own testing and integration work and our experience working with customers running this in production. Our goal is that it should be more production-ready and battle-tested than any Apache Hadoop release. So CDH 6.3.3 includes a lot of the improvements from post-3.0.0 Hadoop versions. If you want to see what was added in each version, the release notes have a lot of info: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_6_release_notes.html#cdh6_release_notes
07-15-2020
06:56 PM
Yeah this isn't configurable.
07-15-2020
01:58 PM
I'm not sure exactly what is going on there, then - we could always investigate if we had an example. But I'd expect that using the full profile tree for information about query status, etc., is more robust, since it's kept up to date throughout the query. The exec_summary is a more recent add-on to the profile and is updated in a different way.
07-15-2020
12:14 PM
The exec summary isn't always going to be valid. It's only relevant if some execution actually happened, as in SELECT queries or DML - it won't be there for DDL. It's also only updated at certain points while the query is running, so it may not be present or up to date if the query hits an error or finishes early.
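For interactive inspection, impala-shell exposes both views; a rough session might look like this (the cluster address and table name are placeholders):

```
[impalad:21000] > select count(*) from my_table;
[impalad:21000] > summary;  -- prints the exec summary of the last query
[impalad:21000] > profile;  -- prints the full runtime profile tree
```

Running summary; after a DDL statement shows nothing useful, while profile; still returns a full profile.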
05-11-2020
09:38 AM
3 Kudos
We don't support it or have near-term plans to support it. We have an alternate strategy for S3 performance in Impala, based on our high-performance native Parquet implementation and our remote data cache. The idea is to cache all frequently-accessed parts of Parquet files (footers, columns referenced by queries, etc.) on local storage on compute nodes. This avoids going across the network entirely once data is cached and can give speedups for all queries, not just highly selective queries. Like everything in databases, there are pros and cons to this design, but it's a really good fit for the kinds of interactive analytic workloads we see a lot of (BI, dashboards, ad-hoc querying, etc.).

As far as I'm aware, benchmarks that report impressive speedups for S3 Select on Parquet are not comparing to a system with any kind of data caching, or, necessarily, to an implementation of Parquet that is as optimized as Impala's. Comparing cost is also tricky in benchmarks, because it adds an extra way that you're indirectly paying for compute. Whether it's beneficial for cost/performance depends on a lot of variables, including the workload intensity, compute engine, and structure of the files.

We've also been tackling selective query performance in other ways. We invested in Parquet page indices as a targeted enhancement for selective queries: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ and have been involved in adding bloom filters to the Parquet standard to help optimize selective queries across all engines and storage systems. We also care a lot about open file formats, open implementations, and interoperability between the engines we ship - if we were to add support, we'd want to make sure it was 100% compatible. It's hard to ensure that with an opaque implementation of a fairly complex file format that can change underneath us.

Our team has battle scars from ironing out a lot of the small details of the Parquet file format, and particularly from getting predicate evaluation/push-down right, so we don't take adding another layer/dimension lightly - getting all the details right and testing that it all works is a lot of work. Anyway, it's a cool technology that's on our radar, but so far it hasn't been the right solution for any of the problems we've tackled for Impala query performance on Parquet.
05-10-2020
05:43 PM
Impala SQL treats nested collections essentially as tables. If you want to "join" a nested collection with its containing table or collection, you need to use the alias that you gave the containing table earlier in the FROM list (otherwise Impala considers it a separate reference to the nested collection). I.e., instead of: from complex_struct_array2 t, t.country t2, t.country.city t3 you want to write the following to do the implicit join: from complex_struct_array2 t, t.country t2, t2.city t3
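As a fuller sketch, here's what that looks like with a hypothetical schema loosely modeled on the query above (the column and field names are made up):

```sql
-- Assumed schema:
--   complex_struct_array2(
--     id INT,
--     country ARRAY<STRUCT<name:STRING,
--                          city:ARRAY<STRUCT<name:STRING>>>>)
SELECT t.id, t2.name AS country_name, t3.name AS city_name
FROM complex_struct_array2 t,
     t.country t2,   -- implicit join with the containing table via alias t
     t2.city t3;     -- note t2.city, not t.country.city
```

Referencing t2 in the FROM list is what ties the inner collection back to the specific country element it belongs to.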
04-27-2020
08:54 AM
The good news is that we shipped DATE support in Impala in CDP. https://docs.cloudera.com/runtime/7.0.3/impala-sql-reference/topics/impala-date.html
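A quick sketch of what that enables (the table name here is made up):

```sql
-- DATE columns and DATE literals, available in Impala on CDP.
CREATE TABLE events (id INT, event_date DATE);
INSERT INTO events VALUES (1, DATE '2020-04-27');
SELECT id FROM events WHERE event_date >= DATE '2020-01-01';
```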