Member since: 07-29-2015
Posts: 535
Kudos Received: 141
Solutions: 103
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 6645 | 12-18-2020 01:46 PM |
 | 4341 | 12-16-2020 12:11 PM |
 | 3120 | 12-07-2020 01:47 PM |
 | 2142 | 12-07-2020 09:21 AM |
 | 1383 | 10-14-2020 11:15 AM |
07-24-2020
01:21 PM
The row counts reflect the state of the partition or table the last time its stats were updated by COMPUTE STATS in Impala (or ANALYZE TABLE in Hive), or the last time the stats were set manually via an ALTER TABLE. (There are also other cases where stats are updated, e.g. they can be automatically gathered by Hive, but those are the common examples.) One scenario where this could happen is if a partition was dropped after the last COMPUTE STATS was run. The stats can generally be out of sync with the number of rows in the underlying table - we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate. If you want accurate counts, you can run queries like: select count(*) from table; select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08";
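As a sketch of the workflow (my_db.my_table is a placeholder name), you can compare the stored stats against a live count:

```sql
-- Refresh the stats Impala stores for the table.
COMPUTE STATS my_db.my_table;

-- #Rows in this output comes from the stored stats,
-- so it can drift if partitions are dropped or data changes.
SHOW TABLE STATS my_db.my_table;

-- A real scan always returns the true row count.
SELECT COUNT(*) FROM my_db.my_table;
```

If the two counts disagree, re-running COMPUTE STATS brings the stored stats back in sync.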
07-21-2020
09:36 AM
@hsri it seems like this would merit some more investigation - this was added as a nicety a little while back but it may not be working as expected. If you can reproduce this with a simple query, could you file a bug on Apache Impala? https://cwiki.apache.org/confluence/display/IMPALA/Contributing+to+Impala
07-20-2020
05:49 PM
I really would suggest looking at whether the particular features you want are in CDH 6.3.3. We do backport a lot of features. E.g., the GPU scheduling features for YARN from Hadoop 3.1 were included in CDH 6.2: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_620_new_features.html#hadoop_new_620. If the question is whether you can run a non-CDH version of Hadoop and still be running CDH, then the answer is no. Or if non-CDH releases of Hadoop are supported by Cloudera - also no. We only release and support CDH versions that have been fully integrated and tested against the other CDH components. If the question is whether there is a way to take an Apache Hadoop release and deploy it in a Cloudera Manager cluster, then no - it's not packaged in the right way.
07-20-2020
12:33 PM
I think you're misunderstanding what CDH is. Hadoop in CDH is not a straight repackaging of an upstream Apache Hadoop release - it is based on an Apache Hadoop release but with a lot of enhancements, security fixes, and bug fixes based on our own testing and integration work and our experience working with customers running this in production. Our goal is that it should be more production-ready and battle-tested than any Apache Hadoop release. So CDH 6.3.3 includes a lot of the improvements from post-3.0.0 Hadoop versions. If you want to see what was added in each version, the release notes have a lot of info: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_6_release_notes.html#cdh6_release_notes
07-15-2020
06:56 PM
Yeah this isn't configurable.
07-15-2020
01:58 PM
I'm not sure exactly what is going on there, then - we could always investigate if we had an example. But I'd expect that using the full profile tree for information about query status, etc., is more robust, since it's kept up to date throughout the query. The exec_summary is a more recent add-on to the profile and is updated in a different way.
07-15-2020
12:14 PM
The exec summary isn't always going to be valid. It's only relevant if some execution actually happened, as in SELECT queries or DML - it won't be there for DDL. It's also only updated at certain points while the query is running, so it may not be present or up to date if the query hits an error or finishes early.
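For interactive inspection, impala-shell exposes both views; a rough session might look like this (the cluster address and table name are placeholders):

```
[impalad:21000] > select count(*) from my_table;
[impalad:21000] > summary;  -- prints the exec summary of the last query
[impalad:21000] > profile;  -- prints the full runtime profile tree
```

Running summary; after a DDL statement shows nothing useful, while profile; still returns a full profile.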
05-11-2020
09:38 AM
3 Kudos
We don't support it or have near-term plans to support it. We have an alternate strategy for S3 performance in Impala, based on our high-performance native Parquet implementation and our remote data cache. The idea is to cache all frequently-accessed parts of Parquet files (footers, columns referenced by queries, etc.) on local storage on compute nodes. This avoids going across the network entirely once data is cached and can give speedups for all queries, not just highly selective queries. Like everything in databases, there are pros and cons to this design, but it's a really good fit for the kinds of interactive analytic workloads we see a lot of (BI, dashboards, ad-hoc querying, etc.).

As far as I'm aware, benchmarks that report impressive speedups for S3 Select on Parquet are not comparing to a system with any kind of data caching, or, necessarily, to an implementation of Parquet that is as optimized as Impala's. Comparing cost is also tricky in benchmarks, because it adds an extra way that you're indirectly paying for compute. Whether it's beneficial for cost/performance depends on a lot of variables, including the workload intensity, compute engine, and structure of the files.

We've also been tackling selective query performance in other ways. We invested in Parquet page indices as a targeted enhancement for selective queries: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ and have been involved in adding bloom filters to the Parquet standard to help optimize selective queries across all engines and storage systems. We also care a lot about open file formats, open implementations, and interoperability between the engines we ship - if we were to add support, we'd want to make sure it was 100% compatible. It's hard to ensure that with an opaque implementation of a fairly complex file format that can change underneath us.

Our team has battle scars from ironing out a lot of the small details of the Parquet file format, and particularly from getting predicate evaluation/push-down right, so we don't take adding another layer/dimension lightly - getting all the details right and testing that it all works is a lot of work. Anyway, it's a cool technology that's on our radar, but so far it hasn't been the right solution for any of the problems we've tackled for Impala query performance on Parquet.
05-10-2020
05:43 PM
Impala SQL treats nested collections essentially as tables. If you want to "join" a nested collection with its containing table or collection, you need to use the alias that you gave the containing table earlier in the FROM list (otherwise Impala considers it a separate reference to the nested collection). I.e., instead of: from complex_struct_array2 t, t.country t2, t.country.city t3 you want to write the following to do the implicit join: from complex_struct_array2 t, t.country t2, t2.city t3
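As a fuller sketch, here's what that looks like with a hypothetical schema loosely modeled on the query above (the column and field names are made up):

```sql
-- Assumed schema:
--   complex_struct_array2(
--     id INT,
--     country ARRAY<STRUCT<name:STRING,
--                          city:ARRAY<STRUCT<name:STRING>>>>)
SELECT t.id, t2.name AS country_name, t3.name AS city_name
FROM complex_struct_array2 t,
     t.country t2,   -- implicit join with the containing table via alias t
     t2.city t3;     -- note t2.city, not t.country.city
```

Referencing t2 in the FROM list is what ties the inner collection back to the specific country element it belongs to.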
04-27-2020
08:54 AM
The good news is that we shipped DATE support in Impala in CDP. https://docs.cloudera.com/runtime/7.0.3/impala-sql-reference/topics/impala-date.html
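A quick sketch of what that enables (the table name here is made up):

```sql
-- DATE columns and DATE literals, available in Impala on CDP.
CREATE TABLE events (id INT, event_date DATE);
INSERT INTO events VALUES (1, DATE '2020-04-27');
SELECT id FROM events WHERE event_date >= DATE '2020-01-01';
```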