Member since: 07-29-2015
Posts: 535
Kudos Received: 141
Solutions: 103
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 8901 | 12-18-2020 01:46 PM |
|  | 5898 | 12-16-2020 12:11 PM |
|  | 4638 | 12-07-2020 01:47 PM |
|  | 2797 | 12-07-2020 09:21 AM |
|  | 1927 | 10-14-2020 11:15 AM |
10-14-2020 11:15 AM (1 Kudo)
On-demand metadata does not exist in CDH5.14.4. There was a technical preview in CDH5.16+ and C6.1+ that had all the core functionality but did not perform optimally for all workloads and had some other limitations. After we got feedback and experience with the feature, we made various tweaks and fixes, and in C6.3 we removed the technical preview caveat - https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_metadata.html - and there are some important tweaks in patch releases after that (e.g. 6.3.3). It is enabled by default in the latest versions of CDP. So basically, if you want to experiment and see if it meets your needs, CDH5.16+ works, but CDH6.3.3+ or CDP has the latest and greatest.
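If you do experiment with it, a minimal sketch of how the feature is switched on (the flag names below are my understanding of the local catalog / on-demand metadata startup flags; verify them against the docs for your exact release):

```shell
# Hedged sketch: enabling on-demand ("local catalog") metadata in Impala.
# Flag names are assumptions based on the metadata management docs; check your release.
# On the catalogd:
#   --catalog_topic_mode=minimal
# On each impalad coordinator:
#   --use_local_catalog=true
```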
10-13-2020 03:44 PM
@parthk yeah, I'd expect so. Sometimes this C++ inter-version compatibility is a bear.
10-09-2020 10:28 AM
What OS and compiler version are you using to build the UDF? This looks like it is probably a consequence of it being built with a newer gcc version than the one used to build Impala (gcc 4.9.2).
10-03-2020 09:40 PM
The docs have a better and more complete explanation of Impala admission control than I could give in a reply here - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html. There's also an example in the same section - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_rm_example.html. Min/max memory limits are only available in CDH6.1 and up.

If you don't want to, or aren't able to, fully implement Impala admission control, a partway solution to mitigate against a query using all the memory is to leave max memory unset (so that memory-based admission control is not enabled) and set the default query memory limit on the pool. That just limits the amount of memory any one query can use up.
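As a heavily hedged sketch of the pool-level piece: the per-pool default query memory limit is set in the admission control config (llama-site.xml in CDH). The property name below is from memory and the pool name and value are placeholders, so confirm everything against the admission control docs before using it:

```xml
<!-- Hedged sketch, llama-site.xml: default per-query memory limit for one pool.
     Property name is an assumption; 'root.default' and 2 GB are placeholders. -->
<property>
  <name>impala.admission-control.pool-default-query-mem-limit.root.default</name>
  <value>2147483648</value><!-- 2 GB, in bytes -->
</property>
```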
10-02-2020 05:45 PM
This query is using up most of the memory on the Impala daemon, and there is not enough headroom to start your other query:

Query(78befceb1eef47:d33db5f200030000): Reservation=47.49 GB ReservationLimit=48.00 GB OtherMemory=293.93 MB Total=47.78 GB Peak=47.81 GB

You can restrict memory usage of a query by setting the mem_limit option for that query. If you want to do that globally for all queries in the cluster, Impala admission control can do that - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html. E.g. you could set up memory-based admission control with a min memory limit of 2GB and a max memory limit of 20GB to prevent any one query from taking up all the memory on a node.
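A sketch of the per-query approach in impala-shell (the 20g value is only an illustration, and the query itself is elided):

```sql
-- Cap queries in this session at 20 GB per node, so one query can't
-- take all the memory on a daemon.
SET MEM_LIMIT=20g;
-- ... run the large query here ...
-- Setting the option to 0 removes the session-level cap again.
SET MEM_LIMIT=0;
```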
07-24-2020 01:21 PM
The row counts reflect the status of the partition or table the last time its stats were updated by "compute stats" in Impala (or analyze in Hive), or the last time the stats were updated manually via an alter table. (There are also other cases where stats are updated - e.g. they can be automatically gathered by Hive - but those are a few examples.) One scenario where this could happen is if a partition was dropped since the last compute stats was run. The stats can generally be out of sync with the number of rows in the underlying table - we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate. If you want to know the accurate counts, you can run queries like:

select count(*) from table;
select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08";
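To bring the stats back in sync, a sketch of the relevant statements (the table name `t` and the numRows value are hypothetical; the TBLPROPERTIES form for setting stats manually is my recollection of the docs, so double-check it for your version):

```sql
-- Recompute stats after partitions have been added or dropped:
COMPUTE STATS t;
-- Inspect the per-partition row counts Impala currently has (-1 means unknown):
SHOW TABLE STATS t;
-- Stats can also be set manually via alter table, as mentioned above:
ALTER TABLE t SET TBLPROPERTIES(
  'numRows'='1000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
```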
07-20-2020 05:49 PM
I really would suggest looking at whether the particular features you want are in CDH6.3.3. We do backport a lot of features. E.g. the GPU scheduling features for YARN from Hadoop 3.1 were included in CDH 6.2 - https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_620_new_features.html#hadoop_new_620. If the question is whether you can run a non-CDH version of Hadoop and still be running CDH, then the answer is no. Or if non-CDH releases of Hadoop are supported by Cloudera - also no. We only release and support CDH versions that have been fully integrated and tested against the other CDH components. If the question is whether there is a way to take an Apache Hadoop release and deploy it in a Cloudera Manager cluster, then no - it's not packaged in the right way.
07-20-2020 12:33 PM
I think you're misunderstanding what CDH is. Hadoop in CDH is not a straight repackaging of an upstream Apache Hadoop release - it is based on an Apache Hadoop release, but with a lot of enhancements, security fixes, and bug fixes based on our own testing and integration work and our experience working with customers running this in production. Our goal is that it should be more production-ready and battle-tested than any Apache Hadoop release. So CDH 6.3.3 includes a lot of the improvements from post-3.0.0 Hadoop versions. If you want to see what was added in each version, the release notes have a lot of info - https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_6_release_notes.html#cdh6_release_notes
05-11-2020 09:38 AM (3 Kudos)
We don't support it or have near-term plans to support it. We have an alternate strategy for S3 perf in Impala, based on our high-performance native Parquet implementation and our remote data cache. The idea is to cache all frequently-accessed parts of Parquet files (footers, columns referenced by queries, etc.) on local storage on compute nodes. This avoids going across the network entirely once data is cached and can give speedups for all queries, not just highly selective queries. Like everything in databases, there are pros and cons to this design, but it's a really good fit for the kinds of interactive analytic workloads we see a lot of (BI, dashboards, ad-hoc querying, etc.).

As far as I'm aware, benchmarks that report impressive speedups for S3 Select on Parquet are not comparing to a system with any kind of data caching, or, necessarily, to an implementation of Parquet that is as optimized as Impala's. Comparing cost is also tricky in benchmarks, because it adds an extra way that you're indirectly paying for compute. Whether it's beneficial for cost/performance depends on a lot of variables, including the workload intensity, compute engine and structure of the files.

We've also been tackling selective query perf in different ways. We invested in Parquet page indices as a targeted enhancement for selective queries - https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ - and have been involved in adding bloom filters to the Parquet standard to help optimize selective queries across all engines and storage systems. We also care a lot about open file formats, open implementations and interoperability between the engines we ship - if we were to add support, we'd want to make sure it was 100% compatible. It's hard to ensure that with an opaque implementation of a fairly complex file format that can change underneath us.

Our team has battle scars from ironing out a lot of the small details of the Parquet file format, and particularly from getting predicate evaluation/push-down right, so we don't take adding another layer/dimension lightly - getting all the details right and testing that it all works is a lot of work. Anyway, it's a cool technology that's on our radar, but so far it hasn't been the right solution for any of the problems we've tackled for Impala query performance on Parquet.
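For context, a hedged sketch of how the remote data cache mentioned above is enabled (the `--data_cache` impalad startup flag takes a comma-separated list of local directories followed by a per-directory quota; the paths and size here are placeholders):

```shell
# Hedged sketch: enable Impala's remote data cache on each impalad.
# Directories should be on fast local storage (e.g. SSD); paths and the
# 500GB quota below are placeholders for illustration.
#   --data_cache=/mnt/ssd0/impala-cache,/mnt/ssd1/impala-cache:500GB
```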
05-10-2020 05:43 PM
Impala SQL treats nested collections essentially as tables. If you want to "join" a nested collection with the containing table or collection, you need to use the same alias that you gave that table earlier in the FROM list (otherwise it is considered a separate reference to the nested collection). I.e. instead of `from complex_struct_array2 t, t.country t2, t.country.city t3`, you want to write the following to do the implicit join: `from complex_struct_array2 t, t.country t2, t2.city t3`
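As a sketch, here is the corrected FROM clause in a complete statement (the original select list isn't shown in the question, so COUNT(*) is just a stand-in):

```sql
-- Each nested collection is joined via the alias of its parent
-- from earlier in the FROM list.
SELECT COUNT(*)
FROM complex_struct_array2 t,
     t.country t2,   -- collection nested in the table: parent alias t
     t2.city t3;     -- collection nested in country: parent alias t2
```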