Member since: 07-29-2015
Posts: 535
Kudos Received: 140
Solutions: 102

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2175 | 12-18-2020 01:46 PM
 | 1442 | 12-16-2020 12:11 PM
 | 889 | 12-07-2020 01:47 PM
 | 833 | 12-07-2020 09:21 AM
 | 484 | 10-14-2020 11:15 AM
10-03-2020
09:40 PM
The docs have a better and more complete explanation of Impala admission control than I could give in a reply here: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html. There's also an example in the same section: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_rm_example.html. Min/max memory limits are only available in CDH 6.1 and up. If you don't want to, or aren't able to, fully implement Impala admission control, a partway solution that mitigates a single query using all the memory is to leave max memory unset (so that memory-based admission control is not enabled) and set the default query memory limit on the pool. That just limits the amount of memory any one query can use.
10-02-2020
05:45 PM
This query is using up most of the memory on the Impala daemon, and there is not enough headroom to start your other query:

Query(78befceb1eef47:d33db5f200030000): Reservation=47.49 GB ReservationLimit=48.00 GB OtherMemory=293.93 MB Total=47.78 GB Peak=47.81 GB

You can restrict the memory usage of a query by setting the mem_limit option for that query. If you want to do that globally for all queries in the cluster, Impala admission control can do it: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html. E.g. you could set up memory-based admission control with a min memory limit of 2 GB and a max memory limit of 20 GB to prevent any one query from taking up all the memory on a node.
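As a quick per-query workaround, you can set the limit in impala-shell before running the statement. A minimal sketch; the 20g value and the table name are just illustrations, not recommendations for your workload:

```sql
-- Cap each query in this session at 20 GB per Impala daemon.
SET MEM_LIMIT=20g;
-- A query that exceeds the limit now fails (or spills, where supported)
-- instead of starving everything else on the node.
SELECT COUNT(*) FROM my_table;
```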
09-22-2020
09:08 AM
1 Kudo
Sentry testing mode would be the only option I can think of. The problem with using Sentry without Kerberos or LDAP authentication is that it doesn't provide any real security, since the client isn't authenticated. So we don't recommend it in production: it provides the illusion of security, but no security.
09-21-2020
09:57 AM
1 Kudo
This is definitely a bug. Thanks for the clear report and reproduction. It's not IMPALA-7957, but it is somewhat related. This was new to us, so I filed https://issues.apache.org/jira/browse/IMPALA-10182 to track it. It looks like it can only happen when you have a UNION ALL, plus subqueries where the same column appears twice in the select list, plus NULL values in those columns. You can work around the issue by removing the duplicated entries in the subquery select list. E.g. the following query is equivalent and returns the expected results:

SELECT
MIN(t_53.c_41) c_41,
CAST(NULL AS DOUBLE) c_43,
CAST(NULL AS BIGINT) c_44,
t_53.c2 c2,
t_53.c2 c3s0,
t_53.c4 c4,
t_53.c4 c5s0
FROM
( SELECT
t.productsubcategorykey c_41,
t.productline c2,
t.productsubcategorykey c4
FROM
as_adventure.t1 t
WHERE
true
GROUP BY
2,
3 ) t_53
GROUP BY
4,
5,
6,
7
UNION ALL
SELECT
MIN(t_53.c_41) c_41,
CAST(NULL AS DOUBLE) c_43,
CAST(NULL AS BIGINT) c_44,
t_53.c2 c2,
t_53.c2 c3s0,
t_53.c5s0 c4,
t_53.c5s0 c5s0
FROM
( SELECT
t.productsubcategorykey c_41,
t.productline c2,
t.productsubcategorykey c5s0
FROM
as_adventure.t1 t
WHERE
true
GROUP BY
2,
3) t_53
GROUP BY
4,
5,
6,
7;
08-23-2020
02:19 PM
You need to cast one of the branches of the CASE/ELSE to a type compatible with the other one. The problem is that both decimal types have the maximum precision (38) but different scales, so neither can be converted automatically to the other without potentially losing precision. A lot of the decimal behaviour, such as the result types of expressions, was changed in CDH 6 (and upstream Apache Impala 3.0). https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_decimal.html has a lot of related information.
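A minimal sketch of the fix; the column names and the DECIMAL(38, 6) target type are hypothetical, so pick a scale that fits your data:

```sql
-- Casting both branches to the same type means Impala never has to
-- reconcile two precision-38 decimals with different scales.
SELECT CASE
         WHEN amount IS NOT NULL THEN CAST(amount AS DECIMAL(38, 6))
         ELSE CAST(fallback_amount AS DECIMAL(38, 6))
       END AS amount_out
FROM my_table;
```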
08-19-2020
10:22 AM
I'm not aware of any significant regressions in planning time between those versions. There were actually some major improvements for some common types of complex queries with many columns: https://issues.apache.org/jira/browse/IMPALA-4242. So there's no known issue that this obviously maps to (the problem described is quite abstract, so take that with a grain of salt). There were a couple of issues related to authorization and Sentry that I initially thought of, but I believe they had been addressed by 6.3.1 (keep in mind that there are quite a lot of improvements in CDH 6.3.1 relative to Impala 3.2.0). Anyway, I don't want to speculate too much without even knowing which part of planning may be slow. Can you provide query profiles for those queries? Or, if that isn't possible, at least the "Query Timeline" and "Planner Timeline" for the fast and slow queries. Edit: just to be clear, the info you provided about the views was useful, but this seems like something pretty specific to your queries, so any investigation is likely to be most fruitful starting from data about the specific queries in your environment.
08-15-2020
01:03 PM
I think the reality now is that both are great technologies and the overlap in use cases is pretty big: there are a lot of SQL workloads where either can work. I just wanted to clarify a few points. Impala does support querying complex types from Parquet: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html. We're also working on a transparent query retry feature in Impala that should be released soon.
07-29-2020
10:19 AM
Yes, we should be able to prune based on range partitions. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_kudu.html#kudu_partitioning has some examples of how to set up a table with both range and hash partitions, and you can specify arbitrary timestamp ranges for the partitions. You can see in the Impala explain plan whether your WHERE predicates were converted into Kudu pushdown predicates (they're labelled "kudu predicates").
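For instance, a sketch along the lines of the docs' examples; the table, columns, and partition bounds here are all hypothetical:

```sql
-- Hash on host for write spread, range on ts so whole time ranges can be
-- pruned when a WHERE predicate on ts excludes them.
CREATE TABLE metrics (
  host STRING,
  ts TIMESTAMP,
  metric_value DOUBLE,
  PRIMARY KEY (host, ts)
)
PARTITION BY HASH (host) PARTITIONS 4,
RANGE (ts) (
  PARTITION '2020-01-01' <= VALUES < '2020-04-01',
  PARTITION '2020-04-01' <= VALUES < '2020-07-01'
)
STORED AS KUDU;

-- Check the plan for a "kudu predicates:" entry to confirm pushdown.
EXPLAIN SELECT * FROM metrics WHERE ts >= '2020-05-01';
```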
07-28-2020
10:48 AM
Ahh, 5.11, there have been so many Impala improvements since then! This happens when the Impala daemon can't load the initial catalog (i.e. database and table metadata). The catalog and statestore roles are both involved in catalog loading, so if the Impala daemon isn't able to communicate with those roles, or they are not started or healthy, that could lead to these symptoms. You should be able to see in Cloudera Manager whether they're started and whether any warnings or errors are being flagged. It might also just be that the catalog is slow to load (maybe there's a lot of metadata, or something else is unhealthy). You would need to look at the logs of the Impala daemon you're connecting to, and maybe the catalog, to see what it's doing and why it's slow. I know this doesn't address your immediate problem, but we've seen a lot of these metadata/catalog problems go away with later versions (CDH 5.16 or CDH 6+), and particularly by moving to a dedicated coordinator/executor topology: https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/impala_dedicated_coordinator.html.
07-27-2020
08:48 AM
@Mara the previous solution is a bit out of date. We fixed this in CDH 5.14 and up so that clients can't connect until the service is ready, which avoids the issue. In older versions, the issue happened during Impala daemon startup. It can persist for a longer period when some of the services the Impala cluster depends on (catalog or statestore) are not operational, because the Impala daemon can't finish startup in those cases.
07-24-2020
01:21 PM
The row counts reflect the state of the partition or table the last time its stats were updated by COMPUTE STATS in Impala (or ANALYZE in Hive), or the last time the stats were set manually via an ALTER TABLE. (There are other cases where stats get updated too, e.g. they can be gathered automatically by Hive; those are just a few examples.) One scenario where this could happen is if a partition was dropped after the last COMPUTE STATS run. The stats can generally be out of sync with the number of rows in the underlying table; we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate. If you want accurate counts, you can run queries like: select count(*) from table; select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08";
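To bring the stored stats back in line with the data, recomputing and inspecting them looks like this; the table name is hypothetical:

```sql
-- Recompute table and column stats after data or partitions change.
COMPUTE STATS my_table;
-- Show the per-partition row counts that Impala has stored.
SHOW TABLE STATS my_table;
```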
07-21-2020
09:36 AM
@hsri it seems like this merits some more investigation: this was added as a nicety a little while back, but it may not be working as expected. If you can reproduce it with a simple query, could you file a bug on Apache Impala? https://cwiki.apache.org/confluence/display/IMPALA/Contributing+to+Impala
07-20-2020
05:49 PM
I really would suggest looking at whether the particular features you want are in CDH 6.3.3; we do backport a lot of features. E.g. the GPU scheduling features for YARN from Hadoop 3.1 were included in CDH 6.2: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_620_new_features.html#hadoop_new_620. If the question is whether you can run a non-CDH version of Hadoop and still be running CDH, then the answer is no. Or if non-CDH releases of Hadoop are supported by Cloudera: also no. We only release and support CDH versions that have been fully integrated and tested against the other CDH components. If the question is whether there is a way to take an Apache Hadoop release and deploy it in a Cloudera Manager cluster, then no; it's not packaged in the right way.
07-20-2020
12:33 PM
I think you're misunderstanding what CDH is. Hadoop in CDH is not a straight repackaging of an upstream Apache Hadoop release: it is based on an Apache Hadoop release, but with a lot of enhancements, security fixes, and bug fixes drawn from our own testing and integration work and our experience with customers running it in production. Our goal is for it to be more production-ready and battle-tested than any Apache Hadoop release. So CDH 6.3.3 includes a lot of the improvements from post-3.0.0 Hadoop versions. If you want to see what was added in each version, the release notes have a lot of info: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_6_release_notes.html#cdh6_release_notes
07-15-2020
06:56 PM
Yeah this isn't configurable.
07-15-2020
01:58 PM
I'm not sure exactly what's going on there, then; we could always investigate if we had an example. But I'd expect that using the full profile tree for information about query status, etc. is more robust, since it's kept up to date throughout the query. The exec_summary is a more recent addition to the profile and is updated in a different way.
07-15-2020
12:14 PM
The exec summary isn't always going to be valid. It's only relevant if some execution actually happened, as in SELECT queries or DML; it won't be there for DDL. It's also only updated at certain points while the query runs, so it may not be present or up to date if the query hits an error or finishes early.
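If you're looking at this interactively, impala-shell can show both views after a statement completes; the query here is a hypothetical example:

```sql
SELECT COUNT(*) FROM my_table;
SUMMARY;  -- the per-operator exec summary; only meaningful for SELECT/DML
PROFILE;  -- the full runtime profile, which is populated for other statements too
```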
06-06-2020
05:52 PM
Cloudera Express included Impala, but we discontinued Cloudera Express - see https://docs.cloudera.com/documentation/enterprise/latest/topics/cm_ag_licenses.html
05-11-2020
09:38 AM
3 Kudos
We don't support it, and we don't have near-term plans to support it. We have an alternate strategy for S3 performance in Impala, based on our high-performance native Parquet implementation and our remote data cache. The idea is to cache all frequently accessed parts of Parquet files (footers, columns referenced by queries, etc.) on local storage on compute nodes. This avoids going across the network entirely once data is cached, and can give speedups for all queries, not just highly selective ones. Like everything in databases, there are pros and cons to this design, but it's a really good fit for the kinds of interactive analytic workloads we see a lot of (BI, dashboards, ad-hoc querying, etc.).

As far as I'm aware, benchmarks that report impressive speedups for S3 Select on Parquet are not comparing against a system with any kind of data caching, or necessarily against an implementation of Parquet as optimized as Impala's. Comparing cost is also tricky in benchmarks, because S3 Select adds an extra way that you're indirectly paying for compute. Whether it's beneficial for cost/performance depends on a lot of variables, including the workload intensity, the compute engine, and the structure of the files.

We've also been tackling selective query performance in different ways. We invested in Parquet page indexes as a targeted enhancement for selective queries (https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/) and have been involved in adding bloom filters to the Parquet standard to help optimize selective queries across all engines and storage systems.

We also care a lot about open file formats, open implementations, and interoperability between the engines we ship; if we were to add support, we'd want to make sure it was 100% compatible. That's hard to ensure with an opaque implementation of a fairly complex file format that can change underneath us. Our team has battle scars from ironing out a lot of the small details of the Parquet file format, particularly getting predicate evaluation/pushdown right, so we don't take adding another layer/dimension lightly: getting all the details right and testing that it all works is a lot of work. Anyway, it's a cool technology that's on our radar, but so far it hasn't been the right solution for any of the problems we've tackled for Impala query performance on Parquet.
05-10-2020
05:43 PM
Impala SQL treats nested collections essentially as tables. If you want to "join" a nested collection with its containing table or collection, you need to reference it through the alias you gave that table earlier in the FROM list (otherwise Impala considers it a separate reference to the nested collection). I.e. instead of

from complex_struct_array2 t, t.country t2, t.country.city t3

you want to write the following to do the implicit join:

from complex_struct_array2 t, t.country t2, t2.city t3
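Putting that in a complete statement, a sketch assuming country is an array of structs containing a nested city collection (the COUNT(*) is just for illustration):

```sql
-- t2.city walks the city collection of each country row produced by t2,
-- giving the implicit join; t.country.city would instead start a fresh,
-- unrelated reference to the collection.
SELECT COUNT(*)
FROM complex_struct_array2 t, t.country t2, t2.city t3;
```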
04-27-2020
08:54 AM
The good news is that we shipped DATE support in Impala in CDP: https://docs.cloudera.com/runtime/7.0.3/impala-sql-reference/topics/impala-date.html
04-24-2020
12:23 PM
2 Kudos
I believe that error should be fixed in the most recent releases of Impyla (0.16.1) and thrift_sasl (0.4.2).
04-21-2020
01:03 PM
We did a wholesale revamp of decimal behaviour going from CDH5 to CDH6. The default behaviour all changed in CDH 6.0: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_600_new_features.html#decimal_v2. There's a whole epic JIRA capturing the changes: https://issues.apache.org/jira/browse/IMPALA-4072. I think https://issues.apache.org/jira/browse/IMPALA-4370 might be the specific fix that you're seeing, based on your analysis. The fix version for that change is Impala 2.9.0, so the code change is in CDH5.15.2, but it was done behind the DECIMAL_V2 query option, which wasn't a supported option until CDH6. In CDH6 you can toggle the behaviour with the DECIMAL_V2 query option (it will eventually be removed, but was kept for backward compatibility).
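If you need the old semantics temporarily in CDH6, the toggle is a one-liner; treat it as a stopgap, since the option will eventually go away:

```sql
-- DECIMAL_V2 defaults to true in CDH6; setting it to false reverts this
-- session to the CDH5-era decimal semantics.
SET DECIMAL_V2=false;
```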
04-15-2020
10:51 AM
To clarify my previous answer, you get KRPC by installing CDH5.15+ or CDH6.1+. CDH6.0 does not support KRPC.
04-15-2020
10:03 AM
Can you provide more information about your version and how the table was created? Ideally "show create table <table>" output. The answer depends a lot on those things, because transactional table support has evolved a lot in recent versions and there are several variants of transactional tables.
04-09-2020
09:34 AM
Impala is part of the CDH parcel, so there's no way to mix and match the Impala component version with other component versions. I.e. to get Impala 3.1 features you need to upgrade to CDH 6.1 or greater. If you can't do a major version upgrade, Impala in CDH 5.16.2 is a big improvement in many ways over CDH 5.14.
04-09-2020
09:30 AM
To be clear, most production setups with dedicated coordinators that we see give the coordinator the same amount of memory as the executors.
04-09-2020
09:29 AM
In Impala in CDH6, queries reserve the same amount of memory on the coordinator as on the executors. I.e. coordinators need to be given enough memory that queries can reserve the same amount on the coordinator as on the executors. So the config with one 8 MB coordinator and two 128 GB executors won't work well.
04-09-2020
09:25 AM
1 Kudo
We released support for CDH5.16.2 and CDH6.2+ last year, so you would need to upgrade a minor version but not a major version: https://blog.cloudera.com/apache-phoenix-for-cdh/
03-27-2020
09:39 AM
I'm not aware of any plans; we've only been doing maintenance releases on CDH5.