Member since: 07-29-2015
Posts: 535
Kudos Received: 140
Solutions: 103

My Accepted Solutions
Title | Views | Posted
---|---|---
| 5811 | 12-18-2020 01:46 PM
| 3740 | 12-16-2020 12:11 PM
| 2653 | 12-07-2020 01:47 PM
| 1896 | 12-07-2020 09:21 AM
| 1229 | 10-14-2020 11:15 AM
01-20-2021
09:38 AM
There's a 64KB limit on strings in Kudu, but otherwise you can store any binary data in them. https://docs.cloudera.com/documentation/kudu/5-10-x/topics/kudu_known_issues.html#schema_design_limitations
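A minimal sketch of what that looks like (table and column names are made up) - a Kudu table created through Impala where a STRING column carries the payload, with each value staying under that ~64KB limit:

```sql
-- Illustrative only: Kudu table created via Impala. The 'payload'
-- STRING column holds the (possibly binary) data; each cell must
-- stay under the ~64KB limit noted in the known-issues page.
CREATE TABLE blob_store (
  id BIGINT,
  payload STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;
```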
01-19-2021
09:45 AM
Upgrading to a newer version of Impala will solve most scalability issues that you'd see on Impala 2.9, mostly because of https://blog.cloudera.com/scalability-improvement-of-apache-impala-2-12-0-in-cdh-5-15-0/.
12-21-2020
09:13 AM
1 Kudo
The versions of Apache Impala in Cloudera are "based on" Apache Impala releases but often include substantial additional features and fixes; we do a lot of work beyond just repackaging the upstream releases. Please consult the Cloudera release notes for the version of CDH/CDP you're using if you want to understand what features are present - looking at the version string alone won't give you that information.

CDH 6.3.4 includes that fix. See https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_634_fixed_issues.html

Most new features have been going into CDP only - we have been rolling out new Impala features in CDP public cloud on a fairly continual basis, and these then trickle down into CDP private cloud releases. Some limited features have gone into minor CDH releases, e.g. 6.3.0, but we've generally been prioritizing stability there and focusing on CDP. We shipped a preview version of the remote read cache in CDH 6.3.0 - https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_630_new_features.html#impala_new_630 . Ranger support is not in CDH - the general idea there is to allow migrating from Sentry to Ranger as part of the CDH->CDP upgrade process.
12-21-2020
09:01 AM
We have some background on schema evolution in Parquet in the docs - https://docs.cloudera.com/runtime/7.2.2/impala-reference/topics/impala-parquet.html - see "Schema Evolution for Parquet Tables". Some of the details are specific to Impala, but the concepts are the same across engines that use Parquet tables, including Hive and Spark.

At a high level, you can think of the data files as immutable while the table schema evolves. If you add a new column at the end of the table, for example, that updates the table schema but leaves the Parquet files unchanged. When the table is queried, the table schema and the Parquet file schema are reconciled, and the new column's values will all be NULL. If you want to modify the existing rows and include new non-NULL values, that requires rewriting the data, e.g. with an INSERT OVERWRITE statement for a partition or a CREATE TABLE ... AS SELECT to create an entirely new table (see the sketch below).

Keep in mind that traditional Parquet tables are not optimized for workloads with updates - Apache Kudu in particular, and also transactional tables in Hive 3+, have support for row-level updates that is more convenient/efficient. We definitely don't require rewriting the whole table every time you want to add a column - that would be impractical for large tables!
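As a rough illustration of the add-a-column case described above (table, column, and staging-table names are made up):

```sql
-- Adding a column only changes the table schema; the existing Parquet
-- data files are untouched, so old rows read back NULL for 'discount'.
ALTER TABLE sales ADD COLUMNS (discount DECIMAL(9,2));

-- Backfilling real values means rewriting data, e.g. overwriting one
-- partition from a (hypothetical) staging table...
INSERT OVERWRITE sales PARTITION (year=2020)
SELECT id, amount, discount FROM sales_staging WHERE year = 2020;

-- ...or creating an entirely new table with CREATE TABLE ... AS SELECT.
CREATE TABLE sales_v2 STORED AS PARQUET AS
SELECT id, amount, CAST(0 AS DECIMAL(9,2)) AS discount FROM sales;
```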
12-18-2020
01:46 PM
1 Kudo
This looks most likely like a bug in the Impala planner, with some of the estimated stats being calculated as negative and triggering that error. I haven't seen exactly this symptom before, but I think it's most likely caused by https://issues.apache.org/jira/browse/IMPALA-7604. This could cause it if the #rows estimate from the aggregation (GROUP BY) before the sort (ORDER BY) overflows and wraps around to being negative. For this to happen, the product of the distinct value counts of the columns in the GROUP BY would have to be > 2^63. E.g. if you have GROUP BY a, b, c and each column has 10 distinct values, you would get a product of 10 * 10 * 10 = 1000. It's possible that the stats changed somehow and it wasn't being triggered before.

CDH 6.3.4 is the earliest release with a fix for it. You could work around it by dropping stats on one of the tables, but that can have some pretty bad performance implications. You might also be able to work around it by tweaking the query, e.g. grouping by different columns. There's a rough sketch of this below.
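A sketch of how the estimate can overflow and one possible workaround (the table and column names are hypothetical):

```sql
-- The planner estimates the GROUP BY output cardinality as roughly
-- NDV(a) * NDV(b) * NDV(c). With very high per-column distinct value
-- counts that product can exceed 2^63 and wrap to a negative value on
-- releases without the IMPALA-7604 fix, triggering the error at the
-- ORDER BY.
SELECT a, b, c, COUNT(*) AS cnt
FROM wide_table
GROUP BY a, b, c
ORDER BY cnt DESC;

-- Possible workaround (with performance trade-offs for other queries
-- that use this table):
DROP STATS wide_table;
```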
12-17-2020
01:09 PM
"... or because memory is running low or because the extra scanner threads are not needed to keep up with the consumer from the scan node." is how I should've finished that. Not sure what happened there, I did hit the main points but truncated the last sentence. That part of Impala has had a lot of improvements for performance and observability (i.e. more info in the profile) since CDH5.10, FWIW, I'd guess on a later version this wouldn't be a problem or would be easier to debug at least.
12-16-2020
01:47 PM
One difference is how fast it's reading from disk - i.e. TotalRawHdfsReadTime. In CDH 5.12 that includes both time spent fetching metadata from the HDFS namenode and time actually reading the data off disk. If you're saying that it's only slow on one node, that probably rules out HDFS namenode slowness, which is a common cause - so it's probably actually slower doing the I/O. Note: in CDH 5.15 we split out the namenode RPC time into TotalRawHdfsOpenTime to make it easier to debug things like this.

I don't know exactly why I/O would be slower on that one node; it might require inspecting the host to see what's happening and whether there's more CPU or I/O load on that host. We've seen that happen when a node is more heavily loaded than other nodes because of some kind of uneven data distribution, e.g. one file is very frequently accessed, maybe because there's a dimension table that is referenced in many queries. That can sometimes be addressed by setting SCHEDULE_RANDOM_REPLICA as a query hint or query option - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_hints.html or https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_schedule_random_replica.html - or even by enabling HDFS caching for the problematic table (HDFS caching spreads load across all cached replicas). There's a rough sketch of those options after this post.

Another possible cause, based on that profile, is that it's competing for scanner threads with other queries running on the same node - AverageScannerThreadConcurrency is lower in the slow case. This can either be because other concurrent queries grabbed scanner threads first (there's a global soft limit of 3x # cpus per node) or because
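As a rough illustration of the query option and HDFS caching approaches mentioned above (table and cache pool names are made up; the hint form is described in the linked hints page):

```sql
-- Query option form: applies to scans in this session/query.
SET SCHEDULE_RANDOM_REPLICA=1;
SELECT d.region, COUNT(*)
FROM fact_table f JOIN dim_table d ON f.dim_id = d.id
GROUP BY d.region;

-- Or enable HDFS caching for the hot table so reads spread across
-- all cached replicas (the cache pool name here is an assumption):
ALTER TABLE dim_table SET CACHED IN 'hot_pool' WITH REPLICATION = 3;
```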
12-16-2020
12:11 PM
1 Kudo
In that case - scheduling of remote reads - for Kudu it's based on distributing the work for each scan across nodes as evenly as possible. For Kudu we randomize the assignment somewhat to even things out, but the distribution is not based on resource availability - i.e. we generate the schedule and then wait for the resources to become available on the nodes we picked. I understand that reversing that (i.e. finding available nodes, then distributing work across them) would be desirable in some cases, but there are pros and cons to doing that. For remote reads from filesystems/object stores, on more recent versions, we do something a bit different - each file has affinity to a set of executors and we try to schedule it on those, so that we're more likely to get hits in the remote data cache.
12-16-2020
10:17 AM
Are you sure there isn't a stack trace for IllegalStateException in the impala daemon logs or catalog daemon logs? That would help match it to a bug.
12-15-2020
08:59 PM
You need to run compute stats on the base tables referenced by the views - compute stats directly on a view isn't supported.
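For example (table names here are just placeholders):

```sql
-- COMPUTE STATS goes on each base table underlying the view,
-- not on the view itself:
COMPUTE STATS fact_table;
COMPUTE STATS dim_table;
```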