Member since 07-29-2015
535 Posts
141 Kudos Received
103 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 8901 | 12-18-2020 01:46 PM |
| | 5897 | 12-16-2020 12:11 PM |
| | 4636 | 12-07-2020 01:47 PM |
| | 2797 | 12-07-2020 09:21 AM |
| | 1925 | 10-14-2020 11:15 AM |
12-21-2020
09:13 AM
1 Kudo
The versions of Apache Impala in Cloudera are "based on" Apache Impala releases but often include substantial additional features and fixes. We do a lot of work beyond just repackaging the upstream releases. Please consult the Cloudera release notes for the version of CDH/CDP you're using if you want to understand what features are present; the version string alone won't give you that information. CDH 6.3.4 includes that fix (see https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_634_fixed_issues.html).
Most new features have been going into CDP only. We have been rolling out new Impala features in CDP public cloud on a fairly continual basis, and these then trickle down into CDP private cloud releases. A limited number of features have gone into minor CDH releases, e.g. 6.3.0, but we've generally been prioritizing stability there and focusing on CDP. For example, we shipped a preview version of the remote read cache in CDH 6.3.0 (https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_630_new_features.html#impala_new_630). Ranger support is not in CDH; the general idea there is to allow migrating from Sentry to Ranger as part of the CDH->CDP upgrade process.
12-18-2020
01:46 PM
1 Kudo
This looks most likely to be a bug in the Impala planner, where some of the estimated stats are calculated as negative and that triggers the error. I haven't seen exactly this symptom before, but I think it's most likely caused by https://issues.apache.org/jira/browse/IMPALA-7604. That bug could cause it if the #rows estimate from the aggregation (GROUP BY) before the sort (ORDER BY) overflows and wraps around to a negative value. For this to happen, the product of the distinct value counts of the columns in the GROUP BY would have to be > 2^63. To show how the product is formed: if you have GROUP BY a, b, c and each column has 10 distinct values, you would get a product of 10 * 10 * 10 = 1000, which is nowhere near that threshold. It's possible that the stats changed somehow and the bug wasn't being triggered before. CDH 6.3.4 is the earliest release with a fix for it. You could work around it by dropping stats on one of the tables, but that can have some pretty bad performance implications. You might also be able to work around it by tweaking the query, e.g. grouping by different columns.
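To make the overflow concrete, here is a minimal Python sketch of how a product of distinct-value counts can wrap around to a negative number. It assumes the estimate is held in a signed 64-bit integer; `wrap_int64` and `estimated_groups` are hypothetical helpers written for this illustration, not Impala planner functions.

```python
def wrap_int64(value: int) -> int:
    """Reinterpret a Python int as a signed 64-bit two's-complement value."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def estimated_groups(ndvs):
    """Multiply per-column distinct-value counts, wrapping at 64 bits.

    Assumption: this mimics arithmetic on a signed 64-bit integer; it is
    not the planner's actual code.
    """
    product = 1
    for ndv in ndvs:
        product = wrap_int64(product * ndv)
    return product


# Three columns with 10 distinct values each: a small, harmless product.
small = estimated_groups([10, 10, 10])

# Three columns with 2.5 million distinct values each: the true product
# (about 1.56e19) is larger than 2^63, so the wrapped result is negative.
large = estimated_groups([2_500_000, 2_500_000, 2_500_000])

print(small, large)
```

The point of the sketch is only that a few high-cardinality grouping columns are enough to push the product past the 64-bit signed range.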
12-16-2020
12:11 PM
1 Kudo
In that case (scheduling of remote reads): for Kudu, it's based on distributing the work for each scan across nodes as evenly as possible. We randomize the assignment somewhat to even things out, but the distribution is not based on resource availability. I.e. we generate the schedule and then wait for the resources to become available on the nodes we picked. I understand that reversing that (i.e. finding available nodes, then distributing work across them) would be desirable in some cases, but there are pros and cons to doing that. For remote reads from filesystems/object stores, on more recent versions, we do something a bit different: each file has affinity to a set of executors and we try to schedule it on those, so that we're more likely to get hits in the remote data cache.
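As a rough illustration of the file-to-executor affinity idea, here is a small Python sketch that uses rendezvous (highest-random-weight) hashing. That technique is my choice for the example and is not a description of how Impala implements it; the function name, executor names, file path and `replicas` parameter are all hypothetical.

```python
import hashlib


def affinity_executors(file_path, executors, replicas=2):
    """Pick a stable set of executors for a file via rendezvous hashing.

    Assumption: an illustrative stand-in, not Impala's scheduler.
    """
    def score(executor):
        digest = hashlib.sha256(f"{executor}|{file_path}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The executors with the highest scores for this file form its affinity set.
    return sorted(executors, key=score, reverse=True)[:replicas]


executors = [f"exec-{i}" for i in range(6)]
path = "s3a://bucket/tbl/part-0.parq"  # hypothetical file
chosen = affinity_executors(path, executors)
print(chosen)
```

Because the same file keeps mapping to the same executors, repeated reads of that file tend to land where its data may already be cached, and removing an executor outside the chosen set does not change the assignment.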
12-16-2020
10:17 AM
Are you sure there isn't a stack trace for IllegalStateException in the Impala daemon logs or catalog daemon logs? That would help match it to a bug.
12-15-2020
08:59 PM
You can limit the aggregate memory that any one pool will consume. There isn't exactly a priority option, though: there's no ability to pre-empt queries once they are running.
12-14-2020
08:35 AM
1 Kudo
You want to enable memory-based admission control: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_control . Without that enabled, memory reservation for queries is best effort: queries just run and get whatever memory they ask for until memory is exhausted. With it enabled, queries are allocated specific amounts of memory and are queued when memory is low. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_rm_example.html is a good starting point. I'd recommend setting a minimum and maximum memory limit, probably a minimum of ~1GB and a maximum of whatever you're comfortable with a single query being given. I also gave a talk a while ago that gives an overview of some things: https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/73000.html.
That all said, scheduling is based on data locality/affinity: the read of each input file is scheduled on a node with a local replica of that file. There's also affinity to bias scheduling towards a single replica, so that the same data is read on the same node as much as possible. This minimizes network traffic and maximizes use of the OS buffer cache (i.e. maximizes the likelihood of reading the data from memory instead of disk).
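The admission behaviour described above (each query gets a specific memory amount, and queries queue when memory is low) can be pictured with a toy Python model. Everything in it is an assumption made for illustration: the `ToyPool` class, the FIFO queue, and the example numbers are not Impala's implementation or its configuration names.

```python
from collections import deque


class ToyPool:
    """Toy model of one memory-based admission control pool (not Impala code)."""

    def __init__(self, max_mem_mb, min_query_mb, max_query_mb):
        self.max_mem_mb = max_mem_mb
        self.min_query_mb = min_query_mb
        self.max_query_mb = max_query_mb
        self.used_mb = 0
        self.running = {}     # query id -> admitted memory in MB
        self.queue = deque()  # (query id, memory in MB), FIFO by assumption

    def _clamp(self, estimate_mb):
        # Keep the per-query memory between the minimum and maximum limits.
        return max(self.min_query_mb, min(estimate_mb, self.max_query_mb))

    def submit(self, query_id, estimate_mb):
        mem = self._clamp(estimate_mb)
        if not self.queue and self.used_mb + mem <= self.max_mem_mb:
            self.running[query_id] = mem
            self.used_mb += mem
            return "admitted"
        self.queue.append((query_id, mem))
        return "queued"

    def release(self, query_id):
        # Running queries are never pre-empted; memory frees up only on completion.
        self.used_mb -= self.running.pop(query_id)
        admitted = []
        while self.queue and self.used_mb + self.queue[0][1] <= self.max_mem_mb:
            qid, mem = self.queue.popleft()
            self.running[qid] = mem
            self.used_mb += mem
            admitted.append(qid)
        return admitted


pool = ToyPool(max_mem_mb=4096, min_query_mb=1024, max_query_mb=2048)
r1 = pool.submit("q1", 100)     # raised to the 1024 MB minimum
r2 = pool.submit("q2", 10_000)  # capped at the 2048 MB maximum
r3 = pool.submit("q3", 2000)    # does not fit yet, so it waits
woken = pool.release("q1")      # q1 finishes and q3 can now be admitted
print(r1, r2, r3, woken)
```

The minimum keeps small queries from being starved of memory, and the maximum stops one large query from taking the whole pool.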
12-08-2020
09:36 AM
Glad to help! I'm excited about the S3 changes just because they simplify ingestion so much. I'll add a disclaimer here in case other people read the solution. There's *some* potential for performance impact when disabling s3guard for S3-based tables with large partition counts, just because of the difference in implementation: retrieving the listing from DynamoDB may be quicker than retrieving it from S3 in some scenarios.
12-07-2020
01:47 PM
If you have objects that have been deleted in S3 but are showing up in file listings after refreshing the table (which sounds like the case, since you dropped and recreated the table), it's possible that there's some inconsistency between the state in s3guard and the state in S3. https://docs.cloudera.com/runtime/7.0.2/cloud-data-access/topics/cr-cda-s3guard-operational-issues.html has some background on s3guard. I'm not an s3guard expert (it's a layer Impala builds on), so I don't have much to add about how you would debug/address this beyond what we have in the docs there. One option to consider is to disable s3guard to avoid it entirely. Very recently S3 improved its consistency model to address the main problems s3guard solved (https://aws.amazon.com/s3/consistency/), so you could try disabling s3guard for that bucket to see if it solves the problem.
12-07-2020
09:21 AM
Slide 17 of https://blog.cloudera.com/latest-impala-cookbook/ has some rules of thumb. Can you mention what version you're running and whether you have any other non-standard configs set, e.g. load_catalog_in_background? We made some improvements in this area and have added some options in more recent versions.
11-12-2020
08:05 PM
Impala would probably give you the fastest response time. Personally, I would write a script (Python or whatever) that fetched the queries and just ran them one by one. You could try to combine the queries in various ways if you really cared about reducing latency (I'm not sure that any of these alternatives would make a massive difference, but maybe some amount). E.g. the following would require only a single scan of the table (although it might be more expensive because you don't have filtering from the where clause):
select count(case when <where clause 1> then 1 end), count(case when <where clause 2> then 1 end)
from MyTable
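Here is a small runnable sketch of the same idea, comparing one query per filter against the combined single-scan form. It uses Python's built-in sqlite3 module as a stand-in for Impala, purely so the example is self-contained; the table contents and the two filters are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [("east", 5), ("west", 50), ("east", 500), ("north", 20)],
)

# Hypothetical where clauses; in practice these would be the fetched filters.
filters = ["region = 'east'", "amount > 10"]

# Approach 1: one query (and one scan of the table) per filter.
separate = [
    conn.execute(f"SELECT count(*) FROM MyTable WHERE {f}").fetchone()[0]
    for f in filters
]

# Approach 2: one query with a count(CASE ...) per filter, scanning once.
select_list = ", ".join(f"count(CASE WHEN {f} THEN 1 END)" for f in filters)
combined = list(conn.execute(f"SELECT {select_list} FROM MyTable").fetchone())

print(separate, combined)
```

Both approaches return the same counts. The filters are spliced into the SQL text here only because they are trusted literals in the sketch; that is not safe for untrusted input.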