Member since: 07-29-2015
Posts: 535
Kudos Received: 140
Solutions: 102
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1833 | 12-18-2020 01:46 PM
 | 1291 | 12-16-2020 12:11 PM
 | 794 | 12-07-2020 01:47 PM
 | 733 | 12-07-2020 09:21 AM
 | 432 | 10-14-2020 11:15 AM
01-20-2021
09:38 AM
There's a 64KB limit on strings in Kudu, but otherwise you can store any binary data in them. https://docs.cloudera.com/documentation/kudu/5-10-x/topics/kudu_known_issues.html#schema_design_limitations
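For illustration, a minimal sketch of a Kudu table storing binary payloads in a STRING column (table and column names are hypothetical):

    -- Hypothetical sketch: binary data in a Kudu STRING column.
    -- Each value must stay under the 64KB limit noted above.
    CREATE TABLE blob_store (
      id BIGINT,
      payload STRING,   -- arbitrary binary data, < 64KB per value
      PRIMARY KEY (id)
    )
    PARTITION BY HASH (id) PARTITIONS 4
    STORED AS KUDU;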
01-19-2021
09:45 AM
Upgrading to a newer version of Impala will solve most of the scalability issues you'd see on Impala 2.9, largely because of https://blog.cloudera.com/scalability-improvement-of-apache-impala-2-12-0-in-cdh-5-15-0/.
12-21-2020
09:13 AM
1 Kudo
The versions of Apache Impala in Cloudera are "based on" Apache Impala releases but often include substantial additional features and fixes. We do a lot of work beyond just repackaging the upstream releases. Please consult the Cloudera release notes for the version of CDH/CDP you're using if you want to understand what features are present; just looking at the version string won't give you that information. CDH 6.3.4 includes that fix. See https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_634_fixed_issues.html

Most new features have been going into CDP only. We have been rolling out new Impala features in CDP Public Cloud on a fairly continual basis, and these then trickle down into CDP Private Cloud releases. Some limited features have gone into minor CDH releases (e.g. 6.3.0), but we've generally been prioritizing stability there and focusing on CDP. We shipped a preview version of the remote read cache in CDH 6.3.0 - https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_630_new_features.html#impala_new_630

Ranger support is not in CDH. The general idea there is to allow migrating from Sentry to Ranger as part of the CDH->CDP upgrade process.
12-21-2020
09:01 AM
We have some background on schema evolution in Parquet in the docs - https://docs.cloudera.com/runtime/7.2.2/impala-reference/topics/impala-parquet.html - see "Schema Evolution for Parquet Tables". Some of the details are specific to Impala, but the concepts are the same across engines that use Parquet tables, including Hive and Spark.

At a high level, you can think of the data files as immutable while the table schema evolves. If you add a new column at the end of the table, for example, that updates the table schema but leaves the Parquet files unchanged. When the table is queried, the table schema and the Parquet file schema are reconciled, and the new column's values will all be NULL. If you want to modify the existing rows to include new non-NULL values, that requires rewriting the data, e.g. with an INSERT OVERWRITE statement for a partition or a CREATE TABLE ... AS SELECT to create an entirely new table.

Keep in mind that traditional Parquet tables are not optimized for workloads with updates - Apache Kudu in particular, and also transactional tables in Hive 3+, support row-level updates in a way that is more convenient/efficient. We definitely don't require rewriting the whole table every time you want to add a column - that would be impractical for large tables!
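To make that concrete, a hedged sketch (table and column names are made up):

    -- Hypothetical sketch: adding a column is a metadata-only change.
    ALTER TABLE sales ADD COLUMNS (discount DECIMAL(9,2));
    -- Existing Parquet files are untouched; discount reads back as NULL.

    -- Populating the new column for existing rows means rewriting data,
    -- e.g. into a new table:
    CREATE TABLE sales_v2 STORED AS PARQUET AS
    SELECT id, amount, 0.00 AS discount FROM sales;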
12-18-2020
01:46 PM
1 Kudo
This looks most likely like a bug in the Impala planner, with some of the estimated stats being calculated as negative and triggering that error. I haven't seen exactly this symptom before, but I think it's most likely caused by https://issues.apache.org/jira/browse/IMPALA-7604. This could cause it if the #rows estimate from the aggregation (GROUP BY) before the sort (ORDER BY) overflows and wraps around to being negative. For this to happen, the product of the distinct value counts of the columns in the GROUP BY would have to be > 2^63. (The estimate is the product of per-column distinct value counts: if you have GROUP BY a, b, c and each column has 10 distinct values, you get a product of 10 * 10 * 10 = 1000.) It's possible that the stats changed somehow, so that it wasn't being triggered before.

CDH 6.3.4 is the earliest release with a fix for it. You could work around it by dropping stats on one of the tables, but that can have some pretty bad performance implications. You might also be able to work around it by tweaking the query, e.g. grouping by different columns.
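A sketch of the drop-stats workaround mentioned above (table name hypothetical):

    -- Hypothetical sketch: remove stats so the planner can't produce the
    -- overflowing row-count estimate, then restore them afterwards.
    DROP STATS big_table;
    -- ... run the problematic GROUP BY / ORDER BY query ...
    COMPUTE STATS big_table;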
12-17-2020
01:09 PM
"... or because memory is running low or because the extra scanner threads are not needed to keep up with the consumer from the scan node." is how I should've finished that. Not sure what happened there, I did hit the main points but truncated the last sentence. That part of Impala has had a lot of improvements for performance and observability (i.e. more info in the profile) since CDH5.10, FWIW, I'd guess on a later version this wouldn't be a problem or would be easier to debug at least.
12-16-2020
01:47 PM
One difference is how fast it's reading from disk - i.e. TotalRawHdfsReadTime. In CDH 5.12 that includes both the time spent fetching metadata from the HDFS namenode and the time actually reading the data off disk. If it's only slow on one node, that probably rules out HDFS namenode slowness, which is a common cause. So it's probably actually slower doing the I/O. Note: in CDH 5.15 we split the namenode RPC time out into TotalRawHdfsOpenTime to make it easier to debug things like this.

I don't know exactly why I/O would be slower on that one node; it might require inspecting the host to see what's happening and whether there's more CPU or I/O load on it. We've seen that happen if a node is more heavily loaded than other nodes because of some kind of uneven data distribution, e.g. one file that is very frequently accessed, maybe because a dimension table is referenced in many queries. That can sometimes be addressed by setting SCHEDULE_RANDOM_REPLICA as a query hint or query option - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_hints.html or https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_schedule_random_replica.html - or even by enabling HDFS caching for the problematic table (HDFS caching spreads load across all cached replicas).

Another possible cause, based on that profile, is that it's competing for scanner threads with other queries running on the same node - AverageScannerThreadConcurrency is lower in the slow case. This can be because other concurrent queries grabbed scanner threads first (there's a global soft limit of 3x # CPUs per node), or because memory is running low, or because the extra scanner threads are not needed to keep up with the consumer from the scan node.
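On the SCHEDULE_RANDOM_REPLICA suggestion, a minimal sketch in query option form (table names hypothetical):

    -- Hypothetical sketch: randomize replica selection so scans of a
    -- hot table spread across nodes instead of piling onto one replica.
    SET SCHEDULE_RANDOM_REPLICA=true;
    SELECT d.name, COUNT(*)
    FROM big_fact f JOIN hot_dim d ON f.dim_id = d.id
    GROUP BY d.name;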
12-16-2020
12:11 PM
1 Kudo
In that case - scheduling of remote reads - for Kudu it's based on distributing the work for each scan across nodes as evenly as possible. For Kudu we randomize the assignment somewhat to even things out, but the distribution is not based on resource availability. I.e. we generate the schedule and then wait for the resources to become available on the nodes we picked. I understand that reversing that (i.e. find available nodes, then distribute work across them) would be desirable in some cases, but there are pros and cons to doing that. For remote reads from filesystems/object stores, on more recent versions, we do something a bit different: each file has affinity to a set of executors, and we try to schedule it on those so that we're more likely to get hits in the remote data cache.
12-16-2020
10:17 AM
Are you sure there isn't a stack trace for the IllegalStateException in the Impala daemon logs or catalog daemon logs? That would help match it to a bug.
12-15-2020
08:59 PM
You need to run COMPUTE STATS on the base tables referenced by the views - COMPUTE STATS directly on a view isn't supported.
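For example (table and view names hypothetical):

    -- Hypothetical sketch: stats go on the base tables, not the view.
    COMPUTE STATS fact_orders;
    COMPUTE STATS dim_customer;
    -- COMPUTE STATS my_view;   -- not supported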
12-15-2020
08:59 PM
You can limit the aggregate memory that any one pool will consume. There isn't exactly a priority option (there's no ability to preempt queries once they are running).
12-14-2020
08:38 AM
Impala can query views. Computing table stats on the tables accessed by Impala queries is necessary to get the best performance, particularly for complex queries. That's probably not the cause of whatever your user saw, but you'd need to include a query profile, or at least the query status error message, for us to give any tips.
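If you're using impala-shell, one way to capture a profile (sketch):

    -- In impala-shell, run the problematic query, then:
    PROFILE;
    -- prints the runtime profile of the most recently executed query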
12-14-2020
08:35 AM
1 Kudo
You want to enable memory-based admission control - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_control. Without that enabled, memory reservation for queries is best effort - queries just run and get whatever memory they ask for until memory is exhausted. With it enabled, queries get allocated specific amounts of memory, and queries get queued when memory is low. https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_rm_example.html is a good starting point. I'd recommend setting a minimum and maximum memory limit - probably a minimum of ~1GB and a maximum of whatever you're comfortable with a single query being given. I also gave a talk a while ago that gives an overview of some of this - https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/73000.html

That all said, scheduling is based on data locality/affinity - the read of each input file is scheduled on a node with a local replica of that file. There's also affinity to bias scheduling towards a single replica, so that the same data is read on the same node as much as possible. This minimizes network traffic and maximizes use of the OS buffer cache (i.e. maximizes the likelihood of reading the data from memory instead of disk).
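As a rough illustration of the per-query memory limit idea mentioned above (the value is just an example, not a recommendation):

    -- Hypothetical sketch: cap the memory a single query can be given.
    SET MEM_LIMIT=2gb;
    -- Subsequent queries in this session are capped at 2GB per node.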
12-10-2020
04:28 PM
Great news!
12-08-2020
09:36 AM
Glad to help! I'm excited about the S3 changes just because they simplify ingestion so much. I'll add a disclaimer here in case other people read this solution: there's *some* potential for performance impact when disabling s3guard for S3-based tables with large partition counts, just because of the difference in implementation - retrieving the listing from DynamoDB may be quicker than retrieving it from S3 in some scenarios.
12-07-2020
01:47 PM
If you have objects that have been deleted in S3 but are showing up in file listings after refreshing the table (which sounds like the case, since you dropped and recreated the table), it's possible that there's some inconsistency between the state in s3guard and the state in S3. https://docs.cloudera.com/runtime/7.0.2/cloud-data-access/topics/cr-cda-s3guard-operational-issues.html has some background on s3guard. I'm not an s3guard expert (it's a layer Impala builds on), so I don't have much to add about how you would debug/address this beyond what we have in the docs there. One option to consider is to disable s3guard to avoid it entirely. Very recently, S3 improved its consistency model to address the main problems s3guard solved (https://aws.amazon.com/s3/consistency/), so you could try disabling s3guard for that bucket to see if it solves the problem.
12-07-2020
09:21 AM
Slide 17 here has some rules of thumb - https://blog.cloudera.com/latest-impala-cookbook/. Can you mention what version you're running and whether you have any other non-standard configs set, e.g. load_catalog_in_background? We made some improvements in this area and added some options in more recent versions.
12-07-2020
09:17 AM
These are good questions that come up frequently. https://docs.cloudera.com/runtime/7.2.2/administering-kudu/topics/kudu-security-trusted-users.html discusses the issue.

In summary, Hive/Impala tables (i.e. those with entries in the Hive Metastore) are authorized in the same way regardless of whether the backing storage is HDFS, S3, Kudu, HBase, etc. - the SQL interface does the authorization to confirm that the end user has access to the table, columns, etc., then the service accesses the storage as the privileged user (Impala in this case). In this model, if you create an external Kudu table in Impala and give a user permission to access the table via Impala, they will have permission to access the data in the underlying Kudu table. The thing that closes the loophole here is that creating the external Kudu table requires very high privileges - ALL on SERVER - so a regular user can't create an external Kudu table pointed at an arbitrary Kudu cluster or table.
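For reference, a sketch of that external table DDL (names and addresses are hypothetical):

    -- Hypothetical sketch: mapping an existing Kudu table into Impala.
    -- Creating this requires ALL privileges on SERVER.
    CREATE EXTERNAL TABLE kudu_orders
    STORED AS KUDU
    TBLPROPERTIES (
      'kudu.table_name' = 'impala::shop.orders',
      'kudu.master_addresses' = 'kudu-master-1:7051'
    );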
11-30-2020
09:09 AM
@PyMeH that's not right. The Impala JDBC driver does use the HS2 protocol - JDBC is the Java language interface, and HS2 is the client-server network protocol. You should be able to use impersonation with JDBC. You'd need to configure Impala to allow a particular user to delegate - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_delegation.html. Then there is a DelegationUID option for the driver that I believe specifies the user to delegate to - https://docs.cloudera.com/documentation/other/connectors/impala-jdbc/latest/Cloudera-JDBC-Driver-for-Impala-Install-Guide.pdf
11-12-2020
08:05 PM
Impala would probably give you the fastest response time. Personally, I would write a script (Python or whatever) that fetched the queries and just ran them one by one. You could try to combine the queries in various ways if you really cared about reducing latency (I'm not sure any of these alternatives would make a massive difference, but maybe some amount). E.g. the following requires only a single scan of the table (although it might be more expensive, because you lose the filtering from the WHERE clauses):

    SELECT COUNT(CASE WHEN <where clause 1> THEN 1 END),
           COUNT(CASE WHEN <where clause 2> THEN 1 END)
    FROM MyTable;
10-29-2020
05:18 PM
I lost this in my inbox, but I'm coming back to it now. GET_COLUMNS does use some of the same machinery as other queries, but it's a metadata-only operation, and that metadata can be entirely cached. Are you saying it consistently takes 500ms even if you run queries back-to-back? The only thing I can think of is that you potentially have a large number of databases or tables in your catalog - there is a step in the GET_COLUMNS processing where it searches through all the metadata to find entries matching the tableName pattern in the request.
10-23-2020
10:28 AM
1 Kudo
https://issues.apache.org/jira/browse/IMPALA-8454 is the Apache Impala JIRA.
10-21-2020
09:49 PM
1 Kudo
I don't have insight into how to solve your particular problem, but for what it's worth, in later versions of Impala (those included in CDP), Impala will read recursively from directories within the table location.
10-14-2020
11:15 AM
1 Kudo
On-demand metadata does not exist in CDH 5.14.4. There was a technical preview version in CDH 5.16+ and CDH 6.1+ that had all the core functionality but did not perform optimally for all workloads and had some other limitations. After we got feedback and experience with the feature, we made various tweaks and fixes, and in CDH 6.3 we removed the technical preview caveat - https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_metadata.html - and there are some important tweaks in patch releases after that (i.e. 6.3.3). It is enabled by default in the latest versions of CDP. So basically, if you want to experiment and see if it meets your needs, CDH 5.16+ works, but CDH 6.3.3+ or CDP has the latest and greatest.
10-13-2020
03:44 PM
@parthk yeah, I'd expect so. Sometimes this C++ inter-version compatibility is a bear.
10-13-2020
09:25 AM
The client (i.e. the ODBC driver being used by your pyodbc program) is closing the insert operation before it finishes. I.e. it's starting the insert query, then closing it before it's finished. I don't know pyodbc well, but I wonder if it's something to do with how it's being used here. The examples I see either commit or fetch rows after execute(). I'd suggest trying either of those (calling commit() or fetching from the cursor) to see if it forces your program to wait for the insert to succeed. https://github.com/mkleehammer/pyodbc/wiki/Getting-started
10-12-2020
09:46 AM
In my understanding, the ODBC driver uses the column metadata to help implement some parts of the ODBC spec. The metadata used by the GET_COLUMNS operation should be cached in Impala's metadata cache, at least in most standard configurations I can think of. The first GET_COLUMNS on a table could be quite slow, since it'll trigger loading all the table metadata, but after that it should be very fast - 500ms seems very slow for a table with cached metadata, unless there was something like an INVALIDATE METADATA in between. Can you get a query profile for one of the GET_COLUMNS queries? That would have a timeline of how long the various steps took, like loading table metadata. What version of Impala are you running? Have you got any non-standard configurations (like different catalog modes)?
10-09-2020
10:28 AM
What OS and compiler version are you using to build the UDF? This looks like it is probably a consequence of it being built with a newer gcc version than the one used to build Impala (gcc 4.9.2).
10-07-2020
02:51 PM
I should also say - if you have a chance to upgrade your cluster, I think your experience with Impala would improve quite a lot. The last CDH 5 release, 5.16.2, is a big jump in scalability, performance and reliability from 5.10. CDH 6.3.3 is a big jump beyond that in terms of features, and CDP is another huge step, particularly for metadata performance.
10-07-2020
01:44 PM
There's no dependency on any of the Cloudera management services. Inserts are, however, going to depend on the HDFS service being healthy (i.e. namenodes, data nodes, etc.). There are various other underlying services that could be in play - Kerberos infrastructure like the KDC, the KMS if you're using certain encryption features, etc. Those logs look like the client didn't actually close the query, so I'd question whether something disrupted the client connection to the Impala daemon (e.g. a load balancer was paused, or something happened to the client process).