Member since: 07-29-2015
Posts: 463
Kudos Received: 121
Solutions: 87
My Accepted Solutions
Views | Posted
---|---
52 | 11-27-2019 11:18 AM
133 | 11-06-2019 10:08 AM
98 | 10-23-2019 02:10 PM
282 | 07-24-2019 04:28 PM
322 | 06-18-2019 02:38 PM
04-05-2019
08:53 AM
1 Kudo
You can apply memory limits at two levels. At the Impala daemon level, a limit caps the total memory consumption of the process - in part so that it doesn't exceed the physical memory available, but also so that it leaves memory for other services running on the host.

You can (and should) also apply memory limits at the query level via the MEM_LIMIT query option (the one we were talking about). That controls how much of the process memory limit a single query can get - e.g. SET MEM_LIMIT=2gb in impala-shell caps that query at 2GB. If you're using admission control, you can configure query memory limits that get applied to all queries in a resource pool.

It would be weird if running a query caused the Impala daemon memory limit to change, and I'm not sure what you would even expect to happen if you ran two queries at the same time.

I don't know if this helps, but I gave a talk recently that summarised some of the concepts here. There are slides linked from here - https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/73000

By the way, only allocating 1GB to each Impala daemon is a bad idea for a production deployment - that's simply not enough to run a lot of more complex queries on larger data sets, particularly if you are running multiple concurrent queries. We have some sizing guidelines - https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html#concept_usf_qln_3bb
04-04-2019
01:49 PM
I just tested with ClouderaImpalaJDBC-2.6.4.1005 and it works for me with the following JDBC URL. I can see in the query profile that it takes effect.

static final String DB_URL = "jdbc:impala://localhost:21050/functional_parquet;mem_limit=3gb";

From the profile:

Query Options (set by configuration): MEM_LIMIT=3221225472
03-29-2019
10:59 AM
1 Kudo
Hi @ChineduLB, there is no real difference between Impala and Hive tables - Impala and Hive should be able to read and write the same tables, including partitioned tables, etc.
03-29-2019
10:58 AM
I filed a JIRA with the Apache project so that there's more visibility into this issue: https://issues.apache.org/jira/browse/IMPALA-8373
03-26-2019
10:19 AM
Impala expects your UDF code and dependencies to be in a single .so, so you'd have to statically link any libraries you depend on.
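To make that concrete, here's a minimal sketch - the function name, the helper library, and the g++ command are made up for illustration, and the exact build flags will depend on your toolchain:

// my_udf.cc - everything the UDF needs must end up inside this one .so.
#include <impala_udf/udf.h>

using namespace impala_udf;

// Trivial example UDF; imagine it calling into a third-party helper library.
IntVal MyFn(FunctionContext* context, const IntVal& arg) {
  if (arg.is_null) return IntVal::null();
  return IntVal(arg.val + 1);
}

// Hypothetical build: link the dependency statically (a .a archive built
// with -fPIC) so its code is baked into the shared object:
//   g++ -shared -fPIC my_udf.cc /path/to/libhelper.a -o my_udf.so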
03-25-2019
03:40 PM
1 Kudo
This isn't possible unless you include a timestamp or sequence number in every record. There's no concept of an order of rows built into Hive or Impala.
03-25-2019
12:36 AM
void FunnelInit(FunctionContext* context, StringVal* val) {
  EventLogs* eventLogs = new EventLogs();
  val->ptr = (uint8_t*) eventLogs;
  // Exit on failed allocation. Impala will fail the query after some time.
  if (val->ptr == NULL) {
    *val = StringVal::null();
    return;
  }
  val->is_null = false;
  val->len = sizeof(EventLogs);
}

I did another scan and the memory management in the above function is also slightly problematic - the memory attached to the intermediate StringVal would be better allocated from the Impala UDF interface so that Impala can track the memory consumption. E.g. see https://github.com/cloudera/impala-udf-samples/blob/bc70833/uda-sample.cc#L76 .

I think the real issue though is the EventLogs data structure and the lack of a Serialize() function. It's a somewhat complex nested structure with the string and vector. In order for the UDA to work, you need a Serialize() function that flattens the intermediate result into a single StringVal. This is pretty unavoidable since Impala needs to be able to send the intermediate values over the network and/or write them to disk, and Impala doesn't know enough about your data structure to do it automatically. Our docs mention this here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html#udafs

Putting it into practice is a bit tricky. One working example is the implementation of reservoir sampling in Impala itself. Unfortunately I think it's a little over-complicated: https://github.com/apache/impala/blob/df53ec/be/src/exprs/aggregate-functions-ir.cc#L1067

The general pattern for complex intermediate values is to have a "header" that lets you determine whether the intermediate value is currently serialized, then either the deserialized representation, or the serialized representation after the "header" using a flexible array member or similar - https://en.wikipedia.org/wiki/Flexible_array_member. The Serialize() function converts the representation by packing any nested structures into a single StringVal with the header in front. Then other functions can switch back to the deserialized representation. Or you can sometimes be clever and avoid the conversion (that's what the reservoir sample function above is doing, and part of why it's overly complex). Anyway, a really rough illustration of the idea is as follows:

struct DeserializedValue {
  ...
};

struct IntermediateValue {
  bool serialized;
  union {
    DeserializedValue val;
    char buf[0];
  };

  StringVal Serialize() {
    if (serialized) {
      // Just copy the serialized representation to the output StringVal.
    } else {
      // Flatten val into an output StringVal.
    }
  }

  void DeserializeIfNeeded() {
    if (serialized) {
      // Unpack buf into val.
    }
  }
};

Just as a side note, the use of the C++ built-in vector and string in the intermediate value can be problematic if they're large, since Impala doesn't account for the memory involved. But that's very much a second-order problem compared to it not working at all.
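To make the tracked-allocation and Serialize() ideas a bit more concrete, here's a minimal sketch with a deliberately simplified state struct - FunnelState and its field are made up for illustration, and I've skipped the header/union machinery; the StringVal(context, len) constructor is the udf.h way to allocate a buffer that Impala tracks:

#include <cstdint>
#include <cstring>
#include <new>
#include <vector>
#include <impala_udf/udf.h>

using namespace impala_udf;

// Hypothetical simplified intermediate state: just a list of event times.
struct FunnelState {
  std::vector<int64_t> event_times;
};

void FunnelInit(FunctionContext* context, StringVal* val) {
  FunnelState* state = new (std::nothrow) FunnelState();
  if (state == NULL) {
    *val = StringVal::null();
    return;
  }
  val->is_null = false;
  val->ptr = reinterpret_cast<uint8_t*>(state);
  val->len = sizeof(FunnelState);
}

// Flatten the state into one self-contained, context-allocated buffer so
// Impala can ship it over the network or spill it to disk.
StringVal FunnelSerialize(FunctionContext* context, const StringVal& val) {
  if (val.is_null) return StringVal::null();
  FunnelState* state = reinterpret_cast<FunnelState*>(val.ptr);
  size_t bytes = state->event_times.size() * sizeof(int64_t);
  StringVal result(context, bytes);  // memory tracked by Impala
  if (bytes > 0) memcpy(result.ptr, state->event_times.data(), bytes);
  delete state;  // we own the heap struct; Impala owns 'result'
  return result;
}

The Merge() and Finalize() functions then need to understand the flattened layout, which is where the header byte from the pattern above earns its keep.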
03-22-2019
08:38 PM
delete src.ptr;

That is a bug that will definitely cause Impala to crash if you run the UDA enough times. Impala manages that memory and it's not valid to free it yourself! The Impala runtime automatically manages memory for StringVal inputs.
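If the intent was to hold onto the input's bytes, the safe pattern is to copy them into context-tracked memory - a rough sketch only, where FunnelUpdate and the surrounding state are stand-ins for your UDA:

#include <cstdint>
#include <cstring>
#include <impala_udf/udf.h>

using namespace impala_udf;

void FunnelUpdate(FunctionContext* context, const StringVal& src, StringVal* dst) {
  if (src.is_null) return;
  // Never: delete src.ptr; - the runtime owns input buffers.
  // To keep the bytes beyond this call, copy them into memory allocated
  // through the FunctionContext so Impala tracks it:
  uint8_t* copy = context->Allocate(src.len);
  if (copy == NULL) return;  // allocation failure is recorded on the context
  memcpy(copy, src.ptr, src.len);
  // ... stash 'copy' in the intermediate state; release it later with
  // context->Free(copy), never with delete/free.
}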
03-14-2019
10:31 AM
I think you're probably running into this issue: https://issues.apache.org/jira/browse/IMPALA-8109

It would help to provide "SHOW FILES" output for the table and the Impala version that you're running (i.e. the output of "select version()").
03-14-2019
10:29 AM
What file format are you using? Can you attach an Impala query profile from the query?
03-07-2019
09:14 AM
Yeah, we need to make some changes in Impala to optimise this case (large SELECT result sets) better; some of that work is already in Impala. If you're doing large extracts of data, it's often better to do a "CREATE TABLE AS SELECT" into a text table and download those files directly from the filesystem, if that's possible.
03-07-2019
09:12 AM
1 Kudo
The query profile and/or execution summary is the best reference for this. Parallelism for Parquet files depends on the number of HDFS blocks (which is usually the same as the number of Parquet files), so if your tables only have one HDFS block each you may not get parallelism.
03-07-2019
09:02 AM
Oh, the best reference for building Impala is the Apache wiki. https://cwiki.apache.org/confluence/display/IMPALA/Building+native-toolchain+from+scratch+and+using+with+Impala is a bit more hidden and covers how to build the third-party dependencies.
03-07-2019
09:01 AM
You'd probably do better having a conversation about this on dev@impala.apache.org - that's where a lot of this kind of discussion happens. I can give a quick answer: no, you can't build Impala on aarch64 without modifications; it's x86-64 only at the moment. I imagine most of the third-party code works on aarch64, but I haven't tried it. It would require a bit of legwork to track down all the places that assume x86-64 (intrinsics like you mentioned, but also some places in query compilation where we assume the x86-64 calling convention). The good news is that aarch64 is little-endian and has good LLVM support, which removes two major obstacles.
03-05-2019
09:54 AM
Impala is a streaming SQL engine, so query execution can happen at the same time as rows are returned to the client. In your case, we don't scan the whole table, stage the rows somewhere, and then return them to the client - rather, Impala returns rows to the client at the same time as it's scanning the table. The bottleneck is likely in the client or the network. Impyla is not particularly fast at parsing incoming rows and converting them into Python objects; the Impala server is much, much faster. There's also a known issue where latency between the client and the server can affect the time taken to return rows: https://issues.apache.org/jira/browse/IMPALA-1618
03-05-2019
09:50 AM
Impala is not designed for traditional OLTP and doesn't have transaction support that would line up with what TPC-C expects.
02-28-2019
05:13 PM
I believe it's a limitation of the LEAD/LAG implementation that the second argument (the offset) has to be a constant - e.g. LEAD(ts, 2) is allowed, but LEAD(ts, some_column) is not.
02-27-2019
06:03 PM
@amiroh Others might find it easier to help if you include the SQL you are running and the error message you encountered. If you can't share the exact SQL because of sensitive column or table names, simplifying the query and renaming columns would be ideal.
02-13-2019
11:47 AM
I don't think dictionary encoding makes a difference to the effectiveness of min-max stats, because the data is still going to be in the file in the same order regardless.
02-12-2019
10:36 AM
I'm not sure that parquet-cpp has any built-in way to sort data - your client code might have to do the sorting before feeding it to parquet-cpp.
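I.e. something along these lines on the client side - a sketch only, where the Row layout is a placeholder and the actual parquet-cpp writer calls are omitted:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical in-memory row; 'key' is the column you want the Parquet
// min-max statistics to be selective on.
struct Row {
  int64_t key;
  // ... other columns ...
};

// Sort each batch by 'key' before handing it to the parquet-cpp writer,
// so row groups and pages cover narrow, mostly non-overlapping key ranges.
void SortBatch(std::vector<Row>* rows) {
  std::sort(rows->begin(), rows->end(),
            [](const Row& a, const Row& b) { return a.key < b.key; });
}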
02-12-2019
08:21 AM
1 Kudo
The external tool that you are using would have to support ordering the data by those columns. E.g. if you're using Hive, it supports SORT BY. If you're writing the data from custom code, that code would need to sort it before writing it to Parquet.
02-08-2019
03:59 PM
I unfortunately don't know too many of the details of LDAP. Impala doesn't do anything sophisticated to create the directories - it just calls mkdir() with S_IRWXU|S_IRWXG|S_IRWXO to create the impala-scratch subdirectory and any missing parent directories.
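In other words, effectively the following for each missing directory (the path here is just an example):

#include <sys/stat.h>
#include <sys/types.h>

int main() {
  // 0777 permissions, further restricted by the process umask.
  return mkdir("/data/1/impala/impala-scratch", S_IRWXU | S_IRWXG | S_IRWXO);
}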
02-08-2019
08:51 AM
1 Kudo
I'll assume that you have some directories configured and passed in via the --scratch_dirs argument (you can check the debug page on port 25000 or the impalad.INFO log to confirm the flag value). Then what likely happened is that the directories weren't usable for some reason. Any errors that prevent using the directories are logged at startup.
02-07-2019
04:02 PM
3 Kudos
The observability should get better in CDH 6.1 - queued queries are cancellable and have more information available in the profile about why they were queued. There isn't an aggregate limit on concurrency across pools. There is a limit on the number of connections to each impalad (--fe_service_threads), but since the queries were submitted, that proves the client was at least connected! One possibility is that, if you have "Maximum memory" set on your resource pools, memory-based admission control is limiting admission based on available memory. Another possibility is that the queries in the CREATED state aren't queued in admission control, but are rather in planning, e.g. blocked waiting to load metadata. Do you have any profiles from the queries that were queued for a long time?
02-07-2019
09:41 AM
1 Kudo
If you want to do the implicit join between the table and the nested collection, you need to reference the nested collection using the alias that you used for the table. Otherwise the top-level table and the nested collection are treated as independent table references and the query means "return the cartesian product of the tables". I.e. you want to rewrite as follows:

select rta.transaction_purchase_id, rta.cigarette_transaction_flag,
       rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta,
     rta.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;

That will solve your issue.
01-17-2019
05:14 PM
Yeah looks like that is it! If the queue has space the query would just get queued instead.
01-17-2019
04:51 PM
There's a big "Refresh Dynamic Resource Pools" button in the bottom-left of the "Impala Admission Control" screen when the configs are stale.
01-17-2019
02:25 PM
@Daggers Is it possible that you didn't refresh the admission control pool configurations after changing them?
01-10-2019
08:56 AM
The Cloudera Manager queries page has the bytes spilled to disk as one of the metrics it tracks per query. Also in CM, there's a "Cluster utilization report" that has some aggregate information about how much data is spilled to disk over longer time windows. Also, if you're looking at the scratch files themselves the query ID is embedded in the file name (although that's an implementation detail and could change in the future).
01-09-2019
12:57 PM
They're used for spill to disk - see https://www.cloudera.com/documentation/enterprise/latest/topics/impala_scalability.html#spill_to_disk