On the queries page:
This page lists all running queries, plus any completed queries that are archived in memory. The size of that archive is controlled with the --query_log_size command line parameter
I see GET_COLUMNS which should gather some metadata (tables and column names) from a relational store.
Puzzling is that it runs for every query launched through the Cloudera Impala ODBC driver, but not when using the HUE interface.
It takes around 500 ms for each query, and since thes metadata only changes daily, it would be nice to be able to cache the information.
On the other hand, when running the same query through HUE Impala,
it takes around 100 times more time (500 ms goes to 50 seconds) so these metadata definitely have a positive effect.
The ODBC driver uses the column metadata to help implement some parts of the ODBC spec in my understanding.
The metadata used by the GET_COLUMNS operation should be cached in Impala's metadata cache, at least in most standard configurations that I can think of. The first GET_COLUMNS on a table could be quite slow, since it'll trigger loading all the table metadata, but after that it should be very fast - 500ms seems very slow for a table with cached metadata. Unless there was something like an "INVALIDATE METADATA" in-between.
Can you get a query profile for one of the GET_COLUMNS queries? That would have a timeline of how long the various steps took, like loading table metadata.
What version of Impala are you running? Have you got any non-standard configurations (like different catalog modes)?
There's no INVALIDATE METADATA.
I have the profile, but it does not say so much:
Query (id=fb4068c90a80e5dd:33fefd200000000): Summary: Session ID: 4944df8a0226e9cf:2659b7a91151868c Session Type: HIVESERVER2 HiveServer2 Protocol Version: V6 Start Time: 2020-10-14 16:48:25.334065000 End Time: 2020-10-14 16:48:25.840527000 Query Type: DDL Query State: FINISHED Query Status: OK Impala Version: impalad version 2.11.0-cdh5.14.0 RELEASE (build d68206561bce6b26762d62c01a78e6cd27aa7690) User: SVC9125 Connected User: SVC-SQLs-000499@QAONEADR.LOCAL Delegated User: SVC9125 Network Address: 10.96.148.140:51260 Default Db: discovery_gfccmia Sql Statement: GET_COLUMNS Coordinator: ap-hdpdn10t.oneadr.net:22000 Query Timeline: 507.235ms - First row fetched: 504.979ms (504.979ms) - Unregister query: 506.469ms (1.490ms)
I am not sure about if we have any non-standard configurations. I am a business user and will try to get this from the sys admin.
I noticed we have the load_catalog_in_background set to recommended FALSE. But maybe this recommendation don't work for ODBC connections?
I managed to get some more logging details. To me this looks like a real query, not reading from any cache.
I lost this in my inbox but coming back.
GET_COLUMNS does use some of the same machinery as other queries but it's a metadata-only operation on metadata that can be entirely cached.
Are you saying it consistently takes 500ms even if you run queries back-to-back?
The only thing I can think of is if potentially you have a large number of databases or tables in your catalog. There is a step in the GET_COLUMNS processing where it searches through all the metadata to find something matching the tableName pattern in the request.
Thank you so much. We have been migrated from CDH 5 to CDH 6 (since CDH 5 is anyway closed down).
We will try investigate using your hints.
We have solved this by the combined efforts of Cloudera Support, Microsoft Support, you and us.
GET_COLUMNS takes 4 arguments:
catalogName = IMPALA
schemaName = DB_NAME
tableName = TABLE_NAME
columnName = ….
The way we had configured the external table in MS SQL Polybase 2019, only submitted the IMPALA (default) and TABLE_NAME. Ss soon as we entered the full string: IMPALA.DB_NAME.TABLE_NAME…. the run time reduced with 90 pct.
Thanks you so much for your contribution.
Best regards Henrik