Support Questions

HenrikBrondum · ‎10-12-2020

On the queries page:

Queries:
This page lists all running queries, plus any completed queries that are archived in memory. The size of that archive is controlled with the --query_log_size command line parameter

I see GET_COLUMNS which should gather some metadata (tables and column names) from a relational store.

Puzzling is that it runs for every query launched through the Cloudera Impala ODBC driver, but not when using the HUE interface.
It takes around 500 ms for each query, and since thes metadata only changes daily, it would be nice to be able to cache the information.
On the other hand, when running the same query through HUE Impala,
it takes around 100 times more time (500 ms goes to 50 seconds) so these metadata definitely have a positive effect.

Tim Armstrong · ‎10-12-2020

The ODBC driver uses the column metadata to help implement some parts of the ODBC spec in my understanding.

The metadata used by the GET_COLUMNS operation should be cached in Impala's metadata cache, at least in most standard configurations that I can think of. The first GET_COLUMNS on a table could be quite slow, since it'll trigger loading all the table metadata, but after that it should be very fast - 500ms seems very slow for a table with cached metadata. Unless there was something like an "INVALIDATE METADATA" in-between.

Can you get a query profile for one of the GET_COLUMNS queries? That would have a timeline of how long the various steps took, like loading table metadata.

What version of Impala are you running? Have you got any non-standard configurations (like different catalog modes)?

HenrikBrondum · ‎10-14-2020

Tim,

Thanks!

There's no INVALIDATE METADATA.

I have the profile, but it does not say so much:

Query (id=fb4068c90a80e5dd:33fefd200000000):
  Summary:
    Session ID: 4944df8a0226e9cf:2659b7a91151868c
    Session Type: HIVESERVER2
    HiveServer2 Protocol Version: V6
    Start Time: 2020-10-14 16:48:25.334065000
    End Time: 2020-10-14 16:48:25.840527000
    Query Type: DDL
    Query State: FINISHED
    Query Status: OK
    Impala Version: impalad version 2.11.0-cdh5.14.0 RELEASE (build d68206561bce6b26762d62c01a78e6cd27aa7690)
    User: SVC9125
    Connected User: SVC-SQLs-000499@QAONEADR.LOCAL
    Delegated User: SVC9125
    Network Address: 10.96.148.140:51260
    Default Db: discovery_gfccmia
    Sql Statement: GET_COLUMNS
    Coordinator: ap-hdpdn10t.oneadr.net:22000
    Query Timeline: 507.235ms
       - First row fetched: 504.979ms (504.979ms)
       - Unregister query: 506.469ms (1.490ms)

I am not sure about if we have any non-standard configurations. I am a business user and will try to get this from the sys admin.

I noticed we have the load_catalog_in_background set to recommended FALSE. But maybe this recommendation don't work for ODBC connections?

HenrikBrondum · ‎10-15-2020

I managed to get some more logging details. To me this looks like a real query, not reading from any cache.

I1015 15:49:44.158821 24416 impala-hs2-server.cc:591] GetColumns(): request=TGetColumnsReq {
01: sessionHandle (struct) = TSessionHandle {
01: sessionId (struct) = THandleIdentifier {
01: guid (string) = "\xcd\xc8\xb1\xe8\xf0\xa3H[\x82\xf1\"u\xe0~SP",
02: secret (string) = "\xfd\x81y\xa1\xf3*A\xb0\xae\x1e\a\xd7\f\xa3\xf0I",
},
},
04: tableName (string) = "mi\\_report\\_workflow\\_fact",
05: columnName (string) = "%",
}
I1015 15:49:44.638825 24416 impala-server.cc:981] Query f94c41fd14e503dd:10ba472900000000 has timeout of 5m
I1015 15:49:44.638913 24416 impala-hs2-server.cc:604] GetColumns(): return_val=TGetColumnsResp {
01: status (struct) = TStatus {
01: statusCode (i32) = 0,
},
02: operationHandle (struct) = TOperationHandle {
01: operationId (struct) = THandleIdentifier {
01: guid (string) = "\xdd\x03\xe5\x14\xfdAL\xf9\x00\x00\x00\x00)G\xba\x10",
02: secret (string) = "\xdd\x03\xe5\x14\xfdAL\xf9\x00\x00\x00\x00)G\xba\x10",
},
02: operationType (i32) = 6,
03: hasResultSet (bool) = true,
},
}
I1015 15:49:44.640828 24416 impala-hs2-server.cc:699] CloseOperation(): query_id=f94c41fd14e503dd:10ba472900000000
I1015 15:49:44.640846 24416 impala-server.cc:992] UnregisterQuery(): query_id=f94c41fd14e503dd:10ba472900000000
I1015 15:49:44.640856 24416 impala-server.cc:1075] Cancel(): query_id=f94c41fd14e503dd:10ba472900000000

Tim Armstrong · ‎10-29-2020

I lost this in my inbox but coming back.

GET_COLUMNS does use some of the same machinery as other queries but it's a metadata-only operation on metadata that can be entirely cached.

Are you saying it consistently takes 500ms even if you run queries back-to-back?

The only thing I can think of is if potentially you have a large number of databases or tables in your catalog. There is a step in the GET_COLUMNS processing where it searches through all the metadata to find something matching the tableName pattern in the request.

HenrikBrondum · ‎11-04-2020

Tim,

Thank you so much. We have been migrated from CDH 5 to CDH 6 (since CDH 5 is anyway closed down).

We will try investigate using your hints.

HenrikBrondum · ‎12-10-2020

Tim,

We have solved this by the combined efforts of Cloudera Support, Microsoft Support, you and us.

GET_COLUMNS takes 4 arguments:

catalogName = IMPALA

schemaName = DB_NAME

tableName = TABLE_NAME

columnName = ….

The way we had configured the external table in MS SQL Polybase 2019, only submitted the IMPALA (default) and TABLE_NAME. Ss soon as we entered the full string: IMPALA.DB_NAME.TABLE_NAME…. the run time reduced with 90 pct.

Thanks you so much for your contribution.

Best regards Henrik

Tim Armstrong · ‎12-10-2020

Great news!

Cloudera Community

Support Questions

GET_COLUMS when launching queries through ODBC