Member since: 10-10-2017
Posts: 181
Kudos Received: 3
Solutions: 8
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1007 | 08-23-2021 05:48 PM |
| | 4876 | 08-19-2021 06:47 PM |
| | 346 | 07-30-2021 12:21 AM |
| | 1141 | 02-07-2020 04:13 AM |
| | 1187 | 01-09-2020 03:22 PM |
10-04-2021
07:58 PM
Could you provide more details about your OS and Impala ODBC Driver version? I tested Impala ODBC Driver 2.6.13 on Linux. The driver doesn't renew the ticket for each DSN-less connection on my host.
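To check the ticket behavior on your side, you can watch the Kerberos credential cache while the application opens DSN-less connections. A quick sketch (the principal is a placeholder):

```
# Get a fresh TGT first for a clean baseline (placeholder principal)
kinit user@EXAMPLE.COM

# After each DSN-less connection, check whether the ticket timestamps changed
klist
```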
08-29-2021
08:42 PM
In my opinion, you can start by checking the cached file metadata in Impala. For example, you can run "show table stats" to find out the number of files and the total file size under the table. You don't need to run "compute stats" in advance because Impala gets this information from the NameNode when it loads the table's metadata. Then you can use "hdfs dfs -ls -R" to list all the files under the table directory. By comparing the output from Impala and HDFS, you can confirm whether the file metadata in Impala is stale, or whether some partition is missing in Impala. The findings in this step will determine what you should investigate next.
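For example (database, table, and warehouse path are placeholders):

```
# In Impala: file counts and sizes per partition from the cached metadata
impala-shell -q "show table stats mydb.mytable"

# On HDFS: the actual files under the table directory
hdfs dfs -ls -R /user/hive/warehouse/mydb.db/mytable
```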
08-25-2021
10:14 PM
This issue sounds like Impala had stale metadata. But after Impala starts up, it doesn't have any cached metadata, so it loads metadata from HMS and the HDFS NameNode either immediately (if load_catalog_in_background is true) or when a query is run. So the metadata should be up to date. Is it possible that there is some special workflow in your maintenance routine, so the tables are updated after Impala loads the metadata? You can check the timestamps of the files in the tables.
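For example, if files were added behind Impala's back after the metadata was loaded, a refresh makes them visible again (names and paths are placeholders):

```
# Check when the files were last modified
hdfs dfs -ls -R /user/hive/warehouse/mydb.db/mytable

# If they are newer than the last metadata load, refresh the table in Impala
impala-shell -q "refresh mydb.mytable"
```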
08-23-2021
05:48 PM
1 Kudo
There is a calculation in The Impala Cookbook to estimate the heap memory usage for metadata:

- num of tables * 5KB + num of partitions * 2KB + num of files * 750B + num of file blocks * 300B + sum(incremental col stats per table)
- Incremental stats: for each table, num columns * num partitions * 400B

Usually, insufficient catalog heap memory is caused by a large number of small files and/or partitions. For example, a single 512MB file (four 128MB blocks, three replicas each) needs 750B + 4 * 3 * 300B = 4,350B. But if we split this file into 128 4MB files, these files will use 128 * 750B + 128 * 3 * 300B = 211,200B, nearly 50 times more!
08-19-2021
06:47 PM
This is a bug fixed by HIVE-15642. You can work around this issue by setting hive.metastore.dml.events to false.
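If you manage the configuration through hive-site.xml (in CM, the corresponding safety valve), a minimal sketch of the property would be:

```xml
<!-- Workaround for HIVE-15642: stop firing DML events from the metastore -->
<property>
  <name>hive.metastore.dml.events</name>
  <value>false</value>
</property>
```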
08-10-2021
03:21 AM
You need a UDF to parse the XML string and extract the values. As far as I know, Impala doesn't have a native function to do it.
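One possible workaround, if Hive is available to you, is to do the extraction in Hive with its built-in xpath UDFs and query the result from Impala afterwards. A sketch (table and column names are hypothetical):

```sql
-- Hive (not Impala): pull a value out of an XML string column
SELECT xpath_string(xml_payload, '/order/customer/id') AS customer_id
FROM xml_events;
```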
07-30-2021
12:21 AM
Calculations on constants are evaluated during query compilation, not at run time. If you look at the query plans below, you can see that both "months_add(now(), -1)" and "months_add('2020-01-01', -1)" are already evaluated in the predicates, so Impala doesn't need to repeat the evaluation for each row.

```
[localhost:21000] default> explain select * from mytable where tran_timestamp > months_add(now(), -1);
Query: explain select * from mytable where tran_timestamp > months_add(now(), -1)
+------------------------------------------------------------------------------------+
| Explain String                                                                      |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=2                              |
| Per-Host Resource Estimates: Memory=10MB                                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.mytable                                                                     |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:EXCHANGE [UNPARTITIONED]                                                         |
| |                                                                                   |
| 00:SCAN HDFS [default.mytable]                                                      |
|    HDFS partitions=1/1 files=0 size=0B                                              |
|    predicates: tran_timestamp > TIMESTAMP '2021-06-30 17:08:33.432148000'           |
|    row-size=16B cardinality=0                                                       |
+------------------------------------------------------------------------------------+
Fetched 13 row(s) in 0.99s

[localhost:21000] default> explain select * from mytable where tran_timestamp > months_add('2020-01-01', -1);
Query: explain select * from mytable where tran_timestamp > months_add('2020-01-01', -1)
+------------------------------------------------------------------------------------+
| Explain String                                                                      |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B Threads=2                              |
| Per-Host Resource Estimates: Memory=10MB                                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| default.mytable                                                                     |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:EXCHANGE [UNPARTITIONED]                                                         |
| |                                                                                   |
| 00:SCAN HDFS [default.mytable]                                                      |
|    HDFS partitions=1/1 files=0 size=0B                                              |
|    predicates: tran_timestamp > TIMESTAMP '2019-12-01 00:00:00'                     |
|    row-size=16B cardinality=0                                                       |
+------------------------------------------------------------------------------------+
Fetched 13 row(s) in 0.02s
```
02-07-2020
04:13 AM
If a Sqoop job fails or crashes in the middle of importing a table, the table is only partially imported. When you run the job again, it starts from scratch, so you need to clear the partially imported data first. Alternatively, if you know which rows have not been imported yet, you can add a WHERE condition when you restart the job to import only the remaining rows.
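For example, with Sqoop's --where option (connection string, table, and column names are placeholders):

```
# Re-import only the rows that were not loaded before the crash
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username etl_user -P \
  --table orders \
  --where "order_id > 1000000" \
  --target-dir /user/etl/orders_rest
```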
01-09-2020
03:22 PM
1 Kudo
Impala shares metadata and data with Hive, so you can use Sqoop to import the tables from your database into Hive. Don't forget to run 'invalidate metadata' in Impala after the ingestion is done; otherwise you won't see the imported tables.
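A minimal sketch (connection details and names are placeholders):

```
# Import a table straight into Hive
sqoop import \
  --connect jdbc:mysql://dbhost.example.com/sales \
  --username etl_user -P \
  --table customers \
  --hive-import --hive-database default

# Make the new table visible to Impala
impala-shell -q "invalidate metadata default.customers"
```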
01-09-2020
03:08 PM
Sorry for the delay. Here is an example of an HAProxy configuration: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_proxy.html#tut_proxy. Please check whether you set balance to source for the ODBC proxy. I have seen some clients (I can't remember whether it was an ODBC or JDBC client) open a new connection and then reuse an existing session. If balance is not set to source, HAProxy can forward the new connection to another Impala daemon, so the client hits 'Invalid query handle'. You can enable debug-level logging for the Impala ODBC Driver. After the issue occurs, you can trace the operations in the ODBC driver logs. If the client opened a connection shortly before "Invalid query handle", the LB may have forwarded that connection to the wrong Impala daemon. You can then look up the impalad logs by timestamp to figure out which Impala daemon the client connected to and check whether that daemon is the coordinator of the query (which can be found in the query profile). Otherwise, you need to trace the query id in the coordinator's impalad logs to find out why it returned "Invalid query handle" to the client.
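The relevant part of the HAProxy configuration would look roughly like this (host names are placeholders; see the Cloudera doc above for a full example):

```
# Client connections must stick to one coordinator, hence balance source
listen impala-odbc
    bind :21050
    mode tcp
    balance source
    server impalad1 impalad01.example.com:21050 check
    server impalad2 impalad02.example.com:21050 check
```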
01-01-2020
05:41 PM
If the query didn't fail immediately after the error "[Cloudera][ODBC] (11180) SQLGetInfo property not found", we can ignore that message, or you can try the UseNativeQuery option to avoid this kind of message. From my point of view, "Invalid query handle" was the most likely cause. If I were you, I would check whether the query or session timed out or was cancelled. If there is a load balancer, I'd also check whether the LB sent a request to the wrong Impala daemon by chance.
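If you want to try UseNativeQuery, it can be appended to the connection string; a sketch with placeholder host settings:

```
Driver=Cloudera ODBC Driver for Impala;Host=impalad01.example.com;Port=21050;UseNativeQuery=1
```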
12-18-2019
04:05 PM
It is a common error message when the client can't receive an expected Kerberos/SSL handshake packet from the peer. I'd suggest you check the TCP connectivity between the client and the server first. It can be done easily by running "telnet <server_host> <server_port>". If that looks good, please double-check the authentication settings on both sides. If this issue happens intermittently, it might be caused by load on the server.
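For example (host and port are placeholders):

```
# Basic TCP reachability
telnet impalad01.example.com 21050

# If SSL is enabled, this also exercises the SSL handshake itself
openssl s_client -connect impalad01.example.com:21050
```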
12-16-2019
12:49 PM
The error was "Password verification failed". You set SSLKeyStore to a truststore.jks file but it should be a keystore.jks file. Please change this property and try again.
12-15-2019
08:00 PM
You mentioned that you could connect to Impala without a user and password, so I suppose you didn't change the authentication mechanism but only enabled SSL for the Impala service. Then what you need to do is add SSL, SSLKeyStore and SSLKeyStorePwd to your JDBC connection string:

```
jdbc:impala://Impala_Daemon_Host:21050/default;SSL=1;SSLKeyStore=<keystore.jks>;SSLKeyStorePwd=<keystore_pwd>
```

You don't need UID and PWD unless you have to set AuthMech to 3, which is usually required when Impala uses LDAP authentication.
11-28-2019
04:23 PM
Hi Miguel, First of all, you are missing the ORDER BY clause for the rank() function. If you add the same ORDER BY clause to both the rank() and sum(count(0)) functions, I think the answer is YES. You can run explain to review the query plan. I did it on my cluster, and what I see is that both rank() and sum() end up in a single ANALYTIC node, which means the analytic functions do reuse the partitions:

```
| 03:ANALYTIC                                                  |
| |  functions: rank(), sum(count(*))                          |
| |  partition by: x, y                                        |
| |  order by: x ASC, y ASC                                    |
| |  window: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
```

But if you only add the ORDER BY clause to the rank() function, you'll see two ANALYTIC nodes:

```
| 04:ANALYTIC                                                  |
| |  functions: sum(count(*))                                  |
| |  partition by: x, y                                        |
| |                                                            |
| 03:ANALYTIC                                                  |
| |  functions: rank()                                         |
| |  partition by: x, y                                        |
| |  order by: x ASC, y ASC                                    |
| |  window: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
```
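For reference, the single-ANALYTIC-node plan above corresponds to a query shape like this (table and column names are hypothetical):

```sql
-- Both analytic functions share the same PARTITION BY and ORDER BY,
-- so the planner can compute them in one ANALYTIC node.
SELECT x, y,
       rank()        OVER (PARTITION BY x, y ORDER BY x, y) AS rnk,
       sum(count(*)) OVER (PARTITION BY x, y ORDER BY x, y) AS cnt
FROM t
GROUP BY x, y;
```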
Regards, Robbie
11-27-2019
05:09 PM
Hi Amn, The message was "1 errors(s) seen monitoring completed queries", and it was reported by the CM agent, so you should be able to find the detailed error messages in cloudera-scm-agent.log*. The ShortCircuitCache message was from the HDFS API; you can ignore it as it's only a warning. The HDFS API should retry and bypass the ShortCircuitCache if it fails. Regards, Robbie
11-24-2019
03:58 PM
Hi Alexis, I'm not sure if I get your point correctly. I guess you want to test Impala query performance. Usually we can redirect the output to /dev/null or a local file, but the network and disk IO throughput will affect the test results, so you want the coordinator to execute the query without sending the results back to the client. Is that correct? As far as I know, there isn't such a feature in the Impala coordinator to discard rows when they are ready to be fetched by the client. Instead, you can try "CREATE TABLE new_table AS <original_select_query>". This does the same work as the original SELECT query but saves the result into a new HDFS table instead of sending it back to the client. In this way, you can get a very close performance assessment of the SELECT query. Regards, Robbie
11-21-2019
06:51 PM
Hi Amn, Your issue is different from the one addressed in the KB article. The CM agent periodically collects archived query profiles from the Impala Daemon and saves them to the CM database. According to your logs, the file /var/log/impalad/profiles/impala_profile_log_1.1-1574308887326 was gone when the CM agent tried to open it:

```
[21/Nov/2019 17:39:58 +0000] 5415 Monitor-HostMonitor throttling_logger ERROR (2 skipped) Failed to collect NTP metrics
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.16.2-py2.7.egg/cmf/monitor/host/ntp_monitor.py", line 48, in collect
    self.collect_ntpd()
  ...
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/ClusterStatsLogStreaming-UNKNOWN-py2.7.egg/clusterstats/log/streaming/file.py", line 41, in open_log_file
    handle = open(self.__path)
IOError: [Errno 2] No such file or directory: '/var/log/impalad/profiles/impala_profile_log_1.1-1574308887326'
```

I suggest you investigate why the file was gone. How often do you see this kind of error?
11-21-2019
06:31 PM
I'd suggest you use a backslash to escape special characters. Please try this query:

```sql
select * from hi_problem_view
where state like 'Won\'t fix'
```
11-19-2019
06:27 PM
If you have a support contract with Cloudera, you can contact us via the details at the following link under ‘How can I contact support?’ and we will be happy to assist: https://my.cloudera.com/faq.html#support If you do not yet have a support contract with Cloudera, you will need to speak with a sales representative first in order to contract with support, via the following link: https://www.cloudera.com/contact-sales.html
11-15-2019
04:07 PM
Usually, I would inspect the profiles of the failed queries to get a rough idea, then determine which logs I should check next. If you have a contract with Cloudera, it's worth filing a ticket; Cloudera Support will help you troubleshoot step by step.
11-15-2019
01:41 PM
Several reasons can cause this error. Check your ODBC settings and the application/ODBC driver versions. For example, if one side is kerberized/SSL-enabled but the other side isn't, or a 32-bit application runs against a 64-bit ODBC driver, you can see this kind of error. If everything looks good, you need to check the Impala ODBC Driver logs for more details.
11-14-2019
03:51 PM
You are welcome. The query was cancelled due to some exception, but there are no details of the exception in your query info. You can download the full text query profile from CM. If you still can't see the details in the query profile, you need to grep the query id 3348f74b129b0dae:1666447600000000 in the Impala INFO log files on p1i-hdp-srv11.lnt.com. You should be able to see which query fragment instance hit the exception. Then you can grep that instance id in the Impala INFO log files on the host where the instance was running to look for the cause.
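For example (the log paths may differ on your hosts, and the instance id is a placeholder for whatever the first grep reports):

```
# On the coordinator (p1i-hdp-srv11.lnt.com): find the failing fragment instance
grep '3348f74b129b0dae:1666447600000000' /var/log/impalad/impalad.INFO*

# On the host running that instance: search for the instance id reported above
grep '<instance_id>' /var/log/impalad/impalad.INFO*
```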
11-14-2019
01:26 PM
The first error means the connect() call failed because the peer Impala daemon didn't accept the connection in time. You can look at the CM charts to check the CPU usage and the number of threads in the Impala daemon on the peer node around that time. Similar to "Resource temporarily unavailable", this error could also be related to CPU load or thread resource limits. The second error means the connection was lost; you can review the Impala daemon logs on p1i-hdp-srv07.lnt.com to look for the reason.
11-13-2019
06:26 PM
Actually, I can see two different types of error in your error messages. "Connection refused" usually means port 22000 was not open on the peer node, so I'd check whether the Impala daemon on the peer node (<node_address>) was stopped at that time. The "Resource temporarily unavailable" error was most likely related to thread resource limits: the Impala daemon couldn't create a thread due to insufficient resources and threw this error. I suggest having a look at IMPALA-5605, which should be helpful. [1] https://github.com/apache/impala/blob/53ef115e8e5cac231ef948f8670106c348d197fe/be/src/util/thread.cc#L319
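A few quick checks for process/thread limits on the affected host (the impalad pid is a placeholder):

```
# Max processes for the user running impalad (threads count against this)
ulimit -u

# System-wide limits
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max

# Limits of the running impalad process
cat /proc/<impalad_pid>/limits
```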
06-05-2019
02:17 AM
The "num_threads_per_disk" is the number of IO threads per disk. By default, there should be 1 IO thread per rotational disk or 8 IO threads per solid disk. According to the code, we only need to consider setting num_threads_per_disk where the CPU doesn't support SSE4.2 (to be honest I have no idea about SSE4.2). The "num_threads_per_core" is the number of scanner threads. From the point of my view, we only need to increase it when we see the IO threads are waiting, for example, RowBatchQueuePutWaitTime is high while RowBatchQueueGetWaitTime and NetworkSendTime are low. Please correct me if I'm wrong.
05-09-2019
09:49 PM
FYI, you can create a trigger in CM:

1) Go to CM -> Clusters -> Impala service -> Status;
2) Click the "Create Trigger" button;
3) Enter the name of the new trigger;
4) Enter "impala_catalogserver_jvm_heap_current_usage_bytes" in the Metric box;
5) Choose the METRIC CONDITIONS, ACTION, Metric Evaluation Window, etc.;
6) Click the "Create Trigger" button to save it.

Then the action will be triggered whenever the condition is met. For example, you can get an alert if the current JVM heap usage is higher than the threshold. I hope this helps you monitor the memory usage.