Does Impala cache query statistics?

Hi Friends,
I am running Impala queries with query options and analyzing the resulting query attributes.
I have observed that the first run of a query takes longer than the second and subsequent runs.
What is the reason for this difference in time?
Does it happen only when table statistics are not available, or does it happen all the time?
Is my observation right?

4 REPLIES

Expert Contributor
Hi,

An Impala query is usually faster on the second run than on the first attempt of the same query. This is because of the OS cache, which keeps the files in memory and reuses them. It is an OS-level feature and not specific to Impala.
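The difference between the first and later runs can be measured with a small timing loop. This is only a minimal sketch: `run_query` is a hypothetical callable that you supply (for example, a function that submits your SQL through whatever client you use), and `time_runs` is a name made up for this example.

```python
import time


def time_runs(run_query, repeats=3):
    """Call run_query `repeats` times and return each wall-clock duration in seconds.

    run_query is a hypothetical zero-argument callable supplied by the caller;
    it should execute the query and fetch all of its results.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations
```

If the first entry in the returned list is clearly larger than the rest, the query is paying a one-time cost on its first run, which matches the behavior described in this thread.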

For further performance improvement, Impala can make use of "HDFS caching", which helps speed up query results further.

Reference link: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html

Thanks
Jerry

Impala caches all table metadata, so planning is generally faster once the table has been referenced by a previous query. You can look at the "Planner Timeline" in the Impala query profile to get a time breakdown of planning, including metadata loading.
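If you have saved a query profile as text, a few lines of Python can pull out the planner section so the first and second runs are easy to compare. This is a rough sketch that assumes the section starts with a line beginning "Planner Timeline" followed by indented "- event: time" lines; the exact layout can differ between versions. The function name `planner_timeline` and the sample excerpt are invented for illustration and are not real output.

```python
def planner_timeline(profile_text, header="Planner Timeline"):
    """Return the header line plus the '- event: time' lines that follow it.

    Assumes the profile layout described above; returns an empty list
    when no line starting with `header` is found.
    """
    lines = profile_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(header):
            section = [line.strip()]
            for follow in lines[i + 1:]:
                stripped = follow.strip()
                if not stripped.startswith("- "):
                    break  # first line that is not an event ends the section
                section.append(stripped)
            return section
    return []


# Invented excerpt for illustration only; values are not from a real query.
SAMPLE_PROFILE = """\
    Planner Timeline: 4s900ms
       - Metadata load started: 10.000ms (10.000ms)
       - Metadata load finished: 4s700ms (4s690ms)
       - Analysis finished: 4s800ms (100.000ms)
    ImpalaServer:
       - ClientFetchWaitTimer: 1.000ms
"""

for entry in planner_timeline(SAMPLE_PROFILE):
    print(entry)
```

On a first run against a table that has not been referenced yet, the gap between the metadata load events is where you would expect most of the extra planning time to show up.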

Thanks Tim,
How long does Impala keep this metadata cached?
If statistics for the tables participating in a query are not available, will they be available after the first run?

If I run the query again after a long interval, will the metadata still be available in the cache?
What if my cluster or Impala is restarted?

If statistics are not available, does Impala do some work on the first run to gather statistics for all the participating tables and keep them in the metastore or somewhere else in a database?

Accepted Solution
In its default configuration, Impala caches metadata until an "INVALIDATE
METADATA" command evicts the table from the cache, or until the catalog is
restarted.

In 5.16 and 6.1+ there are some non-default options that will evict
metadata after a particular timeout. At some point these will become the
defaults.

Table stats are collected and stored in the Hive Metastore only when you run a
"COMPUTE STATS" command; running a query does not collect them. Once computed,
they are just part of the table metadata.
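As a rough illustration of the statements mentioned above, here is a sketch of how they might be issued from Python. It assumes you already have a DB-API style cursor connected to Impala (for example, one obtained from a client library such as impyla); the helper names `compute_and_show_stats` and `invalidate_table_metadata` are made up for this example.

```python
def _check_table_name(table):
    # Table names cannot be passed as bound parameters, so accept only
    # plain identifiers such as "tbl" or "db.tbl" before formatting them in.
    if not all(part.isidentifier() for part in table.split(".")):
        raise ValueError(f"unexpected table name: {table!r}")


def compute_and_show_stats(cursor, table):
    """Run COMPUTE STATS on the table, then return the rows of SHOW TABLE STATS."""
    _check_table_name(table)
    cursor.execute(f"COMPUTE STATS {table}")
    cursor.execute(f"SHOW TABLE STATS {table}")
    return cursor.fetchall()


def invalidate_table_metadata(cursor, table):
    """Evict one table's cached metadata; it is reloaded the next time the table is used."""
    _check_table_name(table)
    cursor.execute(f"INVALIDATE METADATA {table}")
```

After `invalidate_table_metadata`, the next query that references the table has to load its metadata again, so it would be expected to show the same first-run slowdown discussed in this thread.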