
Does Impala cache query statistics?

Hi Friends,
I am running Impala queries with query options and analyzing the resulting query attributes.
I have observed that a query takes longer the first time I run it than on the second and subsequent runs.
What is the reason for this difference in time?
Does this happen only when table statistics are not available, or does it happen all the time?
Is my observation right?


Re: Does Impala cache query statistics?

Rising Star
Hi,

An Impala query usually runs faster the second time than on the first attempt of the same query. This is because of the OS cache, which keeps the files in memory and reuses them. It is an OS-level feature and not specific to Impala.

For further performance improvement, there is a concept of "HDFS caching", which Impala can use.

HDFS caching can further improve the speed of query results.

Reference Link below: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html
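
To check the first-run versus later-run difference yourself, something like the sketch below could help. It is only an illustration: `time_runs` and `fake_query` are made-up names, and `fake_query` is a stand-in that simulates a slow cold run followed by fast warm runs. In practice you would pass a function that submits your real query through whatever client you use.

```python
import time
from typing import Callable, List


def time_runs(run_query: Callable[[], object], runs: int = 3) -> List[float]:
    """Call run_query `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Hypothetical stand-in for a real query: slow on the first call only.
    state = {"warm": False}

    def fake_query() -> None:
        if not state["warm"]:
            time.sleep(0.05)  # simulated cold-cache penalty
            state["warm"] = True

    for i, seconds in enumerate(time_runs(fake_query), start=1):
        print(f"run {i}: {seconds * 1000:.1f} ms")
```

Wall-clock timing alone cannot tell you whether the gain comes from the OS cache, HDFS caching, or something else, so treat the numbers as a starting point.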

Thanks
Jerry

Re: Does Impala cache query statistics?

Master Collaborator

Impala caches all table metadata, so planning is generally faster once the table has been referenced by a previous query. You can look at the "Planner Timeline" in the Impala query profile to get a time breakdown of planning, including metadata loading.
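
If you save the profile as text, a small script can pull out the "Planner Timeline" entries so you can compare a first run with a later one. This is a rough sketch under assumptions: the `SAMPLE` text is invented for illustration (it is not real output), the line layout `- <event>: <elapsed> (<delta>)` is assumed, and the exact section and event names may differ between Impala versions. `planner_timeline` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Assumed layout of one timeline entry: "- <event>: <elapsed> (<delta>)"
EVENT = re.compile(
    r"^\s*-\s*(?P<name>.+?):\s*(?P<elapsed>\S+)\s*\((?P<delta>\S+)\)\s*$"
)

# Invented sample for illustration only; not copied from a real profile.
SAMPLE = """\
    Planner Timeline: 120.000ms
       - Metadata load started: 1.000ms (1.000ms)
       - Metadata load finished: 100.000ms (99.000ms)
       - Analysis finished: 105.000ms (5.000ms)
       - Planning finished: 120.000ms (15.000ms)
    Query Timeline: 300.000ms
"""


def planner_timeline(profile_text: str) -> List[Tuple[str, str, str]]:
    """Return (event, elapsed, delta) strings listed under 'Planner Timeline'."""
    events = []
    in_section = False
    for line in profile_text.splitlines():
        if not in_section:
            # Skip everything up to and including the section header.
            in_section = line.strip().startswith("Planner Timeline")
            continue
        match = EVENT.match(line)
        if match is None:
            break  # first line that is not a timeline entry ends the section
        events.append((match["name"], match["elapsed"], match["delta"]))
    return events


if __name__ == "__main__":
    for name, elapsed, delta in planner_timeline(SAMPLE):
        print(f"{name}: +{delta} (at {elapsed})")
```

The durations are kept as strings on purpose, since the profile may mix units and this sketch does not try to convert them.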

Re: Does Impala cache query statistics?

Thanks, Tim.
How long does Impala keep this metadata cached?
If statistics for the tables participating in a query are not available, will they be available after the first run?

If I run the query again after a long interval, will the metadata still be in the cache?
What if my cluster or Impala is restarted?

If statistics are not available, does Impala do something on the first run to gather statistics for all the participating tables and keep them in the metastore or somewhere else in a DB?

Re: Does Impala cache query statistics?

Master Collaborator
In its default configuration, Impala caches metadata until an "INVALIDATE
METADATA" command evicts the table from the cache or until the catalog is
restarted.

In 5.16 and 6.1+ there are some non-default options that will evict
metadata after a particular timeout. At some point these will become the
defaults.

Table stats are collected and stored in the Hive metastore when you run a
"COMPUTE STATS" command. They are then just part of the table metadata.
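
Since stats only appear after a "COMPUTE STATS" run, you may want to script that step. Below is a minimal sketch, assuming you already have a Python DB-API style cursor connected to Impala (how you obtain it depends on your client library and is not shown). The helper names `compute_and_show_stats` and `invalidate_table` are hypothetical; the statements they send are COMPUTE STATS, SHOW TABLE STATS, and INVALIDATE METADATA.

```python
import re
from typing import Any, List

# Accept only plain "table" or "db.table" names, since the name is
# interpolated directly into the statement text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def _check_name(table: str) -> None:
    if not _IDENT.match(table):
        raise ValueError(f"unexpected table name: {table!r}")


def compute_and_show_stats(cursor: Any, table: str) -> List[Any]:
    """Run COMPUTE STATS, then return the rows of SHOW TABLE STATS."""
    _check_name(table)
    cursor.execute(f"COMPUTE STATS {table}")
    cursor.execute(f"SHOW TABLE STATS {table}")
    return list(cursor.fetchall())


def invalidate_table(cursor: Any, table: str) -> None:
    """Evict one table's cached metadata; it is reloaded on next use."""
    _check_name(table)
    cursor.execute(f"INVALIDATE METADATA {table}")
```

Usage would look like `rows = compute_and_show_stats(cursor, "mydb.mytable")`, where `mydb.mytable` is a placeholder name.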