Does Impala cache query statistics?
Labels: Apache Impala, Cloudera Manager
Created on 04-16-2019 07:57 AM - edited 09-16-2022 07:18 AM
Hi Friends,
I am running Impala queries with query options and analyzing the resulting query attributes.
I have observed that a query takes longer the first time I run it than on the second and subsequent runs.
What is the reason for the difference in time?
Does this happen only when table statistics are not available, or does it happen all the time?
Is my observation right?
Created ‎04-17-2019 06:00 PM
METADATA" command evicts the table from the cache. Or until the catalog is
restarted.
In 5.16 and 6.1+ there are some non-default options that will evict
metadata after a particular timeout. At some point these will become the
defaults.
Table stats are collected and stored in the Hive Metastore when you run a
"COMPUTE STATS" command. They are then just part of the table metadata.
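To make the eviction behavior described above concrete, here is a minimal toy model in Python. It is only a sketch and not Impala's implementation: the class `ToyMetadataCache`, the loader `load_from_metastore`, the 60-second timeout, and the row count are all made up for illustration. It shows metadata (with stats as part of it) being loaded on first use, reused afterwards, and reloaded after an explicit invalidation or after an optional timeout.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ToyMetadataCache:
    """Toy model of a table-metadata cache. NOT Impala's implementation."""

    def __init__(
        self,
        loader: Callable[[str], dict],
        ttl_seconds: Optional[float] = None,  # None means "never time out"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}
        self.loads = 0  # how many times the slow loader was called

    def get(self, table: str) -> dict:
        now = self._clock()
        entry = self._entries.get(table)
        if entry is not None:
            loaded_at, meta = entry
            if self._ttl is None or now - loaded_at < self._ttl:
                return meta  # cache hit: no reload needed
        # Cache miss or timed-out entry: load again and remember it.
        self.loads += 1
        meta = self._loader(table)
        self._entries[table] = (now, meta)
        return meta

    def invalidate(self, table: str) -> None:
        """Evict one table, loosely analogous to invalidating its metadata."""
        self._entries.pop(table, None)


def load_from_metastore(table: str) -> dict:
    # Hypothetical loader; the stats value is placeholder data.
    return {"table": table, "stats": {"num_rows": 1000}}


def demo() -> list:
    fake_now = [0.0]  # injected clock so the demo is deterministic
    cache = ToyMetadataCache(
        load_from_metastore, ttl_seconds=60, clock=lambda: fake_now[0]
    )
    counts = []
    cache.get("sales")          # first reference: loads
    counts.append(cache.loads)
    cache.get("sales")          # second reference: served from cache
    counts.append(cache.loads)
    cache.invalidate("sales")   # explicit eviction
    cache.get("sales")          # loads again
    counts.append(cache.loads)
    fake_now[0] = 120.0         # jump past the timeout
    cache.get("sales")          # entry timed out: loads again
    counts.append(cache.loads)
    return counts


if __name__ == "__main__":
    print(demo())
```

The clock is passed in so the timeout path can be exercised without actually waiting. With `ttl_seconds=None` the toy cache only reloads after `invalidate`, which corresponds to the default behavior described in the answer.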
Created 04-16-2019 08:32 AM
An Impala query is usually faster on the second run than on the first attempt of the same query. This is because of the OS cache, which keeps the files in memory and reuses them. It is an OS-level feature and not specific to Impala.
For further performance improvement, Impala can also use "HDFS caching", which speeds up queries beyond what the OS cache provides.
Reference link: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html
Thanks
Jerry
Created 04-16-2019 04:30 PM
Impala caches all table metadata, so planning is generally faster once the table has been referenced by a previous query. You can look at the "Planner Timeline" in the Impala query profile to get a time breakdown of planning, including metadata loading.
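As a rough illustration of why the first run is slower, the Python sketch below times repeated calls to a stand-in function. Nothing here talks to Impala: `fake_query`, the 50 ms sleep, and the table name are assumptions chosen only to mimic a one-time metadata load followed by cached lookups. On a real cluster the "Planner Timeline" in the query profile is the place to see this breakdown.

```python
import time

# Stand-in for cached table metadata; empty until a table is first referenced.
_metadata_cache = {}


def fake_query(table):
    """Hypothetical query: pays a one-time cost the first time a table is used."""
    if table not in _metadata_cache:
        time.sleep(0.05)  # simulated metadata loading on first reference
        _metadata_cache[table] = {"loaded": True}
    return f"result for {table}"


def time_runs(fn, arg, runs=3):
    """Call fn(arg) several times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    for i, seconds in enumerate(time_runs(fake_query, "sales"), start=1):
        print(f"run {i}: {seconds * 1000:.1f} ms")
```

In this sketch only the first call pays the simulated loading cost, so the first duration is noticeably larger than the later ones.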
Created 04-17-2019 04:40 AM
Thanks Tim,
How long does Impala keep this metadata cached?
If statistics for the tables participating in a query are not available, will they be available after the first run?
If I run the query again after a long interval, will the metadata still be in the cache?
What if my cluster or Impala is restarted?
If statistics are not available, does Impala do some work on the first run to gather statistics for all participating tables and store them in the metastore or some other database?
