We noticed that for a few of the Hive tables, during the maintenance window, just after the cluster restart, Impala does not show the data accurately, but it shows properly in Hive. Once we perform INVALIDATE METADATA and REFRESH on the underlying table, it starts working fine. It was okay to use the workaround, but this happens consistently every month after the cluster restart, so we need to find the root cause and a solution. Surprisingly, the issue is reported for only two tables, one of which is a view. These are external Hive tables in Parquet format. How can we investigate this issue further?
This issue sounds like Impala had stale metadata. But after Impala starts up, it doesn't have any cached metadata, so it will load metadata from HMS and the HDFS NameNode immediately (if load_catalog_in_background is true) or when a query runs. So the metadata should be up to date. Is it possible that there is some special workflow in your maintenance routine, so the tables are updated after Impala loads the metadata? You can check the timestamps of the files in the tables.
To return accurate query results, Impala needs to keep the metadata current for the databases and tables queried. Therefore, if some other entity modifies information used by Impala in the metastore, the information cached by Impala must be updated via INVALIDATE METADATA or REFRESH.
Difference between INVALIDATE METADATA and REFRESH
INVALIDATE METADATA [[db_name.]table_name]
You can run it in Hue or impala-shell:
INVALIDATE METADATA product.customer
By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for that one table is flushed. Even for a single table, INVALIDATE METADATA is more expensive than REFRESH, so prefer REFRESH in the common case where you add new data files to an existing table.
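As an illustration of when each command applies (the product.customer table name is just carried over from the earlier example, and the partition column is hypothetical):

```sql
-- New data files were added to an existing table: cheap, per-table reload
REFRESH product.customer;

-- Only one partition received new files: refresh just that partition
REFRESH product.customer PARTITION (year=2019);

-- Table was created or altered outside Impala (e.g. in Hive): full reload
INVALIDATE METADATA product.customer;
```

REFRESH reloads the file and block metadata for a table Impala already knows about, while INVALIDATE METADATA discards the cached metadata entirely and reloads it lazily on the next access.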
Thanks for your reply, but during maintenance windows only activities like CDH patching or RHEL patching are done, and no scripts run on the application tables. So, as you suggested, after the restart Impala should not have any cached metadata. However, the issue happens consistently every month after the restart. Any suggestions on what to check and investigate the next time the cluster is restarted and we notice the issue?
It makes sense that the problem only crops up after the maintenance "reboot" of the Metastore host. Once the server is rebooted, the metadata is purged from memory, which explains the slowness of queries after a cluster restart.
Automatic Invalidation/Refresh of Metadata
Automatic invalidate/refresh of metadata is now an available option in CDP 7.2.10. When it is enabled, the Catalog Server polls Hive Metastore (HMS) notification events at a configurable interval and automatically applies the changes to the Impala catalog.
The Impala Catalog Server polls and processes the following changes:
Invalidates the tables when it receives the ALTER TABLE event.
Refreshes the partition when it receives the ALTER, ADD, or DROP partition events.
Adds the tables or databases when it receives the CREATE TABLE or CREATE DATABASE events.
Removes the tables from catalogd when it receives the DROP TABLE or DROP DATABASE events.
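As a sketch of how this is typically turned on (the flag name comes from the Impala documentation; the interval value here is only an example), you set an event polling interval on the catalogd and impalad startup flags, for example through the Cloudera Manager advanced configuration snippet:

```
# catalogd / impalad command-line flag
# (example interval of 2 seconds; 0 disables event polling)
--hms_event_polling_interval_s=2
```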
The HMS stores metadata for Hive tables (schema, permissions, location, and partitions) in a relational database, and provides clients access to this information through the metastore service API.
The Hive Metastore is the Hive component that stores the system catalog: metadata about Hive columns, table definitions, and partitions.
Impala uses the Hive metastore to read data created in Hive, so the same data can be read and queried using Impala. All you need is to REFRESH the table or run INVALIDATE METADATA in Impala to pick up the data. Hive and Impala are two different query engines.
Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking metadata about schema objects such as tables and columns.
Hive utilizes execution engines (such as Tez, Hive on Spark, and LLAP) to improve query performance without low-level tuning. Leveraging parallel execution whenever sequential operations are not needed is also wise; the amount of parallelism your system can achieve depends on the resources available and the overall data structure. Proper Hive tuning lets you manipulate as little data as possible. One way to do this is through partitioning, where "keys" map to subdirectories into which your data is segregated.
Impala uses the Hive metastore and can query Hive tables directly. Unlike Hive, Impala does not translate queries into MapReduce jobs; it executes them natively using its daemons running on the data nodes to access the files on HDFS directly.
Table metadata is stored in the Hive Metastore, which is backed by an RDBMS such as MySQL, Oracle, MSSQL, or MariaDB. Hive and Impala work with the same data tables in HDFS and the same metadata in the Metastore.
In short, metadata for tables created in Hive is stored in the Hive metastore database.
Hello @Shelton, thanks for providing the informative details on Hive and Impala, but that's not helping with the question I have. I understand that when a cluster restart happens, Impala queries run slowly until the catalog is fully loaded.
My question is not about slow-running Impala queries after a restart. The question is very different, and I will try to explain it again. We run a very big CDH cluster with numerous users and applications on it. We have a monthly maintenance schedule where patches (RHEL, CDH, etc.) are applied to all the hosts and then the system is restarted. It is done in off-peak hours (late night through early morning). The next morning, users of "certain" tables start complaining that they are missing data in a table, because they are "Impala" users. They can see some partitions, but some are missing. The table is partitioned by year, so, for example, they can see 2021, 2020, and 2017, but 2018 and 2019 are missing.
To troubleshoot, we go to a beeline prompt and run the same query in Hive, and we can see everything. We go back to Impala, run INVALIDATE METADATA and REFRESH, and it starts working for the users who query through Impala too. They think I am a magician who can do some magic to bring back their missing data 🙂
This has been happening for the last 6 months, and rather than becoming a magician at every cluster restart, I need to find the root cause of why this happens only for certain tables. When a user queries a table whose metadata has not been loaded yet, I expect Impala to take its time on the first hit, load the metadata, and then show everything accurately. Showing inaccurate or incomplete results was not part of the deal with Impala.
If anyone else has faced a similar issue or knows what is going on here, please chime in.
I am sorry but you will have to continue being a magician 🙂
If you don't want to, then you have to teach your users the secret sauce or magic wand.
We too face the same problem with multiple Spark/Impala/PySpark users. I have made them add an INVALIDATE METADATA (Impala) or REFRESH (Spark) command at the start of their queries, and that works perfectly.
Otherwise, automatic invalidate/refresh of metadata can be enabled and is available in CDP 7.2.10. As long as Impala depends on HMS, that issue will exist 🙂
In my opinion, you can start by checking the cached file metadata in Impala. For example, you can run "show table stats" to find the number of files and the total file size under the table. You don't need to run "compute stats" in advance, as Impala gets this information from the NameNode when it loads the table's metadata. Then you can use "hdfs dfs -ls -R" to list all the files under the table directory. By comparing the output from Impala and HDFS, you can confirm whether the file metadata in Impala is stale, or whether some partition is missing in Impala. The findings in this step will determine what you should investigate next.
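As a minimal sketch of that comparison step (pure Python; the paths are made up, and in practice the two input lists would come from the file listing Impala reports for the table and from "hdfs dfs -ls -R" on the table directory):

```python
# Compare the file paths Impala has cached for a table against the files
# actually present in HDFS, to spot stale or missing metadata.
def find_stale_entries(impala_files, hdfs_files):
    """Return (missing_from_impala, gone_from_hdfs) as sorted lists."""
    impala_set = set(impala_files)
    hdfs_set = set(hdfs_files)
    return (sorted(hdfs_set - impala_set),   # files Impala doesn't know about
            sorted(impala_set - hdfs_set))   # files Impala caches but HDFS lacks

# Hypothetical sample data for a year-partitioned table:
impala_files = [
    "/warehouse/product/customer/year=2017/part-00000.parquet",
    "/warehouse/product/customer/year=2020/part-00000.parquet",
    "/warehouse/product/customer/year=2021/part-00000.parquet",
]
hdfs_files = impala_files + [
    "/warehouse/product/customer/year=2018/part-00000.parquet",
    "/warehouse/product/customer/year=2019/part-00000.parquet",
]

missing, stale = find_stale_entries(impala_files, hdfs_files)
print(missing)  # the 2018/2019 partition files Impala's cache is missing
print(stale)    # empty: Impala caches nothing that HDFS lacks
```

A non-empty "missing" list would confirm the stale-cache theory for the affected partitions.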
@npr20202, Has any of the replies helped resolve your issue? If so, kindly mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
This time we ran show table stats and show column stats on the table before the issue (just after the restart and before running invalidate metadata) and again afterwards, and we did not notice any difference in the output, but the problem reoccurred this time as well. To recap the sequence of events:
(1) The admin team does the patching and restarts the cluster.
(2) The application team runs a distinct-year query and sees that out of 31 rows only 27 are present; 4 of the partitions are missing.
(3) As the support team, I run show table stats, which shows all 31 partitions, and the number of rows and the size per partition are correct.
(4) We run invalidate metadata and refresh.
(5) The support team runs show table stats again, with the same result.
(6) The app team runs their query again and can see all 31 partitions.
We will work with the vendor to get further help.