Created on 09-02-2025 02:50 AM - edited 12-18-2025 08:23 AM
In the RDBMS world, query history has existed for many years. When I worked with MS SQL Server, I could easily check who executed which query on the server. In MS SQL, this feature is called Query Store. In Apache Hive and Apache Impala, the equivalent is called Query History.
This feature is primarily used to simplify performance troubleshooting. With SQL queries, you can compare execution plans, check statistics, and analyze many other aspects. It gives you the ability to understand how your data warehouse is being utilized, using data from the data warehouse itself. You can even create your own reports and dashboards on top of this data. It does not stop here! You also gain insights into your query plans, compare past and present performance, and create trends based on historical data.
One of the main motivations for introducing this feature in Apache Hive and Apache Impala was customer escalations in the past. There were many cases where customers complained about performance degradation after a data warehouse upgrade. Sometimes these issues were real, but in many cases, it was only a perception change without actual performance regression. The truth is always in the data. From now on, when customers use Hive or Impala, they can capture query usage and gain insight into cluster and query performance. Often, schema changes or outdated statistics affect performance. With Query History enabled, you can clearly see these changes.
These configs apply to Apache Hive.
Apache Impala uses a different configuration and slightly different strategies, but in this article, we will focus on Apache Hive Query History.
This is by default query history is disabled in Hive. This can be enable with following config:
hive.query.history.enabled
To check the current setting, run this command:
SET hive.query.history.enabled;
Result:
+-----------------------------------+
| set |
+-----------------------------------+
| hive.query.history.enabled=false |
+-----------------------------------+
To enable it, simply set:
SET hive.query.history.enabled = true;
If you also want to store the detailed query plan, enable:
hive.query.history.explain.plan.enabled
The implementation is designed to be efficient. It does not insert rows one by one. Instead, it uses a buffer and then flushes data periodically (by default, every hour) into the sys.query_history table. You can control this behavior using the following settings:
hive.query.history.batch.size
This defines the maximum number of executed queries to keep in memory before persisting them to the sys table. A small value (5-10) gives near real-time behavior, but it can create many small files. In that case, administrators need to compact the sys table for optimal performance.
For demo purposes, you can set the batch size to 0, which forces synchronous writes. This is not recommended for production.
The default value is 200.
There is another config where you can play with persist part.
hive.query.history.max.memory.bytes
This controls the maximum buffer size in bytes before flushing to the sys table.
Setting it to 0 disables the limit, which can be useful for demos but should not be used in production.
hive.query.history.flush.interval.seconds
This defines the maximum time interval to flush records from memory to the Iceberg table, even if the batch size is not reached. The default value is 3600 seconds (1 hour), which balances file size and timely availability of data.
Setting it to 0 disables the interval-based flush and forces batch-size-based flushing only.
From an implementation perspective, this service runs inside HiveServer2, because all queries go through HS2. Queries are stored only after completion, regardless of success or failure. Records first go to the buffer and then to the sys table based on the settings. By default, buffering is enabled.
If you are using the hive docker image, you will need to add the following environment configuration to enable the Query History service inside the standalone container. (quick start)
--env SERVICE_OPTS="-Dhive.query.history.enabled=true"
Well… A lot! The table is wide and includes both basic information and performance/runtime metrics. Examples:
Any many many more!!!
The sys.query_history table is partitioned by cluster_id!
For the full list, check this Apache Hive upstream JIRA attachment.
Hive does not provide a built-in UI or dashboard for Query History. Some RDBMS tools do, but Hive only exposes raw data. You can build any visualization or reporting layer on top of it. Currently, there are no built-in recommendations or AI-driven insights, but this would be a valuable community contribution in the future. But!
The author (abstractdog — Laszlo Bodor) of this feature created some useful SQL that could help for the daily job.
Basic information about a specific query
select sql, tez_dag_id, plan, exec_summary from sys.query_history where query_id = 'YOUR_QUERYID';
Top five longest-running queries
select end_user, query_id, total_time_ms from sys.query_history order by total_time_ms desc limit 5;
Number of concurrent queries
SELECT a.query_id, COUNT(b.query_id) AS concurrent_queries
FROM sys.query_history a
LEFT JOIN sys.query_history b
ON
a.query_id != b.query_id
AND a.start_time <= b.end_time
AND b.start_time <= a.end_time
GROUP BY a.query_id
ORDER BY concurrent_queries DESC
LIMIT 5;
These are just starting points. With Query History, you can analyze your cluster using SQL you already know, and you can dive deep into performance trends and usage patterns. When performance problems occur, you will have the data to trace when and how they started.
Query history in Apache Hive was originally published in Engineering@Cloudera on Medium, where people are continuing the conversation by highlighting and responding to this story.