Member since
02-07-2019
2746
Posts
241
Kudos Received
31
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| 2332 | 08-21-2025 10:43 PM | |
| 2688 | 04-15-2025 10:34 PM | |
| 6971 | 10-28-2024 12:37 AM | |
| 2422 | 09-04-2024 07:38 AM | |
| 4416 | 06-10-2024 10:24 PM |
09-02-2025
02:50 AM
In the RDBMS world, query history has existed for many years. When I worked with MS SQL Server, I could easily check who executed which query on the server. In MS SQL, this feature is called Query Store. In Apache Hive and Apache Impala, the equivalent is called Query History.
This feature is primarily used to simplify performance troubleshooting. With SQL queries, you can compare execution plans, check statistics, and analyze many other aspects. It gives you the ability to understand how your data warehouse is being utilized, using data from the data warehouse itself. You can even create your own reports and dashboards on top of this data. It does not stop here! You also gain insights into your query plans, compare past and present performance, and create trends based on historical data.
One of the main motivations for introducing this feature in Apache Hive and Apache Impala was customer escalations in the past. There were many cases where customers complained about performance degradation after a data warehouse upgrade. Sometimes these issues were real, but in many cases, it was only a perception change without actual performance regression. The truth is always in the data. From now on, when customers use Hive or Impala, they can capture query usage and gain insight into cluster and query performance. Often, schema changes or outdated statistics affect performance. With Query History enabled, you can clearly see these changes.
The query history is stored in an Iceberg table. When enabled, this feature introduces some storage requirements and has a slight impact on performance. Every executed query is logged along with its statistics and performance metrics in a large Iceberg table.
Let's See How to Use It
These configs apply to Apache Hive.
Apache Impala uses a different configuration and slightly different strategies, but in this article, we will focus on Apache Hive Query History.
This is by default query history is disabled in Hive. This can be enable with following config:
hive.query.history.enabled
To check the current setting, run this command:
SET hive.query.history.enabled;
Result:
+-----------------------------------+ | set | +-----------------------------------+ | hive.query.history.enabled=false | +-----------------------------------+
To enable it, simply set:
SET hive.query.history.enabled = true;
If you also want to store the detailed query plan, enable:
hive.query.history.explain.plan.enabled
Batch size:
The implementation is designed to be efficient. It does not insert rows one by one. Instead, it uses a buffer and then flushes data periodically (by default, every hour) into the sys.query_history table. You can control this behavior using the following settings:
hive.query.history.batch.size
This defines the maximum number of executed queries to keep in memory before persisting them to the sys table. A small value (5-10) gives near real-time behavior, but it can create many small files. In that case, administrators need to compact the sys table for optimal performance.
For demo purposes, you can set the batch size to 0, which forces synchronous writes. This is not recommended for production.
The default value is 200.
Memory limit
There is another config where you can play with persist part.
hive.query.history.max.memory.bytes
This controls the maximum buffer size in bytes before flushing to the sys table.
Setting it to 0 disables the limit, which can be useful for demos but should not be used in production.
Flush Interval
hive.query.history.flush.interval.seconds
This defines the maximum time interval to flush records from memory to the Iceberg table, even if the batch size is not reached. The default value is 3600 seconds (1 hour), which balances file size and timely availability of data.
Setting it to 0 disables the interval-based flush and forces batch-size-based flushing only.
From an implementation perspective, this service runs inside HiveServer2, because all queries go through HS2. Queries are stored only after completion, regardless of success or failure. Records first go to the buffer and then to the sys table based on the settings. By default, buffering is enabled.
If you are using the hive docker image, you will need to add the following environment configuration to enable the Query History service inside the standalone container. (quick start)
--env SERVICE_OPTS="-Dhive.query.history.enabled=true"
What Data Is Stored?
Well… A lot! The table is wide and includes both basic information and performance/runtime metrics. Examples:
Any many many more!!!
The sys.query_history table is partitioned by cluster_id!
For the full list, check this Apache Hive upstream JIRA attachment.
Hive does not provide a built-in UI or dashboard for Query History. Some RDBMS tools do, but Hive only exposes raw data. You can build any visualization or reporting layer on top of it. Currently, there are no built-in recommendations or AI-driven insights, but this would be a valuable community contribution in the future. But!
The author (abstractdog — Laszlo Bodor) of this feature created some useful SQL that could help for the daily job.
Useful SQL Examples for sys table
Basic information about a specific query
select sql, tez_dag_id, plan, exec_summary from sys.query_history where query_id = 'YOUR_QUERYID';
Top five longest-running queries
select end_user, query_id, total_time_ms from sys.query_history order by total_time_ms desc limit 5;
Number of concurrent queries
SELECT a.query_id, COUNT(b.query_id) AS concurrent_queries FROM sys.query_history a LEFT JOIN sys.query_history b ON a.query_id != b.query_id AND a.start_time <= b.end_time AND b.start_time <= a.end_time GROUP BY a.query_id ORDER BY concurrent_queries DESC LIMIT 5;
These are just starting points. With Query History, you can analyze your cluster using SQL you already know, and you can dive deep into performance trends and usage patterns. When performance problems occur, you will have the data to trace when and how they started.
Query history in Apache Hive was originally published in Engineering@Cloudera on Medium, where people are continuing the conversation by highlighting and responding to this story.
... View more
Labels:
09-01-2025
05:59 AM
Hello dear support team, I’m experiencing the same issue as the original poster and others in the thread. Could you please assist me in updating the email address associated with my account? Thank you very much !
... View more
08-25-2025
09:18 PM
You did it! I changed my tracking strategy to "Tracking timestamps" and it now populated the "View State" window. Thank you very much for your assistance!
... View more
08-21-2025
10:43 PM
@tableau, Did the response assist in resolving your query? If it did, please mark the relevant reply as the solution, as it will help others locate the answer more easily in the future.
... View more
08-21-2025
09:28 AM
1 Kudo
Hello @MoJadallah, Sorry for no one answering so far. We do want to give you a good experience on your trial. I assume you requested your free trial from here: https://www.cloudera.com/products/cloudera-public-cloud-trial.html?internal_keyplay=ALL&internal_campaign=FY25-Q1-GLOBAL-CDP-5-Day-Trial&cid=FY25-Q1-GLOBAL-CDP-5-Day-Trial&internal_link=WWW-Nav-u01 When was that error hit? Just after the sign-up?
... View more
08-08-2025
03:54 PM
@willx @ayushi Hi! By any chance do you have some insights here? Thanks!
... View more
08-07-2025
08:57 AM
Hi Matt, Just to clarify one point—specifically in the context of NiFi REST API 2.0+—is there an endpoint where we can exchange an Azure AD access token for a NiFi access token, similar to a token exchange flow? Or, if such a direct token exchange is not supported (i.e., the token must always be obtained via browser redirection to the NiFi URL), could you please confirm that this is indeed the case? Thanks in advance!
... View more
07-29-2025
06:28 AM
@cerey2787, Welcome to our community! To help you get the best possible answer, I have tagged in our CDP experts @vaishaakb, who may be able to assist you further. Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
... View more
07-22-2025
02:57 AM
The error ERR_BAD_SSL_CLIENT_AUTH_CERT is a security-related bug that appears in web browsers, especially Google Chrome, due to a failure to interact with the SSL/TLS client authentication process.
This issue arises when SSL/TLS certificates are required on both client and server sides, but the client fails to present a valid certificate or the server can't verify it. Common causes include certificate misconfiguration, incorrect device date/time, software issues, or interference from browser extensions and antivirus programs. There are many different ways to fix the error which you need to understand properly in detail. Furthermore I found the helpful resource at:- https://cheapsslweb.com/resources/how-to-fix-the-err_bad_ssl_client_auth_cert-error.
I hope it helps!
... View more
07-16-2025
10:16 PM
CommunityOverCode Asia 2025 will take place from July 25-27, 2025, in Beijing. CommunityOverCode (formerly ApacheCon) is the official global conference series of the Apache Software Foundation (ASF). Through keynote, case studies, training, and hackathons, the conference showcases the latest developments and innovations across Apache projects.
The Apache ecosystem's core value lies in open-source collaboration and community-driven innovation. Apache project contributors, committers, and technical experts from around the world will gather at CommunityOverCode Asia 2025 to share their latest technical insights and experiences in AI, big data, storage, analytics, and more.
Cloudera's open-source contributors will present across 4 different venues with 7 technical sessions, comprehensively demonstrating how Apache technology stack applications are transforming modern data processing. From data storage to analytics, we'll share how open-source technologies solve complex data challenges.
Below are the technical highlights we're bringing to the conference:
Session 1: Apache NiFi 2.0 - A New Era of Data Flow Processing
Yan Liu Chinese Session 2025-07-27 14:30 GMT+8 (ROOM: Mtn BaiWang Hall)
As a contributor to Apache Hive and Apache Flink, Yan Liu will provide an in-depth introduction to Apache NiFi 2.0's new features. This presentation will focus on two core innovations: NiFi Cloud Native and NiFi Functions, exploring how these new capabilities enable more flexible and scalable data flow processing systems. Drawing from over 10 years of practical experience in big data, Yan Liu will share integration practices of NiFi with Apache Flink, Apache Hive, and Apache Iceberg in real-time data warehouse construction.
Session 2: Data Storage and Computing Infrastructure
Sammi Chen
Chinese Session 2025-07-26 15:45 GMT+8 (ROOM: Mtn WanShou Hall)
Sammi Chen will explore storage and computing infrastructure solutions within the Apache ecosystem. This presentation will analyze how to build distributed storage systems capable of supporting large-scale data processing, covering storage system architecture design, performance optimization, and scalability considerations. Through practical case studies, the session will demonstrate how to deploy and manage highly available storage computing clusters in enterprise environments, establishing solid infrastructure support for subsequent data analysis and processing.
Our data lake sessions comprehensively present the development and application of modern data lake technologies:
Session 3: Enterprise Data Lake Governance and Management
Bill Zhang
English Session 2025-07-25 14:00 GMT+8 (ROOM: WanChun Hall)
Bill Zhang will share governance frameworks and management best practices for enterprise data lakes. This presentation will explore security strategies, access control, and compliance requirements in data lake environments, demonstrating how to implement effective data governance policies in large-scale data environments.
Session 4: Apache Iceberg Metadata Optimization Techniques
Daniel Becker
English Session 2025-07-26 14:30 GMT+8 (ROOM: WanChun Hall)
Daniel Becker will analyze Apache Iceberg's metadata management mechanisms, focusing on how intelligent metadata processing improves query performance. The presentation will cover Iceberg's metadata table functionality, query optimization strategies, and table maintenance workflows, providing attendees with practical performance tuning guidelines.
Session 5: Apache Iceberg Table Lifecycle Management
Bill Zhang
English Session 2025-07-26 16:15 GMT+8 (ROOM: WanChun Hall)
Bill Zhang will comprehensively introduce Apache Iceberg table lifecycle management practices. This session will cover table creation, maintenance, optimization, and archival strategies, highlighting best practices for compression techniques, partition management, and schema evolution. Through these examples, the presentation will help audiences master efficient Iceberg table management methods in production environments.
Session 6: Advanced Data Lake Optimization and Operations
Attila Turóczy
English Session 2025-07-27 15:45 GMT+8 (ROOM: WanChun Hall)
Attila Turóczy will share advanced optimization techniques and operational experience for data lakes. The presentation will focus on cutting-edge performance tuning methods, monitoring system construction, and troubleshooting strategies, helping enterprises achieve excellence in data lake system operations.
Session 7: Apache Impala High-Performance Analytics Engine
Quanlong Huang
Chinese Session 2025-07-26 15:00 GMT+8 (ROOM: Mtn BaiWang Hall)
Quanlong Huang will introduce the latest technical advances of Apache Impala in OLAP and data analytics. This Chinese presentation will highlight Impala's query optimization techniques, distributed execution mechanisms, and integration practices with modern data lake architectures. The session will cover Impala's performance optimization strategies in large-scale data analysis scenarios, including query plan optimization, memory management, and concurrency control, providing technical guidance for enterprises to achieve high-performance real-time analytics.
Apache Technology Ecosystem Implementation
Our presentations form a complete Apache technology ecosystem, demonstrating end-to-end technical practices from data flow processing to high-performance analytics. This series of sessions not only showcases the unique advantages of individual Apache projects but, more importantly, reveals their collaborative effects in modern data architectures.
From a data flow processing perspective, Apache NiFi 2.0's Cloud Native and Functions capabilities provide enterprises with more flexible and scalable data ingestion and processing abilities. This cloud-native design philosophy enables data flow processing to better adapt to modern enterprises' distributed computing needs while simplifying complex data transformation logic implementation through functional programming models.
At the storage infrastructure, the Apache ecosystem provides powerful distributed storage solutions capable of supporting petabyte-scale data storage requirements. Modern enterprise data architectures need to handle structured, semi-structured, and unstructured data simultaneously. Apache's storage technology stack achieves seamless integration and efficient management of different data types through unified interfaces and API designs.
Data lake technology development represents an important trend in modern data architecture. Apache Iceberg, as the next-generation table format, addresses traditional data lake pain points in query performance, data consistency, and table management through its advanced metadata management mechanisms. From enterprise governance to technical implementation, from table lifecycle management to advanced optimization techniques, our data lake technology sessions comprehensively demonstrate how to build and operate enterprise-level modern data lake platforms.
Finally, Apache Impala serves as a high-performance analytics engine, providing powerful query and analysis capabilities for the entire data platform. Its distributed query execution, intelligent optimization, and deep integration with data lakes enable enterprises to achieve near real-time analytical responses on large-scale datasets, meeting modern business demands for immediate data insights.
These topics paint a picture of a complete data platform architecture based on Apache open-source technologies, showcasing how the open-source community provides solid technical foundations for enterprise digital transformation through technological innovation and collaboration.
We cordially invite you to join us on-site to explore the in-depth applications of these technical components and work together to build technology systems that adapt to future development.
Join us at CommunityOverCode Asia 2025!
For more information, please visit: https://asia.communityovercode.org/
... View more