A Hive on Tez job goes through the following stages during its execution:
Query Submission
When a user submits a Hive query, whether through the Hive command-line interface (CLI), HiveServer2, or an application such as Hue, the query is sent to the Hive service for processing.
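For illustration, the stages below can be traced with a simple aggregation query; the table and columns are hypothetical and serve only as a running example.

```sql
-- Hypothetical running example used throughout this article; the table
-- web_logs and its columns are assumed to exist only for illustration.
SELECT status_code, COUNT(*) AS hits
FROM web_logs
GROUP BY status_code;
```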
Query Parsing and Compilation
The Hive service parses the submitted query to understand its structure and requirements, then compiles and optimizes it into an execution plan specifying the steps needed to run it. This execution plan includes details such as the sequence of Tez tasks required to perform the computation.
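The compiled plan can be inspected without running the query by using EXPLAIN; for the running example above, the output lists the Tez vertices (for example Map 1 and Reducer 2) and the edges between them.

```sql
-- EXPLAIN prints the compiled execution plan without executing the query.
-- For a Hive on Tez query it includes the vertices and their dependencies,
-- e.g. a line similar to "Reducer 2 <- Map 1 (SIMPLE_EDGE)".
EXPLAIN
SELECT status_code, COUNT(*) AS hits
FROM web_logs
GROUP BY status_code;
```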
Tez Session Initialization
Once the optimized execution plan is ready, Hive initializes a Tez session.
The Tez session provides a runtime environment for executing the query using Apache Tez, a framework optimized for processing large-scale data.
This initialization process involves setting up necessary configurations, loading required libraries, and establishing communication channels with the Tez runtime environment.
Why is the Tez session important?
A Tez session is needed to manage resources, maintain session state, optimize query execution, provide fault tolerance, and enable session-level configuration for running Tez jobs efficiently within Apache Hive.
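Because configuration is applied at the session level, properties set before the query runs shape how the Tez session and its containers are created; the values below are illustrative only, not tuning recommendations.

```sql
-- Session-level settings read when the Tez session is initialized;
-- all values are illustrative, not recommendations.
SET hive.execution.engine=tez;       -- run the query on Tez (the default in recent Hive versions)
SET hive.tez.container.size=4096;    -- container memory for Tez tasks, in MB
SET tez.queue.name=analytics;        -- YARN queue for the session (hypothetical queue name)
```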
Task Generation
Hive translates the compiled query execution plan into a series of Tez vertices and tasks. Each vertex represents a stage of the query execution, and tasks within vertices represent the actual computation to be performed.
Vertex: A vertex in a Tez application represents a computational stage in a directed acyclic graph (DAG) of data processing tasks. A vertex encapsulates one or more tasks that perform a particular operation on the data.
Map Vertex: A Map vertex typically corresponds to the map phase of data processing. It represents a set of tasks responsible for processing input data in parallel. Map tasks read data from input sources, apply transformations or filters, and produce intermediate key-value pairs as output.
Reduce Vertex: A Reduce vertex corresponds to the reduce phase of data processing. It represents a set of tasks responsible for aggregating and processing intermediate data generated by map tasks. Reduce tasks receive intermediate key-value pairs, perform aggregation or computation, and produce final output data.
Tasks: Tasks are units of work within a vertex, such as map tasks or reduce tasks.
Task Attempt: A Task Attempt refers to an individual attempt to execute a task within a Tez vertex.
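Failed task attempts are retried automatically; the settings below (shown as session-level overrides with illustrative values) control how many attempts are made before a task, and therefore the DAG, is failed.

```sql
-- Retry behaviour for task attempts; values are illustrative and the
-- defaults are usually sensible.
SET tez.am.task.max.failed.attempts=4;   -- attempts per task before the task is marked failed
SET tez.am.max.app.attempts=2;           -- restarts allowed for the Tez Application Master itself
```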
Tez DAG Creation:
The tasks and vertices generated by Hive are organized into a Directed Acyclic Graph (DAG), which represents the logical and physical execution plan of the query. The DAG defines the dependencies between tasks and vertices, ensuring that data flows correctly through the computation.
The DAG represents the data flow and computation logic of the entire Tez application.
Tez Application Submission:
Once the Tez session is initialized and the DAG is generated, the Tez application is submitted to the YARN ResourceManager, and the DAG is then submitted to the Tez session for execution.
Application Execution:
The YARN ResourceManager allocates resources (containers) to the Tez Application Master (AM) based on the application's resource requirements.
Upon receiving container allocations from the ResourceManager, the Tez AM launches container instances on the allocated nodes.
The AM coordinates the execution of tasks across the allocated containers, ensuring that they are executed efficiently and in the correct order.
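The resources requested for the Application Master can also be set per session; again, the values below are placeholders rather than recommendations.

```sql
-- Resource requests made to the YARN ResourceManager; illustrative values.
SET tez.am.resource.memory.mb=2048;   -- memory for the Tez Application Master container, in MB
SET tez.am.resource.cpu.vcores=1;     -- vcores for the Application Master container
```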
Task Execution:
Within each allocated container, Tez launches task executors to run the assigned tasks. These tasks perform the actual data processing and computation specified by the query.
During task execution, data movement occurs between tasks to transfer input and intermediate data.
Map Task: A map task is responsible for processing a portion of the input data in parallel. Map tasks are typically used to transform and filter input data into intermediate key-value pairs. Each map task processes a specific input split of the data, which is a contiguous portion of the input data stored in the Hadoop Distributed File System (HDFS) or other storage systems. Map tasks produce intermediate key-value pairs, where the keys are used to partition and sort the data for subsequent processing by reduce tasks.
Reduce Task: A reduce task is responsible for aggregating and processing intermediate key-value pairs generated by map tasks. Reduce tasks receive intermediate data grouped by keys, typically sorted and partitioned by the map tasks. Reduce tasks aggregate values associated with each key, performing operations such as summing, counting, averaging, or applying user-defined functions. The output of reduce tasks is typically the final result of the computation, which may be stored in a file or passed to subsequent stages of processing.
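The number of map tasks is driven by how input splits are grouped, and the number of reduce tasks by the estimated data volume per reducer; the settings below, with illustrative values, are the usual knobs.

```sql
-- Parallelism knobs for map and reduce tasks; values are illustrative.
SET tez.grouping.min-size=16777216;                 -- ~16 MB minimum per grouped input split
SET tez.grouping.max-size=1073741824;               -- ~1 GB maximum per grouped input split
SET hive.exec.reducers.bytes.per.reducer=268435456; -- ~256 MB of data per reduce task
```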
Query Completion and Result Retrieval:
As tasks complete their execution, they produce intermediate or final results, depending on the query.
The Tez AM monitors the progress of task execution and aggregates the results produced by individual tasks.
Once all tasks have completed successfully, the Tez application is considered complete. Resources are released, intermediate data is cleaned up, the final output of the application is stored in the desired destination, and the final result (if any) is returned to the user.
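When the query materializes its result instead of streaming rows back to the client, the final stage writes the output to that destination; a minimal sketch, reusing the hypothetical tables from the running example:

```sql
-- Write the aggregated result to a destination table instead of returning
-- rows to the client; status_summary is a hypothetical target table.
INSERT OVERWRITE TABLE status_summary
SELECT status_code, COUNT(*) AS hits
FROM web_logs
GROUP BY status_code;
```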