Member since: 02-09-2021 | Posts: 119 | Kudos Received: 2 | Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6691 | 07-27-2023 11:39 PM |
11-20-2024
11:29 PM
1 Kudo
@yangdkny, did the response help resolve your query? If it did, kindly mark the relevant reply as the solution, as this will help others find the answer more easily in the future.
11-07-2024
10:29 PM
2 Kudos
Thanks for the suggestion; the issue has been resolved. We had added a new DataNode and afterwards restarted the NameNode, ResourceManager, DataNode, and NodeManager, but not HiveServer2, so the configuration was not loaded properly in Hive. After restarting HiveServer2, it started working.
05-14-2024
01:28 AM
A Hive on Tez job goes through the following stages during its execution:
Query Submission
When a user submits a Hive query through the Hive command line interface (CLI), HiveServer2, or an application such as Hue, the query is sent to the Hive service for processing.
Query Parsing and Compilation
The Hive service parses the submitted query to understand its structure and requirements. It then compiles the query into an execution plan specifying the steps needed to execute it. This execution plan includes details such as the sequence of Tez tasks required to perform the computation.
Tez Session Initialization
Once the optimized execution plan is ready, Hive initializes a Tez session.
The Tez session provides a runtime environment for executing the query using Apache Tez, a framework optimized for processing large-scale data.
This initialization process involves setting up necessary configurations, loading required libraries, and establishing communication channels with the Tez runtime environment.
Why is the Tez session important?
A Tez session is needed to manage resources, maintain session state, optimize query execution, provide fault tolerance, and enable session-level configuration for running Tez jobs efficiently within Apache Hive.
Task Generation
Hive translates the compiled query execution plan into a series of Tez vertices and tasks. Each vertex represents a stage of the query execution, and tasks within vertices represent the actual computation to be performed.
Vertex: A Vertex in a Tez application represents a computational stage or step in a directed acyclic graph (DAG) of data processing tasks.
A vertex encapsulates one or more tasks that perform a particular operation on the data.
Each vertex typically corresponds to a specific data processing operation, such as a set of map tasks or a set of reduce tasks.
Vertices are connected to each other through directed edges, forming a Directed Acyclic Graph (DAG).
Map Vertex: A Map vertex typically corresponds to the map phase of data processing. It represents a set of tasks responsible for processing input data in parallel. Map tasks read data from input sources, apply transformations or filters, and produce intermediate key-value pairs as output.
Reduce Vertex: A Reduce vertex corresponds to the reduce phase of data processing. It represents a set of tasks responsible for aggregating and processing intermediate data generated by map tasks. Reduce tasks receive intermediate key-value pairs, perform aggregation or computation, and produce final output data.
Tasks: Tasks are units of work within a vertex, such as map tasks or reduce tasks.
Task Attempt: A Task Attempt refers to an individual attempt to execute a task within a Tez vertex.
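To make the vertex / task / task attempt relationship concrete, here is a minimal toy model in Python. It is only a sketch: the class names, the `flaky` workload, and the `max_attempts=4` value are illustrative assumptions, not the Tez API.

```python
from dataclasses import dataclass, field


@dataclass
class TaskAttempt:
    attempt_id: int
    succeeded: bool


@dataclass
class Task:
    name: str
    attempts: list = field(default_factory=list)

    def run(self, work, max_attempts=4):
        """Call work(attempt_id) until one attempt succeeds or attempts run out."""
        for attempt_id in range(max_attempts):
            try:
                work(attempt_id)
                ok = True
            except RuntimeError:
                ok = False
            # Every try is recorded as its own attempt, successful or not.
            self.attempts.append(TaskAttempt(attempt_id, ok))
            if ok:
                return True
        return False


@dataclass
class Vertex:
    name: str
    tasks: list = field(default_factory=list)


def flaky(attempt_id):
    # Hypothetical workload: only the first attempt fails.
    if attempt_id == 0:
        raise RuntimeError("simulated container failure")


map_vertex = Vertex("Map 1", [Task("Map 1, task 0")])
task = map_vertex.tasks[0]
task_succeeded = task.run(flaky)
attempt_outcomes = [attempt.succeeded for attempt in task.attempts]
print(task_succeeded, attempt_outcomes)  # True [False, True]
```

The point of the model is that one task can own several attempts: a failed attempt does not fail the task as long as a later attempt succeeds.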
Tez DAG Creation:
The tasks and vertices generated by Hive are organized into a Directed Acyclic Graph (DAG), which represents the logical and physical execution plan of the query. The DAG defines the dependencies between tasks and vertices, ensuring that data flows correctly through the computation.
In other words, the DAG represents the data flow and computation logic of the entire Tez application.
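As a rough illustration of how the edges of a DAG constrain execution order, the sketch below groups the vertices of a hypothetical four-vertex plan into "waves" that could run together. The vertex names and the `execution_waves` helper are assumptions made for this example; it uses Python's standard `graphlib` module (Python 3.9+), not Tez itself, and real Tez scheduling is more fine-grained than whole-vertex waves.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG for a join followed by an aggregation.
# Each key maps to the set of vertices it depends on.
dag = {
    "Map 1": set(),
    "Map 2": set(),
    "Reducer 3": {"Map 1", "Map 2"},
    "Reducer 4": {"Reducer 3"},
}


def execution_waves(dependencies):
    """Group vertices into waves whose dependencies are already finished."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(dag)
print(waves)  # [['Map 1', 'Map 2'], ['Reducer 3'], ['Reducer 4']]
```

The two map vertices have no dependencies, so they can run in parallel; each reducer has to wait for the vertices that feed it.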
Tez Application Submission:
Once the Tez session is initialized and the DAG is generated, the Tez application is submitted to the YARN ResourceManager, and the DAG is then submitted to the Tez session.
Application Execution:
The YARN ResourceManager allocates resources (containers) to the Tez Application Master (AM) based on the resources the application requests.
Upon receiving container allocations from the ResourceManager, the Tez AM launches container instances on the allocated nodes.
The AM coordinates the execution of tasks across the allocated containers, ensuring that they are executed efficiently and in the correct order.
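The coordination step can be pictured with a deliberately simplified simulation: a fixed pool of containers works through a queue of pending tasks in rounds. The `schedule` function and the task names are hypothetical; the real AM reacts to individual container and task events rather than running in lockstep rounds.

```python
from collections import deque


def schedule(tasks, num_containers):
    """Assign queued tasks to a fixed pool of containers, one round at a time."""
    if num_containers < 1:
        raise ValueError("at least one container is required")
    pending = deque(tasks)
    rounds = []
    while pending:
        assignment = {}
        for container_id in range(num_containers):
            if not pending:
                break
            # Each container takes the next pending task in this round.
            assignment[container_id] = pending.popleft()
        rounds.append(assignment)
    return rounds


rounds = schedule(["t0", "t1", "t2", "t3", "t4"], num_containers=2)
for number, assignment in enumerate(rounds):
    print(number, assignment)
```

With five tasks and two containers, the work finishes in three rounds, which is why the number of containers granted by YARN directly affects how long a vertex with many tasks takes.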
Task Execution:
Within each allocated container, Tez launches task executors to run the tasks. These tasks perform the actual data processing and computation specified by the query.
During task execution, data movement occurs between tasks to transfer input and intermediate data.
Map Task: A map task is responsible for processing a portion of the input data in parallel. Map tasks are typically used to transform and filter input data into intermediate key-value pairs. Each map task processes a specific input split of the data, which is a contiguous portion of the input data stored in the Hadoop Distributed File System (HDFS) or other storage systems. Map tasks produce intermediate key-value pairs, where the keys are used to partition and sort the data for subsequent processing by reduce tasks.
Reduce Task: A reduce task is responsible for aggregating and processing intermediate key-value pairs generated by map tasks. Reduce tasks receive intermediate data grouped by keys, typically sorted and partitioned by the map tasks. Reduce tasks aggregate values associated with each key, performing operations such as summing, counting, averaging, or applying user-defined functions. The output of reduce tasks is typically the final result of the computation, which may be stored in a file or passed to subsequent stages of processing.
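The map and reduce roles described above can be sketched in a few lines of plain Python. This is an in-memory illustration only: the `region,amount` input lines, the two splits, and the function names are made up for the example, and a single process stands in for what Tez distributes across containers. It mirrors the kind of work a query such as `SELECT region, SUM(amount) ... GROUP BY region` needs.

```python
from collections import defaultdict


def map_task(split):
    """Parse each 'region,amount' line and emit a (key, value) pair."""
    for line in split:
        region, amount = line.split(",")
        yield region, int(amount)


def shuffle(pairs):
    """Group intermediate values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_task(key, values):
    """Aggregate all values that share one key."""
    return key, sum(values)


# Two hypothetical input splits, each handled by its own map task.
splits = [["east,10", "west,5"], ["east,7", "west,1", "north,3"]]
intermediate = [pair for split in splits for pair in map_task(split)]
totals = dict(reduce_task(key, values) for key, values in shuffle(intermediate).items())
print(totals)  # {'east': 17, 'north': 3, 'west': 6}
```

Each map task only sees its own split, while each reduce call sees every value for its key, which is why the data movement between the two steps is needed.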
Query Completion and Result Retrieval:
As tasks complete their execution, they produce intermediate or final results, depending on the query.
The Tez AM monitors the progress of task execution and aggregates the results produced by individual tasks.
Once all tasks have completed successfully, the Tez application is considered complete.
Resources are released, intermediate data is cleaned up, the final output of the application is stored in the desired destination, and the final result (if any) is returned to the user.
11-22-2023
08:37 AM
@MR_KD As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post. Thanks.
07-31-2023
09:21 PM
Hi, sorry to keep you waiting. Yes, that solution resolved the issue, but I have one more problem. After the data was successfully inserted into the textfile table, I need to insert it from the textfile table into a parquet table, and that insert fails. Error message: "[FATAL] 10:25:08 vaproject.vmlog_0_4.VMLOG- tHiveRow_1 Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask". What is wrong or missing? I would appreciate your help. Thank you. Sincerely, Gideon Maruli IT Data Management - Bank Mayapada
07-24-2023
10:27 PM
@hanumanth, did the responses from @udeshmukh assist you in resolving your concerns? If so, could you kindly mark the most suitable response as the solution? This will be beneficial for other members who might come across a similar issue.
09-27-2022
06:25 AM
May I know whether the table was created from data that was exported in some other format, such as 'txt'? If so, note that starting from CDP 7.x, the default file format is parquet. So when the table is imported, it will be created in parquet format, while its original files will still be in 'txt' format.