Member since: 02-09-2021
Posts: 119
Kudos Received: 1
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
  | 3949 | 07-27-2023 11:39 PM
11-20-2024
11:29 PM
1 Kudo
@yangdkny, did the response help resolve your query? If it did, kindly mark the relevant reply as the solution, as it will help others locate the answer more easily in the future.
11-07-2024
10:29 PM
2 Kudos
Thanks for the suggestion, the issue has been resolved. We had added a new DataNode and afterwards restarted the NameNode, ResourceManager, DataNode, and NodeManager, but not HiveServer, because of which the configuration was not loaded properly in Hive. After restarting HiveServer, it started working.
05-14-2024
01:28 AM
A Hive-on-Tez job goes through the following stages during its execution:
Query Submission:
When a user submits a Hive query, either through the Hive command-line interface (CLI), HiveServer2, or an application such as Hue, the query is sent to the Hive service for processing.
Query Parsing and Compilation:
The Hive service parses the submitted query to understand its structure and requirements. It then compiles the query into an execution plan specifying the steps needed to execute it, and applies logical and physical optimizations to produce the optimized plan used in the next stage. This execution plan includes details such as the sequence of Tez tasks required to perform the computation.
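For illustration, you can prefix a query with EXPLAIN to see the plan Hive compiles for it (the table and column names below are made up for the example):

EXPLAIN
SELECT dept, COUNT(*) AS emp_count
FROM employees
GROUP BY dept;

The output shows the Tez vertices and their edges along with the operator tree the compiler has chosen for the query.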
Tez Session Initialization:
Once the optimized execution plan is ready, Hive initializes a Tez session.
The Tez session provides a runtime environment for executing the query using Apache Tez, a framework optimized for processing large-scale data.
This initialization process involves setting up necessary configurations, loading required libraries, and establishing communication channels with the Tez runtime environment.
Why is the Tez session important?
A Tez session is needed to manage resources, maintain session state, optimize query execution, provide fault tolerance, and enable session-level configuration for running Tez jobs efficiently within Apache Hive.
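As a rough sketch, the execution engine and session pooling behaviour are controlled by settings such as the following (the values shown are examples, not recommendations):

set hive.execution.engine=tez;
set hive.server2.tez.initialize.default.sessions=true;
set hive.server2.tez.sessions.per.default.queue=2;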
Task Generation:
Hive translates the compiled query execution plan into a series of Tez vertices and tasks. Each vertex represents a stage of the query execution, and tasks within vertices represent the actual computation to be performed.
Vertex: A Vertex in a Tez application represents a computational stage or step in a directed acyclic graph (DAG) of data processing tasks.
Vertex encapsulates one or more tasks that perform a particular operation on the data.
Each vertex typically corresponds to a specific data processing operation, such as map tasks or reduce tasks.
Vertices are connected to each other through directed edges, forming a Directed Acyclic Graph (DAG).
Map Vertex: A Map vertex typically corresponds to the map phase of data processing. It represents a set of tasks responsible for processing input data in parallel. Map tasks read data from input sources, apply transformations or filters, and produce intermediate key-value pairs as output.
Reduce Vertex: A Reduce vertex corresponds to the reduce phase of data processing. It represents a set of tasks responsible for aggregating and processing intermediate data generated by map tasks. Reduce tasks receive intermediate key-value pairs, perform aggregation or computation, and produce final output data.
Tasks: Tasks are units of work within a vertex, such as map tasks or reduce tasks.
Task Attempt: A Task Attempt refers to an individual attempt to execute a task within a Tez vertex.
Tez DAG Creation:
The tasks and vertices generated by Hive are organized into a Directed Acyclic Graph (DAG), which represents the logical and physical execution plan of the query. The DAG defines the dependencies between tasks and vertices, ensuring that data flows correctly through the computation.
DAG represents the data flow and computation logic of the entire Tez application.
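As a purely illustrative sketch, the DAG for the simple aggregation query shown earlier might contain just two vertices:

Map 1 (scan + filter + partial aggregation)  -->  Reducer 2 (final aggregation + write output)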
Tez Application Submission:
Once the Tez session is initialized and the DAG is generated, the application is submitted to the YARN ResourceManager and then the DAG is submitted to the Tez session.
Application Execution:
The YARN ResourceManager allocates resources (containers) to the Tez Application Master (AM) based on the requirements.
Upon receiving container allocations from the ResourceManager, the Tez AM launches container instances on the allocated nodes.
The AM coordinates the execution of tasks across the allocated containers, ensuring that they are executed efficiently and in the correct order.
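Container sizing for the AM and the task containers is driven by configuration; for example (the values are examples only and should be tuned for your cluster):

set hive.tez.container.size=4096;
set tez.am.resource.memory.mb=2048;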
Task Execution:
Within each allocated container, Tez launches task executors to execute the tasks. These tasks perform the actual data processing and computation as specified by the query.
During task execution, data movement occurs between tasks to transfer input and intermediate data.
Map Task: A map task is responsible for processing a portion of the input data in parallel. Map tasks are typically used to transform and filter input data into intermediate key-value pairs. Each map task processes a specific input split of the data, which is a contiguous portion of the input data stored in the Hadoop Distributed File System (HDFS) or other storage systems. Map tasks produce intermediate key-value pairs, where the keys are used to partition and sort the data for subsequent processing by reduce tasks.
Reduce Task: A reduce task is responsible for aggregating and processing intermediate key-value pairs generated by map tasks. Reduce tasks receive intermediate data grouped by keys, typically sorted and partitioned by the map tasks. Reduce tasks aggregate values associated with each key, performing operations such as summing, counting, averaging, or applying user-defined functions. The output of reduce tasks is typically the final result of the computation, which may be stored in a file or passed to subsequent stages of processing.
Query Completion and Result Retrieval:
As tasks complete their execution, they produce intermediate or final results, depending on the query.
The Tez AM monitors the progress of task execution and aggregates the results produced by individual tasks.
Once all tasks have completed successfully, the Tez application is considered complete.
Resources are released, intermediate data is cleaned up, the final output of the application is stored in the desired destination, and the final result (if any) is returned to the user.
05-14-2024
01:21 AM
Optimizing Hive queries is crucial for achieving better performance and scalability in a data warehouse environment. Here are some tips and best practices for optimizing Hive queries:
Partitioning:
Partitioning your data can significantly improve query performance by reducing the amount of data scanned during query execution.
Partition your tables based on commonly filtered columns, such as date or category.
Use static partitioning for columns with a limited number of distinct values and dynamic partitioning for columns with high cardinality.
Consider using partitioned tables for time-series data to improve query performance for date-range queries.
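A minimal sketch of a partitioned table and a dynamic-partition insert, using made-up table and column names:

CREATE TABLE sales (id BIGINT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO sales PARTITION (sale_date)
SELECT id, amount, sale_date FROM sales_staging;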
Bucketing:
Bucketing distributes data into a fixed number of buckets based on the hash value of one or more columns.
Use bucketing to distribute data evenly across files and improve data locality.
Choose the number of buckets wisely based on the size of your data and the available resources.
Bucketing is particularly useful for optimizing join operations and aggregations.
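For example, a bucketed table (hypothetical names; 32 buckets is an arbitrary choice here):

CREATE TABLE users_bucketed (user_id BIGINT, name STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;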
Optimizing Join Operations:
Use map-side joins for small tables that can fit into memory to avoid shuffling data across the network.
Use broadcast joins for joining a small table with a large table, broadcasting the small table to all nodes to avoid data shuffling.
Avoid cross joins (cartesian products) as they can result in a significant increase in data volume and degrade performance.
Optimize join order and join conditions to minimize the amount of data shuffled during join operations.
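A minimal sketch, assuming a small departments dimension table and the automatic map-join feature (table and column names are hypothetical):

SET hive.auto.convert.join=true;
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

With hive.auto.convert.join enabled, Hive can convert the join to a map-side join when the small side fits under the configured size threshold, avoiding a shuffle of the large table.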
Column Pruning:
Avoid using SELECT * and explicitly specify only the columns needed for the query results.
Column pruning reduces the amount of data read from disk and improves query performance.
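For example, instead of SELECT * you would list only the needed columns (hypothetical names):

SELECT id, amount
FROM sales
WHERE sale_date = '2024-01-01';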
Optimizing File Formats:
Choose appropriate file formats such as ORC or Parquet, which are optimized for query performance and storage efficiency.
These file formats support compression and predicate pushdown, which can further improve query performance.
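A minimal sketch of creating an ORC table with compression (hypothetical names; SNAPPY is just one common choice):

CREATE TABLE events_orc (event_id BIGINT, event_type STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');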
Statistics Collection:
Collect table and column statistics using the ANALYZE TABLE command to help the query optimizer make better decisions.
Update statistics regularly, especially after data loading or significant data changes.
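For example (the table name is hypothetical):

ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;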
Tuning Hive Configuration:
Adjust Hive configuration parameters such as memory allocation, parallelism settings, and query execution parameters based on the characteristics of your workload and cluster resources.
Monitor query performance and resource utilization to identify bottlenecks and fine-tune configuration settings accordingly.
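A few commonly tuned settings, shown only as an illustration (the right values depend on your workload and cluster):

SET hive.vectorized.execution.enabled=true;
SET hive.exec.parallel=true;
SET hive.tez.container.size=4096;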
11-22-2023
08:37 AM
@MR_KD As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post. Thanks.
07-31-2023
09:21 PM
Hi, sorry to keep you waiting. Yes, that solution resolved the issue, but I have one more issue. After the data was successfully inserted into the textfile table, I need to insert the data into the Parquet table from the textfile table, but the insert into the Parquet table fails. Error message: "[FATAL] 10:25:08 vaproject.vmlog_0_4.VMLOG- tHiveRow_1 Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask". What is wrong or missing? I need your help, please. Thank you. Sincerely, Gideon Maruli, IT Data Management - Bank Mayapada
07-24-2023
10:27 PM
@hanumanth, did the responses from @udeshmukh assist you in resolving your concerns? If so, could you kindly mark the most suitable response as the solution? This will be beneficial for other members who might come across a similar issue.
09-27-2022
06:25 AM
May I know if the table was created from data that was exported in some other format, such as 'txt'? If so, note that starting from CDP 7.x, the default file format is Parquet. So when the table is imported, it will be created in Parquet format, but its original files will be in 'txt' format.
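If needed, you can check how the table was actually created, or declare the text format explicitly when creating it; for example (table and column names are hypothetical):

SHOW CREATE TABLE my_imported_table;

CREATE TABLE my_imported_table_txt (col1 STRING)
STORED AS TEXTFILE;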
04-06-2021
02:50 AM
Hello Michael, in your insert query, are you referring to the table as "databasename.tablename"? If yes, first try to execute "use databasename" and then execute the query.
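For example (the database and table names below are placeholders standing in for your own):

USE databasename;
INSERT INTO tablename VALUES (1, 'example');  -- your original insert statement goes here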