11-13-2024
05:40 AM
@yangdkny Hive's ALTER TABLE DROP PARTITION statement doesn't accept DATE_ADD or similar functions inside the partition specification. Hive expects a static date value (e.g., 'YYYY-MM-DD') there, not a function call. As an alternative, you can create an HQL script that calculates the date using DATE_ADD, stores it in a variable, and then uses that variable in the ALTER TABLE statement.
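One way to sketch the same idea outside Hive is to compute the date in a small wrapper and splice it into the statement as a literal. This is a minimal Python sketch, not the script from the original reply; the table name `sales`, the partition column `dt`, and the seven-day offset are hypothetical.

```python
from datetime import date, timedelta


def drop_partition_statement(table: str, column: str, today: date, offset_days: int) -> str:
    """Render an ALTER TABLE ... DROP PARTITION statement with a literal date.

    The arithmetic that DATE_ADD would do happens here, because the
    partition specification only accepts a constant value.
    """
    target = today + timedelta(days=offset_days)
    return (
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION "
        f"({column}='{target.isoformat()}');"
    )


if __name__ == "__main__":
    # Hypothetical table and column; -7 plays the role of DATE_ADD(<today>, -7).
    print(drop_partition_statement("sales", "dt", date(2024, 11, 13), -7))
    # prints: ALTER TABLE sales DROP IF EXISTS PARTITION (dt='2024-11-06');
```

The rendered statement can be written to an .hql file or passed to beeline with -e. Alternatively, pass only the computed date with beeline's --hivevar option and reference it in the script as ${hivevar:name}.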
11-07-2024
09:56 PM
1 Kudo
@rsurti The ApplicationMaster is trying to connect to the ResourceManager at 0.0.0.0 (the wildcard address, i.e. localhost / any interface on the same host), and it is failing because it cannot reach the RM there:
2024-10-31 15:57:52,837 [INFO] [ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerManager] |ipc.Client|: Retrying connect to server: 0.0.0.0/0.0.0.0:8030.
This suggests a misconfiguration: the YARN config files are missing, or do not have the proper contents, on those hosts. Have you run CM > YARN > Actions > Deploy Client Configuration? If not, could you try that? @VidyaSargur We might need YARN experts on this.
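To check whether a host has a usable yarn-site.xml, something like the following Python sketch can help. It assumes a non-HA ResourceManager and relies on the stock YARN defaults (yarn.resourcemanager.hostname defaults to 0.0.0.0, and yarn.resourcemanager.scheduler.address defaults to that hostname on port 8030). The sample XML and the host name rm-host.example.com are placeholders, and the sketch does not expand ${...} variables inside values.

```python
import xml.etree.ElementTree as ET

# Hypothetical yarn-site.xml content; "rm-host.example.com" is a placeholder.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>
  </property>
</configuration>"""


def load_properties(xml_text: str) -> dict:
    """Parse Hadoop-style <property><name/><value/></property> entries."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def scheduler_address(props: dict) -> str:
    """Resolve the address the AM uses to reach the RM scheduler."""
    host = props.get("yarn.resourcemanager.hostname", "0.0.0.0")
    return props.get("yarn.resourcemanager.scheduler.address", f"{host}:8030")


def looks_misconfigured(props: dict) -> bool:
    """True when the scheduler address still points at the wildcard host."""
    return scheduler_address(props).startswith("0.0.0.0:")


if __name__ == "__main__":
    print(scheduler_address(load_properties(SAMPLE_YARN_SITE)))
    print(looks_misconfigured({}))  # an empty config falls back to 0.0.0.0:8030
```

An empty or missing configuration resolves to 0.0.0.0:8030, which matches the address in the log line above.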
05-14-2024
01:28 AM
A Hive on Tez job goes through the following stages during its execution:
Query Submission
When a user submits a Hive query, whether through the Hive command line interface (CLI), HiveServer2, or an application such as Hue, the query is sent to the Hive service for processing.
Query Parsing and Compilation
The Hive service parses the submitted query to understand its structure and requirements. It then compiles the query into an execution plan specifying the steps needed to execute it. This execution plan includes details such as the sequence of Tez tasks required to perform the computation.
Tez Session Initialization
Once the optimized execution plan is ready, Hive initializes a Tez session.
The Tez session provides a runtime environment for executing the query using Apache Tez, a framework optimized for processing large-scale data.
This initialization process involves setting up necessary configurations, loading required libraries, and establishing communication channels with the Tez runtime environment.
Why is the Tez session important?
A Tez session is needed to manage resources, maintain session state, optimize query execution, provide fault tolerance, and enable session-level configuration for running Tez jobs efficiently within Apache Hive.
Task Generation
Hive translates the compiled query execution plan into a series of Tez vertices and tasks. Each vertex represents a stage of the query execution, and tasks within vertices represent the actual computation to be performed.
Vertex: A Vertex in a Tez application represents a computational stage or step in a directed acyclic graph (DAG) of data processing tasks.
A vertex encapsulates one or more tasks that perform a particular operation on the data.
Each vertex typically corresponds to a specific data processing operation, such as a map or reduce operation.
Vertices are connected to each other through directed edges, forming a Directed Acyclic Graph (DAG).
Map Vertex: A Map vertex typically corresponds to the map phase of data processing. It represents a set of tasks responsible for processing input data in parallel. Map tasks read data from input sources, apply transformations or filters, and produce intermediate key-value pairs as output.
Reduce Vertex: A Reduce vertex corresponds to the reduce phase of data processing. It represents a set of tasks responsible for aggregating and processing intermediate data generated by map tasks. Reduce tasks receive intermediate key-value pairs, perform aggregation or computation, and produce final output data.
Tasks: Tasks are units of work within a vertex, such as map tasks or reduce tasks.
Task Attempt: A Task Attempt refers to an individual attempt to execute a task within a Tez vertex.
Tez DAG Creation:
The tasks and vertices generated by Hive are organized into a Directed Acyclic Graph (DAG), which represents the logical and physical execution plan of the query. The DAG defines the dependencies between tasks and vertices, ensuring that data flows correctly through the computation.
The DAG represents the data flow and computation logic of the entire Tez application.
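The relationship between vertices, tasks, and the DAG can be sketched in a few lines of Python. This is a simplified model, not Tez's API: the vertex names, the edges, and the task counts are hypothetical, and real Tez may start a downstream vertex before its upstream vertices have fully finished.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG for a query that joins two tables and then aggregates.
# Each key is a vertex; its set lists the upstream vertices it reads from.
dag = {
    "Map 1": set(),                   # scans the first table
    "Map 2": set(),                   # scans the second table
    "Reducer 3": {"Map 1", "Map 2"},  # joins the two inputs
    "Reducer 4": {"Reducer 3"},       # aggregates the join output
}

# Hypothetical task counts: one vertex fans out into several parallel tasks.
tasks_per_vertex = {"Map 1": 4, "Map 2": 2, "Reducer 3": 3, "Reducer 4": 1}


def execution_waves(graph: dict) -> list:
    """Group vertices into waves whose dependencies are all satisfied."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(dag), start=1):
        total = sum(tasks_per_vertex[vertex] for vertex in wave)
        print(f"wave {number}: {wave} ({total} tasks)")
```

The two map vertices have no dependencies and can run in parallel, while each reducer waits for the vertices it reads from.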
Tez Application Submission:
Once the Tez session is initialized and the DAG is generated, the application is submitted to the YARN ResourceManager and then the DAG is submitted to the Tez session.
Application Execution:
The YARN ResourceManager allocates resources (containers) to the Tez Application Master (AM) based on the resources the AM requests for the DAG.
Upon receiving container allocations from the ResourceManager, the Tez AM launches container instances on the allocated nodes.
The AM coordinates the execution of tasks across the allocated containers, ensuring that they are executed efficiently and in the correct order.
Task Execution:
Within each allocated container, Tez launches a task executor to run the tasks. These tasks perform the actual data processing and computation as specified by the query.
During task execution, data movement occurs between tasks to transfer input and intermediate data.
Map Task: A map task is responsible for processing a portion of the input data in parallel. Map tasks are typically used to transform and filter input data into intermediate key-value pairs. Each map task processes a specific input split of the data, which is a contiguous portion of the input data stored in the Hadoop Distributed File System (HDFS) or other storage systems. Map tasks produce intermediate key-value pairs, where the keys are used to partition and sort the data for subsequent processing by reduce tasks.
Reduce Task: A reduce task is responsible for aggregating and processing intermediate key-value pairs generated by map tasks. Reduce tasks receive intermediate data grouped by keys, typically sorted and partitioned by the map tasks. Reduce tasks aggregate values associated with each key, performing operations such as summing, counting, averaging, or applying user-defined functions. The output of reduce tasks is typically the final result of the computation, which may be stored in a file or passed to subsequent stages of processing.
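The division of labour between map tasks, the data movement between them, and reduce tasks can be illustrated with a small word count in plain Python. This is only a sketch of the pattern, not how Tez implements it; the crc32 partitioner and the default of two reducers are assumptions chosen for the example.

```python
import zlib
from collections import defaultdict


def map_task(split: list) -> list:
    """Turn one input split (lines of text) into (key, value) pairs."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def partition(key: str, num_reducers: int) -> int:
    """Pick the reducer for a key; crc32 keeps the choice deterministic."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs: list, num_reducers: int) -> list:
    """Route every pair to its reducer and group the values by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def reduce_task(grouped: dict) -> dict:
    """Aggregate the values of each key, visiting keys in sorted order."""
    return {key: sum(grouped[key]) for key in sorted(grouped)}


def run(splits: list, num_reducers: int = 2) -> dict:
    """Run one map task per split, then one reduce task per partition."""
    map_outputs = [map_task(split) for split in splits]
    result = {}
    for grouped in shuffle(map_outputs, num_reducers):
        result.update(reduce_task(grouped))
    return result


if __name__ == "__main__":
    print(run([["a b a"], ["b c"]]))
```

Because every occurrence of a key is routed to the same reducer, each reducer can aggregate its keys without seeing the rest of the data.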
Query Completion and Result Retrieval:
As tasks complete their execution, they produce intermediate or final results, depending on the query.
The Tez AM monitors the progress of task execution and aggregates the results produced by individual tasks.
Once all tasks have completed successfully, the Tez application is considered complete.
Resources are released, intermediate data is cleaned up, the final output of the application is stored in the desired destination, and the final result (if any) is returned to the user.
05-14-2024
01:21 AM
Optimizing Hive queries is crucial for achieving better performance and scalability in a data warehouse environment. Here are some tips and best practices for optimizing Hive queries:
Partitioning:
Partitioning your data can significantly improve query performance by reducing the amount of data scanned during query execution.
Partition your tables based on commonly filtered columns, such as date or category.
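Use static partitioning when the partition values are few and known in advance, and dynamic partitioning when they are numerous or only known at load time. Avoid partitioning on very high-cardinality columns, which produces a large number of small partitions.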
Consider using partitioned tables for time-series data to improve query performance for date-range queries.
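The effect of partitioning on a date-range query can be sketched as follows. The directory names and file lists are hypothetical; the point is only that a filter on the partition column lets the engine skip whole directories.

```python
from datetime import date

# Hypothetical partition directories of a table partitioned by dt.
partitions = {
    "dt=2024-05-12": ["part-00000", "part-00001"],
    "dt=2024-05-13": ["part-00000"],
    "dt=2024-05-14": ["part-00000", "part-00001", "part-00002"],
}


def prune(parts: dict, start: date, end: date) -> list:
    """Return only the partition directories whose dt lies in [start, end]."""
    selected = []
    for name in sorted(parts):
        value = date.fromisoformat(name.split("=", 1)[1])
        if start <= value <= end:
            selected.append(name)
    return selected


def files_to_scan(parts: dict, selected: list) -> int:
    """Count the files that remain after pruning."""
    return sum(len(parts[name]) for name in selected)


if __name__ == "__main__":
    kept = prune(partitions, date(2024, 5, 13), date(2024, 5, 14))
    print(kept, files_to_scan(partitions, kept))
```

In this example the query reads four of the six files, because the dt=2024-05-12 directory is never opened.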
Bucketing:
Bucketing distributes data into a fixed number of buckets based on the hash value of one or more columns.
Use bucketing to distribute data evenly across files and improve data locality.
Choose the number of buckets wisely based on the size of your data and the available resources.
Bucketing is particularly useful for optimizing join operations and aggregations.
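A minimal sketch of bucket assignment is shown below. It uses crc32 as a stand-in hash, which is an assumption for illustration and not the hash function Hive uses; the column name user_id and the bucket count are hypothetical.

```python
import zlib


def bucket_for(value: str, num_buckets: int) -> int:
    """Map a column value to a bucket index (illustrative hash, not Hive's)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def bucketize(rows: list, key: str, num_buckets: int) -> list:
    """Distribute rows into a fixed number of buckets by hashing one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(str(row[key]), num_buckets)].append(row)
    return buckets


if __name__ == "__main__":
    rows = [{"user_id": i % 5, "event": i} for i in range(20)]
    for index, bucket in enumerate(bucketize(rows, "user_id", 4)):
        print(index, sorted({row["user_id"] for row in bucket}))
```

All rows with the same key land in the same bucket, so when two tables are bucketed the same way on the join column, a join only has to pair up matching buckets.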
Optimizing Join Operations:
Use map-side joins for small tables that can fit into memory to avoid shuffling data across the network.
Use broadcast joins for joining a small table with a large table, broadcasting the small table to all nodes to avoid data shuffling.
Avoid cross joins (Cartesian products), as they can greatly increase data volume and degrade performance.
Optimize join order and join conditions to minimize the amount of data shuffled during join operations.
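The idea behind a map-side (broadcast) join can be sketched as an in-memory hash join: the small table is loaded into a lookup structure, and the large table is streamed past it once. The function and sample tables below are hypothetical and only illustrate the pattern.

```python
def map_side_join(large_rows: list, small_rows: list, key: str) -> list:
    """Inner join that builds an in-memory hash table from the small side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for row in large_rows:  # streamed once; no shuffle of the large side
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    orders = [
        {"order_id": 1, "country": "DE"},
        {"order_id": 2, "country": "FR"},
        {"order_id": 3, "country": "XX"},  # no match in the small table
    ]
    countries = [
        {"country": "DE", "name": "Germany"},
        {"country": "FR", "name": "France"},
    ]
    for row in map_side_join(orders, countries, "country"):
        print(row)
```

In Hive, the conversion of an eligible join to a map join is governed by settings such as hive.auto.convert.join; whether a table counts as small enough depends on the configured size threshold.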
Column Pruning:
Avoid using SELECT * and explicitly specify only the columns needed for the query results.
Column pruning reduces the amount of data read from disk and improves query performance.
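Why naming columns helps is easiest to see with a column-oriented layout, where each column is stored separately. In this sketch (a hypothetical in-memory table, not a real file format), selecting two columns never touches the wide notes column.

```python
# Hypothetical column-oriented layout: one list of values per column.
columnar_table = {
    "order_id": [1, 2, 3],
    "customer": ["ann", "bob", "cy"],
    "amount": [10.0, 25.5, 7.25],
    "notes": ["x" * 1000, "y" * 1000, "z" * 1000],  # wide column
}


def select(table: dict, columns: list) -> list:
    """Read only the requested columns and rebuild rows from them."""
    needed = {name: table[name] for name in columns}  # other columns are skipped
    return [dict(zip(needed, values)) for values in zip(*needed.values())]


if __name__ == "__main__":
    print(select(columnar_table, ["order_id", "amount"]))
```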
Optimizing File Formats:
Choose appropriate file formats such as ORC or Parquet, which are optimized for query performance and storage efficiency.
These file formats support compression and predicate pushdown, which can further improve query performance.
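Predicate pushdown can be pictured as skipping blocks of rows whose stored statistics rule out a match. The sketch below is a simplified model with a hypothetical Stripe class; real columnar files keep the minimum and maximum in metadata, whereas here they are computed from the rows for brevity.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """A block of values for the filtered column."""
    rows: list

    @property
    def maximum(self):
        # In a real file this would be read from metadata, not computed.
        return max(self.rows)


def scan_greater_than(stripes: list, threshold) -> tuple:
    """Return the matching values and the number of stripes actually read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe.maximum <= threshold:
            continue  # statistics show no row can match; skip the stripe
        stripes_read += 1
        matches.extend(value for value in stripe.rows if value > threshold)
    return matches, stripes_read


if __name__ == "__main__":
    data = [Stripe([1, 2, 3]), Stripe([10, 11]), Stripe([4, 20])]
    print(scan_greater_than(data, 9))
```

With a threshold of 9, the first stripe is skipped entirely and only two of the three stripes are read.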
Statistics Collection:
Collect table and column statistics using the ANALYZE TABLE command to help the query optimizer make better decisions.
Update statistics regularly, especially after data loading or significant data changes.
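If statistics are refreshed from a scheduled job, the statements can be generated rather than typed by hand. The helper below is a hypothetical sketch that renders ANALYZE TABLE statements; the table name and partition value are placeholders.

```python
def analyze_statement(table: str, partition_spec: str = "", for_columns: bool = False) -> str:
    """Render an ANALYZE TABLE statement for table-level or column-level stats."""
    parts = [f"ANALYZE TABLE {table}"]
    if partition_spec:
        parts.append(f"PARTITION ({partition_spec})")
    parts.append("COMPUTE STATISTICS")
    if for_columns:
        parts.append("FOR COLUMNS")
    return " ".join(parts) + ";"


if __name__ == "__main__":
    # Hypothetical table and partition value.
    print(analyze_statement("sales"))
    print(analyze_statement("sales", "dt='2024-05-14'", for_columns=True))
```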
Tuning Hive Configuration:
Adjust Hive configuration parameters such as memory allocation, parallelism settings, and query execution parameters based on the characteristics of your workload and cluster resources.
Monitor query performance and resource utilization to identify bottlenecks and fine-tune configuration settings accordingly.
07-27-2023
11:39 PM
Hi @itdm_bmi, after adding the jar, instead of ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerde' please try ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
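For reference, here is a small Python sketch that renders a table definition using the contrib class name and shows, in plain Python, what splitting on a multi-character delimiter looks like. The table name, columns, and the "~~" delimiter are hypothetical; field.delim is the SerDe property that carries the delimiter.

```python
SERDE_CLASS = "org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe"


def create_table_ddl(table: str, columns: list, delimiter: str) -> str:
    """Render a CREATE TABLE statement that uses the contrib MultiDelimitSerDe."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"ROW FORMAT SERDE '{SERDE_CLASS}' "
        f'WITH SERDEPROPERTIES ("field.delim"="{delimiter}");'
    )


def split_row(line: str, delimiter: str) -> list:
    """Plain-Python stand-in for splitting a row on a multi-character delimiter."""
    return line.split(delimiter)


if __name__ == "__main__":
    # Hypothetical table, columns, and delimiter.
    print(create_table_ddl("people", [("id", "int"), ("name", "string")], "~~"))
    print(split_row("1~~ann", "~~"))
```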
07-27-2023
01:41 AM
Hi @itdm_bmi, please try adding the second jar, "hive-contrib-2.1.1-cdh6.1.1.jar".
07-27-2023
01:16 AM
Hi @itdm_bmi, can you share the output of ls -l /opt/cloudera/parcels/CDH/jars | grep -i contrib? Also, may I know the exact ADD command you executed?
07-19-2023
10:19 PM
Hi @hanumanth, MSCK will run slowly the first time because it has to go through all the partitions and update HMS. You can recommend that users keep tables to around 10k partitions to avoid slowness or hangs in HMS.
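Conceptually, the repair compares the partition directories on storage with the partitions the metastore already knows and registers the difference. The Python sketch below models only that comparison, with hypothetical directory names; it is not the Hive implementation. Hive also has a hive.msck.repair.batch.size setting that controls how many partitions are sent to the metastore per batch, which the batches helper mimics.

```python
def missing_partitions(fs_dirs: list, metastore_partitions: list) -> list:
    """Partition directories present on storage but unknown to the metastore."""
    return sorted(set(fs_dirs) - set(metastore_partitions))


def batches(items: list, size: int):
    """Yield the items in fixed-size chunks, so each metastore call stays small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Hypothetical directories on storage and partitions already registered.
    on_storage = [f"dt=2023-07-{day:02d}" for day in range(1, 11)]
    registered = on_storage[:4]
    to_add = missing_partitions(on_storage, registered)
    for chunk in batches(to_add, 3):
        print("add", chunk)
```

On the first run nothing is registered yet, so every directory ends up in the list to add, which is why the initial repair is the slow one.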
07-13-2023
08:48 PM
@hanumanth If it's an external table, dropping it will not delete the data from HDFS, and when you recreate it, MSCK again has to go through all 36k partitions and scan them. This may overload the HMS for the time being and cause slowness.
06-27-2023
10:24 PM
Hi Hanu, if MSCK is hanging for a particular table, then there must be some problem with that table. What do you see in the HS2 or HMS logs when the command hangs? Are you seeing any errors in the logs? Can you share the thread that is running MSCK? Checking what that thread is doing will give an idea of where the problem is coming from.