Member since: 09-16-2021
Posts: 330
Kudos Received: 52
Solutions: 23

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 239 | 11-10-2024 11:19 PM
 | 370 | 10-25-2024 05:02 AM
 | 1941 | 09-10-2024 07:50 AM
 | 697 | 09-04-2024 05:35 AM
 | 1553 | 08-28-2024 12:40 AM
06-07-2024
04:12 AM
1 Kudo
@jayes Hive itself doesn't offer a built-in option like numfile to directly export table data into a specific number of files. However, you can achieve the same result with a couple of approaches:

1. Spark

Read the Hive table using Spark SQL (if it is a managed table, use an HWC session):

>>> df = spark.sql("select * from sample_table")

In Apache Spark, you can control the number of partitions in a DataFrame using the repartition or coalesce methods. Using the coalesce method, set the number of partitions for the DataFrame:

>>> coalesced_df = df.coalesce(5)

Write the data:

>>> coalesced_df.write.parquet("/tmp/coalesced_df")

Result:

[hive@node4 ~]$ hdfs dfs -ls -h /tmp/coalesced_df
Found 6 items
-rw-r--r-- 3 hive supergroup 0 2024-06-07 10:49 /tmp/coalesced_df/_SUCCESS
-rw-r--r-- 3 hive supergroup 135.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00000-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 200.0 M 2024-06-07 10:48 /tmp/coalesced_df/part-00001-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 68.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00002-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 155.4 M 2024-06-07 10:49 /tmp/coalesced_df/part-00003-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
-rw-r--r-- 3 hive supergroup 132.9 M 2024-06-07 10:49 /tmp/coalesced_df/part-00004-2e6bf3b8-53fa-4a6e-957e-83769c72e780-c000.snappy.parquet
[hive@node4 ~]$
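If you need an exact number of similarly sized output files, repartition (which does a full shuffle) is usually a better fit than coalesce, which only merges existing partitions and can leave the size imbalance visible in the listing above. A minimal PySpark sketch, reusing the table name from this example and a hypothetical output path:

from pyspark.sql import SparkSession

# Spark session with Hive support; for managed/ACID tables, read through an HWC session instead.
spark = SparkSession.builder.appName("export-n-files").enableHiveSupport().getOrCreate()

df = spark.sql("select * from sample_table")

# repartition(5) shuffles the data into 5 roughly equal partitions, so the job writes
# 5 similarly sized files; coalesce(5) would only merge the existing partitions.
df.repartition(5).write.mode("overwrite").parquet("/tmp/repartitioned_df")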
06-04-2024
12:02 AM
1 Kudo
The EXPORT TABLE command automatically launches a distributed copy (DistCp) job when the table location contains a large number of files, which improves efficiency for massive datasets. The size of the exported data will match the size of the table data. If needed, you can adjust the memory allocated to the DistCp job's mappers and reducers to optimize performance for your specific data size.
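As an illustration (not taken from the original thread), here is a minimal Python sketch that drives the export through the beeline CLI; the JDBC URL, database, table, and target path are placeholders to replace with your own:

import subprocess

# Placeholder connection details and paths.
JDBC_URL = "jdbc:hive2://hiveserver2-host:10000/default"
EXPORT_SQL = "EXPORT TABLE my_db.my_table TO '/tmp/my_table_export'"

# EXPORT TABLE copies the table's metadata and data files to the target HDFS directory.
# When Hive hands the copy to DistCp, the copy job's mapper/reducer memory is governed by
# the usual MapReduce memory settings (e.g. mapreduce.map.memory.mb); raise them if the
# copy struggles, and verify the exact plumbing for your distribution.
subprocess.run(["beeline", "-u", JDBC_URL, "-e", EXPORT_SQL], check=True)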
06-02-2024
11:45 PM
When exporting tables to HDFS in Hive, the creation and distribution of data files depend on several factors, such as the table structure (partitioned or non-partitioned), the underlying storage format, and cluster configuration.

Non-Partitioned Tables: For non-partitioned tables, Hive typically creates a single data file if the data is small enough to fit within a single HDFS block. In your case, a table with 5000 rows and a single column is likely small enough that Hive writes it into one data file (e.g., 000000_0).

Partitioned Tables: For partitioned tables, Hive creates a separate directory for each partition, and within each partition directory it creates the data files. The number of files in each partition directory depends on the size of the data and on your Hive and HDFS configuration settings.
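To make the directory layout concrete, here is a small PySpark sketch (an illustration with made-up columns and paths, not taken from the question) that writes the same DataFrame once unpartitioned and once partitioned, so the two HDFS layouts can be compared:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-layout-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "2024-01-01"), (2, "2024-01-01"), (3, "2024-01-02")],
    ["id", "dt"],
)

# Unpartitioned and small: everything lands in a single file under /tmp/layout/flat,
# analogous to Hive writing one 000000_0 file for a small non-partitioned table.
df.coalesce(1).write.mode("overwrite").parquet("/tmp/layout/flat")

# Partitioned: one sub-directory per partition value (dt=2024-01-01/, dt=2024-01-02/),
# each containing its own data files, like a partitioned Hive table's layout.
df.write.mode("overwrite").partitionBy("dt").parquet("/tmp/layout/partitioned")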
05-28-2024
09:22 PM
When an application or job that typically completes in a short time is taking significantly longer than expected, it's essential to troubleshoot systematically to identify and resolve the bottleneck. Here are some steps and areas to focus on when diagnosing performance issues in such scenarios:

1. Understand the Baseline and Gather Information
Historical Performance Data: Compare the current run with previous runs. Identify what has changed in terms of input size, configuration, environment, etc.
Logs and Metrics: Gather logs and metrics from the application, the YARN ResourceManager, and the NodeManagers.

2. Monitor Resource Utilization
CPU, Memory, and Disk Usage: Check resource usage on the nodes running the application. High CPU, memory, or disk I/O usage can indicate bottlenecks.
Network Utilization: Check network usage, especially if the job involves significant data transfer between nodes.

3. Examine YARN and Application Logs
YARN Logs: Access the logs through the YARN ResourceManager web UI, or pull them with the YARN CLI (see the sketch at the end of this post). Look for errors, warnings, and unusual delays.
Application Master (AM) Logs: Review the AM logs for any signs of retries, timeouts, or other issues.
Container Logs: Check the logs of individual containers for errors and performance issues.

4. Check for Resource Contention
NodeManager Logs: Look for signs of resource contention, such as high wait times for container allocation.
Cluster Load: Check whether other jobs are running concurrently and consuming significant resources.

5. Investigate Job Configuration
Parallelism: Ensure the job is correctly configured for parallel execution (e.g., the number of mappers and reducers in a MapReduce job).
Resource Allocation: Verify that the job has sufficient resources allocated (e.g., memory, vCores).

6. Data Skew and Distribution
Data Skew: Analyze the input data for skew. Uneven data distribution can cause some tasks to take much longer than others.
Task Distribution: Check whether certain tasks or stages are taking disproportionately longer.

7. Network and I/O Bottlenecks
Shuffle and Sort Phase: In Hadoop and Spark, the shuffle phase can be a bottleneck. Monitor shuffle performance and look for skew or excessive data transfer.
HDFS or Storage I/O: Ensure that the underlying storage (HDFS, S3, etc.) is performing optimally and is not a bottleneck.

8. Garbage Collection and JVM Tuning
GC Logs: If the application is JVM-based, check the garbage collection logs for excessive GC pauses.
JVM Heap Size: Verify that the JVM heap size is appropriately configured to avoid frequent GC.

9. Configuration Parameters and Tuning
YARN Configuration: Check for misconfigurations in YARN resource allocation settings.
Application-Specific Tuning: Tune parameters specific to the application framework (e.g., Spark, MapReduce).

10. External Dependencies
External Services: If the application interacts with external services (e.g., databases, APIs), ensure they are not the bottleneck.
Dependency Failures: Look for timeouts or failures in external service calls.

Detailed Steps for Specific Frameworks

For Hadoop MapReduce Jobs
Check the Job History Server: Analyze the job in the Job History Server web UI. Identify slow tasks and investigate their logs.
Analyze Task Attempts: Look for tasks that have failed and been retried multiple times, and identify tasks with unusually high execution times.

For Apache Spark Jobs
Spark UI: Use the Spark web UI to analyze stages, tasks, and jobs. Look for stages with long task durations or high task counts.
Executor Logs: Check the logs of individual Spark executors for errors and warnings.
Driver Logs: Examine the driver logs for signs of job bottlenecks or delays.

Conclusion
Systematically troubleshooting a job that is taking longer than usual involves a combination of monitoring resource utilization, examining logs, analyzing job configurations, and investigating data distribution and skew. By following these steps and using the right tools, you can identify and resolve performance bottlenecks effectively. If the issue persists, consider breaking the problem down further or seeking help from more detailed profiling tools or from experts familiar with your specific application framework and environment.
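The sketch below is a small, generic helper for step 3 above (the application ID and output path are placeholders): it pulls an application's status and its aggregated container logs with the standard YARN CLI so you can scan them for errors, retries, and long gaps between events.

import subprocess

def collect_yarn_diagnostics(app_id: str, out_file: str) -> None:
    """Fetch status and aggregated container logs for one YARN application."""
    # High-level state, progress, tracking URL, and the diagnostics string.
    subprocess.run(["yarn", "application", "-status", app_id], check=True)

    # Aggregated logs from all containers (requires YARN log aggregation to be enabled).
    with open(out_file, "w") as fh:
        subprocess.run(["yarn", "logs", "-applicationId", app_id], stdout=fh, check=True)

# Placeholder application ID -- replace with the slow run's ID from the RM UI.
collect_yarn_diagnostics("application_1700000000000_0001", "/tmp/app_logs.txt")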
05-28-2024
09:10 PM
Data Loss: When you perform an INSERT OVERWRITE operation in Hive, it completely replaces the data in the target table or partition. If the data is not inserted correctly, this can result in data loss.

Column Qualifiers: HBase stores data in a key-value format with rows, column families, and column qualifiers. Issues with specific column qualifiers can be due to schema mismatches or data type incompatibilities.

Upserting Data: Upserting (update or insert) into HBase via Hive can be challenging, since Hive primarily supports batch processing and has no native support for upsert operations. Note that HBase storage handler tables are external tables in Hive.

Best Practices and Troubleshooting
Schema Matching: Ensure that the schema of the Hive table matches the HBase table, especially the data types and column qualifiers (see the sketch below).
Data Types: Be cautious with data types. HBase stores everything as bytes, so type conversions must be handled properly.
Error Handling: Implement proper error handling and logging to identify issues during data insertion.
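For reference, this is the general shape of a Hive-on-HBase table definition in which the column-qualifier mapping has to line up with the Hive schema; all names here are hypothetical, and the DDL is driven through beeline from Python purely for illustration:

import subprocess

JDBC_URL = "jdbc:hive2://hiveserver2-host:10000/default"  # placeholder

# Each Hive column maps to the HBase row key or to a family:qualifier pair via
# hbase.columns.mapping; a mismatch here, or an incompatible type, is a common
# cause of nulls or failed INSERT OVERWRITE runs.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS hbase_hive_demo (
  rowkey STRING,
  name   STRING,
  amount BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:amount')
TBLPROPERTIES ('hbase.table.name' = 'demo_table')
"""

subprocess.run(["beeline", "-u", JDBC_URL, "-e", DDL], check=True)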
05-28-2024
03:39 AM
2 Kudos
The diagnostic message in the YARN ResourceManager UI indicates that the application has been added to the scheduler but has not yet been activated. The message explains why the ApplicationMaster (AM) assignment was skipped. Let's break down its components for a better understanding.

Diagnostic Message Breakdown
"Application is added to the scheduler and is not yet activated.": The application is recognized by the scheduler, but resource allocation and execution have not started.
"Skipping AM assignment as cluster resource is empty.": The ApplicationMaster assignment is skipped because there are no available resources in the cluster to fulfill the request.
"Details:": Additional information about the resource request and limits.
"AM Partition = <DEFAULT_PARTITION>;": The partition in which the AM is supposed to run. Here it is <DEFAULT_PARTITION>, which typically means the default resource pool for the cluster.
"AM Resource Request = <memory:2048, Cores:1>;": The resources requested for the ApplicationMaster: 2048 MB of memory and 1 core.
"Queue Resource Limit for AM = <memory:0, vCores:0>;": The maximum resources available for ApplicationMasters in the queue. Here it shows <memory:0, vCores:0>, indicating that no resources are currently available for AMs in the queue.
"User AM Resource Limit of the queue": This part of the message is truncated, but it generally refers to the per-user resource limits within the queue, i.e., the maximum resources a single user's applications can consume within the queue.

Explanation
The diagnostic message suggests that:
Resource Scarcity: The cluster currently has no available resources to assign to the ApplicationMaster. This could be because the cluster is fully utilized or because the specific queue does not have sufficient resources allocated or available.
Queue Limits: The queue the application belongs to has its AM resource limit at zero (<memory:0, vCores:0>), which means no resources are currently available for ApplicationMasters in this queue.
Activation Pending: The application is added to the scheduler, but activation (resource assignment and start) is pending due to the lack of available resources.

Possible Causes and Solutions
Cluster Resource Constraints: The cluster might be fully utilized, leaving no resources for new ApplicationMasters. Solution: Monitor and manage cluster resources; consider scaling the cluster or optimizing the current workload.
Queue Configuration Issues: The queue might have stringent limits or no resources allocated for ApplicationMasters. Solution: Review and adjust the queue configuration in capacity-scheduler.xml (or the equivalent configuration file) to ensure there are sufficient resources for AMs.
User Resource Limits: The user might have reached their resource quota in the queue. Solution: Check the per-user resource limits and adjust them if necessary to allow more resource allocation.

Steps to Diagnose Further
Check Cluster Resource Utilization: Use the ResourceManager web UI or the CLI to check the current resource utilization of the cluster (see the sketch below).
Review Queue Configurations: Inspect the queue configurations, particularly the settings for ApplicationMaster resource limits.
Inspect Application Logs: Look at the application logs for any additional diagnostics or error messages.
Consult YARN ResourceManager Logs: The ResourceManager logs can provide more context about why resources are unavailable or why the AM assignment is being skipped.

By understanding and addressing the issues highlighted in this diagnostic message, you can ensure that your YARN applications get the resources they need to run effectively.
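As a starting point for the first two diagnostic steps, this small sketch (the queue name is a placeholder) uses the standard YARN CLI to dump live node capacities and the target queue's configured limits:

import subprocess

# Per-node resource totals and usage -- a fully used or empty cluster shows up here.
subprocess.run(["yarn", "node", "-list", "-all"], check=True)

# Queue state, capacity, and current utilization for the queue the application targets;
# "default" is a placeholder for your queue name.
subprocess.run(["yarn", "queue", "-status", "default"], check=True)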
03-27-2024
11:44 PM
2 Kudos
The error message indicates that Tableau is having trouble connecting to your "ShowData" data source and that there is an issue with the SQL query it is trying to run against your Hive database. Let's break down the error and potential solutions.

Error Breakdown
Bad Connection: Tableau can't establish a connection to the Hive database.
Error Code B19090E0: A generic Tableau error for connection issues.
Error Code 10002: A Hive-specific error related to the SQL query.
SQL state TStatus(statusCode:ERROR_STATUS): Hive is encountering an error during query processing.
Invalid column reference 'tableausql.fieldname': The specific error points to an invalid column reference in the query.

Potential Solutions
Verify the Database Connection: Ensure the Hive server is running and accessible from Tableau. Double-check the connection details in your Tableau data source configuration, including server address, port, username, and password.
Review the SQL Query: The error message highlights "tableausql.fieldname" as an invalid column reference. Check whether this field name actually exists in your Hive table; there might be a typo or a case-sensitivity issue. If "tableausql" is a prefix Tableau adds, ensure it is not conflicting with your actual column names.
Check for Unsupported Functions: In rare cases, Tableau might try to use functions not supported by Hive.
03-20-2024
11:40 PM
1 Kudo
I followed the steps below to read the Hive table from the Spark side with a credential store:

FROM MYSQL

CREATE USER IF NOT EXISTS 'gopi'@'%' IDENTIFIED BY 'gopi';
GRANT ALL PRIVILEGES ON *.* TO 'gopi'@'%';
FLUSH PRIVILEGES;
create database if not exists test;
use test;
CREATE TABLE test.EMPLOYEE(
id INT,
name varchar(255),
salary DECIMAL,
dob DATE NOT NULL DEFAULT '2021-05-01',
doj TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP(),
PRIMARY KEY (id)
);
INSERT INTO test.EMPLOYEE (id, name, salary, dob, doj) VALUES (1, "gopi", 10000.00, '1988-06-01', '2020-03-16 09:00:01.000000');
INSERT INTO test.EMPLOYEE (id, name, salary,dob) VALUES (2, "Nishanth", 50000.00, '2018-05-29');
INSERT INTO test.EMPLOYEE (id, name, salary) VALUES (3, "Raja", 30000.00);

Create the credential store:

sudo -u hive hadoop credential create gopi_user.password -v gopi -provider jceks://hdfs/user/hive/gopi_user.jceks
hadoop credential list -provider jceks://hdfs/user/hive/gopi_user.jceks
sudo -u hive hdfs dfs -chmod 400 /user/hive/gopi_user.jceks

FROM HIVE

USE db_test;
drop table if exists db_test.employee2;
CREATE EXTERNAL TABLE db_test.employee2(
id INT,
name STRING,
salary DOUBLE,
dob DATE,
doj TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://ccycloud-1.nightly7x-us-gr.root.comops.site:3306/test",
"hive.sql.dbcp.username" = "gopi",
"hive.sql.dbcp.password.keystore" ="jceks://hdfs/user/hive/gopi_user.jceks",
"hive.sql.dbcp.password.key" = "gopi_user.password",
"hive.sql.query" = "select * from test.EMPLOYEE"
);

FROM SPARK

sudo -u hive spark-shell \
--jars /opt/cloudera/parcels/CDH/jars/hive-jdbc-handler-3.1.3000.7.2.18.0-622.jar,/usr/share/java/mysql-connector-java.jar
scala> spark.sql("SELECT * FROM db_test.employee1").show()
+---+--------+-------+----------+-------------------+
| id| name| salary| dob| doj|
+---+--------+-------+----------+-------------------+
| 1| gopi|10000.0|1988-06-01|2020-03-16 09:00:01|
| 2|Nishanth|50000.0|2018-05-29|2024-02-27 10:39:22|
| 3| Raja|30000.0|2021-05-01|2024-02-27 10:39:30|
+---+--------+-------+----------+-------------------+
scala> spark.sql("SELECT * FROM db_test.employee2").show()
+---+--------+-------+----------+-------------------+
| id| name| salary| dob| doj|
+---+--------+-------+----------+-------------------+
| 1| gopi|10000.0|1988-06-01|2020-03-16 09:00:01|
| 2|Nishanth|50000.0|2018-05-29|2024-02-27 10:39:22|
| 3| Raja|30000.0|2021-05-01|2024-02-27 10:39:30|
+---+--------+-------+----------+-------------------+
03-06-2024
12:38 AM
It seems like there might be an issue with the way you're using single quotes in the loop. The variable eachline should be expanded, but it won't be if it's enclosed in single quotes. Try using double quotes around the variable and see if that resolves the issue. Here's the corrected loop:

for eachline in "${testarray[@]}"
do
beeline -f "${eachline}.sql"
done

This way, the value of eachline will be correctly expanded when the command is constructed. Also, ensure that the SQL files are present in the correct path, or provide the full path if needed. If the issue persists, please provide more details about the error message or behavior you're experiencing for further assistance.
03-06-2024
12:31 AM
Hive relies on the schema definition provided at table creation and does not perform automatic type conversion while loading data. If there is a mismatch between a data type in the CSV file and the expected data type in the Hive table, it may result in null or incorrect values. Use the CAST function to explicitly convert the data types in the INSERT statement:

INSERT INTO TABLE target_table
SELECT
CAST(column1 AS INT),
CAST(column2 AS STRING),
...
FROM source_table;

Alternatively, preprocess your CSV data before loading it into Hive. You can use tools like Apache NiFi or custom scripts to clean and validate the data before ingestion (see the sketch below). Remember to thoroughly validate and clean your data before loading it into Hive to avoid unexpected issues; the choice of method depends on your specific use case and the level of control you want over the data loading process.
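If you prefer to handle the conversion outside of Hive, a PySpark pass over the CSV (the path, column names, and target table here are hypothetical) can cast the columns explicitly before the data lands in the table:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("csv-clean-load").enableHiveSupport().getOrCreate()

# Read the raw CSV as strings, then cast explicitly so bad values surface as nulls
# you can inspect, instead of silently loading mistyped data.
raw = spark.read.option("header", "true").csv("/tmp/input/source_data.csv")

cleaned = raw.select(
    col("column1").cast("int").alias("column1"),
    col("column2").cast("string").alias("column2"),
)

# Append into the existing Hive table (insertInto matches columns by position).
cleaned.write.insertInto("target_table")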