About ggangadharan

ggangadharan · ‎11-21-2023

It seems like you want to run the Tez example "OrderedWordCount" using the tez-examples*.jar file. The OrderedWordCount example is part of the Tez examples and demonstrates how to perform a word count with ordering. Assuming you have Tez installed on your system, you can follow these steps: export TEZ_CONF_DIR=/etc/tez/conf/ export TEZ_HOME=/opt/cloudera/parcels/CDH/lib/tez/ export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_HOME}/bin/*:${TEZ_HOME}/* yarn jar ${TEZ_HOME}/bin/tez-examples-*.jar orderedwordcount /somewhere/input /somewhere/output

ggangadharan · ‎11-21-2023

The error message indicates that there is an issue related to resource allocation in YARN, the resource manager in Hadoop. Specifically, the error mentions that the requested resource exceeds the maximum allowed allocation. Here are some steps you can take to address this issue: Review YARN Configuration: Check the YARN configuration settings, particularly those related to resource allocation. Look for properties such as yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores. Ensure that the values configured for these properties are sufficient for the resources needed by HiveServer2. Increase Maximum Allocation: If the error persists, you might need to increase the maximum allocation for memory and vCores in the YARN scheduler configuration. Update the yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores properties in the YARN configuration files. Check NodeManager Resources: Verify the resources available on the NodeManagers in your cluster. The maximum allowed allocation is calculated based on the maximum resources of registered NodeManagers. If the NodeManagers have sufficient resources, you can adjust the YARN configuration accordingly. Monitor Resource Usage: Monitor the resource usage in your YARN cluster using tools like the ResourceManager UI or the YARN command-line tools (yarn top, yarn node -list -all, etc.). Identify any patterns of resource exhaustion or contention that could be causing the issue. Review Hive Configuration: Review the Hive configurations related to resource allocation, such as hive.tez.container.size and other relevant settings. Ensure that they are appropriately configured for your cluster. After making any configuration changes, restart the affected services (YARN, HiveServer2) for the changes to take effect.

ggangadharan · ‎11-21-2023

To achieve your goal of loading data from all the latest files in each folder into a single DataFrame, you can collect the file paths from each folder in a list and then load the data into the DataFrame outside the loop. Here's a modified version of your code: import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ import org.apache.spark.sql.Row val static_path = "/user/hdfs/test/partition_date=" val hours = 3 // Creating list of each folder. val paths = (0 until hours) .map(h => currentTs.minusHours(h)) .map(ts => s"${static_path}${ts.toLocalDate}/hour=${ts.getHour}") .toList // Collect the latest file paths from each folder in a list val latestFilePaths = paths.flatMap { eachfolder => val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration) val pathstatus = fs.listStatus(new Path(eachfolder)) val currpathfiles = pathstatus.map(x => (x.getPath.toString, x.getModificationTime)) val latestFilePath = currpathfiles .filter(_._1.endsWith(".csv")) .sortBy(_._2) .reverse .headOption .map(_._1) latestFilePath } // Load data from all the latest files into a single DataFrame val df = spark.read.format("csv").load(latestFilePaths: _*) // Show the combined DataFrame df.show() In this modified code: latestFilePaths is a list that collects the latest file path from each folder. Outside the loop, spark.read.format("csv").load(latestFilePaths: _*) is used to load data from all the latest files into a single DataFrame. Now, df will contain data from all the latest files in each folder, and you can perform further operations or analysis on this combined DataFrame.

ggangadharan · ‎11-21-2023

In Hive, metadata related to tables and columns is typically stored in the 'hive' database, specifically within the 'TBLS' and 'COLUMNS_V2' tables in the 'metastore' database. It is not recommended for users to query the metadata directly. Instead, users can leverage the 'sys' database tables. Here is a modified query that utilizes the 'hive' database tables: sql USE sys; -- Get the count of columns for all tables SELECT t.tbl_name AS TABLE_NAME, COUNT(c.column_name) AS COLUMN_COUNT FROM tbls t JOIN columns_v2 c ON t.tbl_id = c.cd_id GROUP BY t.tbl_name; Explanation: The 'sys.tbls' table contains information about tables, while the 'sys.columns_v2' table contains information about columns. We join these tables on the 'TBL_ID' and 'CD_ID' columns to retrieve information about columns for each table. The 'COUNT(c.COLUMN_NAME)' expression calculates the count of columns for each table. This query provides a list of tables along with the count of columns for each table, using the 'sys' database tables."

ggangadharan · ‎11-21-2023

The error message indicates that there is an inconsistency between the expected schema for the column 'db.table.parameter_11' and the actual schema found in the Parquet file 'hdfs:/path/table/1_data.0.parq'. The column type is expected to be a STRING, but the Parquet schema suggests that it is an optional int64 (integer) column. To resolve this issue, you'll need to investigate and potentially correct the schema mismatch. Here are some steps you can take: Verify the Expected Schema: Check the definition of the 'db.table.parameter_11' column in the Impala metadata or Hive metastore. Ensure that it is defined as a STRING type. Inspect the Parquet File Schema: You can use tools like parquet-tools to inspect the schema of the Parquet file directly. Run the following command in the terminal: bash parquet-tools schema 1_data.0.parq Look for the 'db.table.parameter_11' column and check its data type in the Parquet schema. Compare Expected vs. Actual Schema: Compare the expected schema for 'db.table.parameter_11' with the actual schema found in the Parquet file. Identify any differences in data types. Investigate Data Inconsistencies: If there are data inconsistencies, investigate how they might have occurred. It's possible that there was a schema evolution or a mismatch during the data writing process. Resolve Schema Mismatch: Depending on your findings, you may need to correct the schema mismatch. This could involve updating the metadata in Impala or Hive to match the actual schema or adjusting the Parquet file schema. Update Impala Statistics: After resolving the schema mismatch, it's a good practice to update Impala statistics for the affected table. This can be done using the COMPUTE STATS command in Impala: This step ensures that Impala has up-to-date statistics for query optimization. Here's a high-level example of what the Parquet schema inspection might look like: parquet-tools schema 1_data.0.parq Look for the 'db.table.parameter_11' column and check its data type in the Parquet schema. If the data type in the Parquet schema is incorrect, you may need to investigate how the data was written and whether there were any issues during that process. Correcting the schema mismatch and updating Impala statistics should help resolve the issue.

ggangadharan · ‎11-21-2023

The error indicates that the Hive Server Interactive (HSI) component is failing to start because the LLAP (Live Long and Process) app associated with it couldn't be started. To troubleshoot and resolve this issue, you can follow these general steps: Check LLAP Log Files: Look into the LLAP log files for more detailed error messages. These log files are typically located in a directory like /var/log/hive or a custom location configured in your environment. Examine the LLAP logs to identify any specific errors or issues that are preventing LLAP from starting. Verify LLAP Configuration: Check the LLAP configuration settings, including memory configurations, queue configurations, and other LLAP-specific parameters. Ensure that the configuration is correct and appropriate for your cluster resources. Verify that there are no typos or errors in the LLAP configuration files . Check Resource Availability: Ensure that there are sufficient resources (memory, CPU, etc.) available on the nodes where LLAP is supposed to run. Verify that LLAP is not competing for resources with other applications or services on the cluster. Check Hive Server Interactive Configuration: Review the configuration settings for Hive Server Interactive. Verify that the LLAP configuration is correctly specified in the Hive Server Interactive configurations. Ensure that the LLAP application name, number of instances, and other LLAP-related settings are accurate. Examine System Logs: Check the system logs on the nodes where LLAP is supposed to run. Look for any system-level issues or errors that might be affecting LLAP startup. Restart LLAP Manually: If LLAP fails to start during Hive Server Interactive startup, consider manually starting LLAP to see if you can get more detailed error messages. You can use commands like hive --service llap --start or the Ambari UI to start LLAP separately. Check for LLAP Process: After trying to start LLAP manually, check if the LLAP process is running. You can use tools like ps or jps to see if the LLAP daemon process is running on the expected nodes. Review Ambari Alerts: Check the Ambari Alerts for any warnings or errors related to Hive Server Interactive or LLAP. Ambari often provides helpful alerts and diagnostics. If the LLAP process is still not starting, the detailed logs and error messages should provide more insights into the root cause of the issue. Addressing the specific error or misconfiguration mentioned in the logs will be crucial in resolving the problem

ggangadharan · ‎11-16-2023

When a Pig job gets stuck after creating the JobID, there could be several reasons for this behavior. Here are some common issues and solutions. Data Size and Complexity: Check the size and complexity of your data. If the dataset is very large, the storage operation may take a significant amount of time. Optimize your Pig script if possible, and consider processing a smaller subset of the data for testing. Resource Allocation: Ensure that your Hadoop cluster has sufficient resources allocated for the Pig job. Insufficient memory or available resources can lead to job failures or delays. Check the resource configuration in your Hadoop cluster and adjust it accordingly. Job Monitoring: Use Hadoop JobTracker or ResourceManager web interfaces to monitor the progress of your Pig job. This can provide insights into where the job might be stuck. Look for any error messages or warnings in the logs. Logs and Debugging: Examine the Pig logs for any error messages or stack traces. This can help identify the specific issue causing the job to hang. Enable debugging in Pig by adding -Dmapred.job.tracker=<your_job_tracker> to your Pig command, and check the debug logs for more information. Permissions and Path: Ensure that the specified output path /users/emp/empsalinc is writable by the user running the Pig job. Check for any permission issues or typos in the path. Network Issues: Network issues or connectivity problems between nodes in your Hadoop cluster can also cause jobs to hang. Check the network configuration and try running simpler jobs to isolate the issue. Pig Version Compatibility: Ensure that the version of Pig you are using is compatible with your Hadoop distribution. Incompatibility can lead to unexpected issues. Configuration Settings: Review your Pig script and ensure that the configuration settings are appropriate for your environment. Adjust parameters like mapred.job.queue.name, mapreduce.job.queuename, etc., as needed. Custom UDFs: If your Pig script uses custom User Defined Functions (UDFs), ensure that they are correctly implemented and compatible with the version of Pig you are using. By investigating these aspects, you should be able to identify the root cause of the job getting stuck after creating the JobID and take appropriate action to resolve the issue

ggangadharan · ‎11-09-2023

Hi @HadoopHero , If the query involves dynamic partitioning, one potential issue is that 'hive.optimize.sort.dynamic.partition.threshold' may limit the number of open record writers to just one per partition value, resulting in the creation of only one file. To investigate this, could you attempt disabling 'hive.optimize.sort.dynamic.partition.threshold' entirely? SET hive.optimize.sort.dynamic.partition.threshold=-1; Note : The problem statement contains a typo in the config name

ggangadharan · ‎11-06-2023

The error message "HiveServer2Error: ImpalaRuntimeException: Error making 'add_partitions' RPC to Hive Metastore" typically indicates a problem when Impala, a distributed SQL query engine, tries to interact with the Hive Metastore service to add partitions. This error can be caused by several factors, and it usually points to an issue with the Hive Metastore service or the interaction between Impala and Hive. Here are some common causes and troubleshooting steps for this error: Hive Metastore Service Issues: Check if the Hive Metastore service is up and running. You should ensure that the Hive Metastore service is started and healthy. Verify that the Hive Metastore service is reachable from the machine where Impala is running. Network issues or firewall rules could prevent proper communication. Metastore Configuration: Verify the Metastore configuration in the Impala configuration files (impala-site.xml). Ensure that the Metastore URIs and authentication settings are correctly configured. Metastore Database Issues: Check the health and availability of the underlying database used by the Hive Metastore. Ensure that it's accessible, and there are no database connection issues. Verify that the Metastore database is not overwhelmed or experiencing performance problems. Authorization and Authentication: Verify that the Impala service has the necessary privileges and permissions to interact with the Hive Metastore. Check if Kerberos authentication is enabled, and ensure that the necessary credentials and keytabs are correctly configured. Log Analysis: Examine the logs of both Impala and Hive Metastore services for more detailed error messages. The logs may provide additional information about the root cause of the issue. Resource Limitations: Check if there are any resource limitations (e.g., memory, CPU) on the machines running Impala and the Hive Metastore. Resource shortages can lead to RPC failures. Software Versions: Ensure that Impala and Hive are compatible in terms of versions and dependencies. An incompatible combination of software versions can lead to errors. Cluster Issues: If you are running Impala in a distributed cluster, verify the overall health of the cluster. Other cluster-level issues can sometimes affect the interaction with the Hive Metastore. Network Issues: Check for network-related problems, such as DNS resolution or proxy settings, which can impede communication between Impala and the Hive Metastore. Database Locks: Database locks in the Metastore can sometimes cause issues. Check if there are any locks in the Hive Metastore database. If you have access to detailed logs or additional error messages, those can be particularly helpful in diagnosing the specific problem that led to this error. Depending on your environment and configurations, the resolution may involve addressing one or more of the above factors.

ggangadharan · ‎11-06-2023

The behavior you're observing is related to the precision differences between STRING and FLOAT data types. When you cast a STRING to a FLOAT, Hive attempts to interpret and represent the value as accurately as possible within the constraints of a FLOAT data type. FLOATs are limited in precision, and the fractional part might not be represented exactly. In your example, "5724.95" in FLOAT was stored as "5724.9501953125." This discrepancy is due to the way binary floating-point numbers work and how they might not be able to precisely represent certain decimal values. If you need exact decimal representation, you should consider using a DECIMAL data type instead of FLOAT. DECIMAL provides higher precision and is better suited for scenarios where you need to maintain the exact decimal value without potential loss of precision. Here's how you can cast your STRING column to DECIMAL to preserve the exact decimal value: SELECT a, CAST(a AS DECIMAL(20, 10)) AS exact_value FROM your_table; In this example, DECIMAL(20, 10) indicates a decimal type with a total width of 20 digits and 10 decimal places. This will preserve the exact decimal representation you need. Keep in mind that DECIMAL has higher storage requirements compared to FLOAT because it maintains precision, so choose the appropriate data type based on your requirements. Example : 0: jdbc:hive2://nightly-71x-zg-2.nightly-71x-> SELECT CAST('5724.9501953125' AS DECIMAL(20, 10)) AS decimal_value ; INFO : Compiling command(queryId=hive_20231106110627_aaa98b66-4db6-4307-9be3-598018c13fbf): SELECT CAST('5724.9501953125' AS DECIMAL(20, 10)) AS decimal_value INFO : Semantic Analysis Completed (retrial = false) INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:decimal_value, type:decimal(20,10), comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20231106110627_aaa98b66-4db6-4307-9be3-598018c13fbf); Time taken: 0.062 seconds INFO : Executing command(queryId=hive_20231106110627_aaa98b66-4db6-4307-9be3-598018c13fbf): SELECT CAST('5724.9501953125' AS DECIMAL(20, 10)) AS decimal_value INFO : Completed executing command(queryId=hive_20231106110627_aaa98b66-4db6-4307-9be3-598018c13fbf); Time taken: 0.006 seconds INFO : OK +------------------+ | decimal_value | +------------------+ | 5724.9501953125 | +------------------+

Online	Offline
Last Visited	‎10-10-2025 07:39 PM

Member Since	‎09-16-2021 02:45 AM
Last Visited	‎10-10-2025 07:39 PM
Posts	420
Kudos received	54

Cloudera Community

Re: Using Hadoop Iceberg catalog with Hive engine ...

Re: Where can I find the Maven repository for HDP ...

Re: Failed with exception java.io.IOException:org....

Re: Hive on TEZ memory footprint and Impala stats...

Re: Hive little query slow

Re: tez not working properly

Re: Hiveserver2 Is Going Down Frequently while usi...

Re: Loading HDFS Files In a Spark Dataframe Using ...

Re: Query to find the count of columns for all tab...

Re: How to fix Parquet schema: optional int64 amou...

Re: Hive LLAP cant start

Re: pig while storing got stuck after creating the...

Re: Possibility Split Parquet file

Re: HiveServer2Error: ImpalaRuntimeException: Erro...

Re: Hive: cast String to Float alters decimal par...