Member since: 09-16-2021
Posts: 337
Kudos Received: 53
Solutions: 27
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 298 | 11-22-2024 05:29 AM |
 | 167 | 11-15-2024 06:38 AM |
 | 386 | 11-13-2024 07:12 AM |
 | 420 | 11-10-2024 11:19 PM |
 | 570 | 10-25-2024 05:02 AM |
10-10-2023
09:58 AM
When users query a Hive table partitioned on a specific column (in your case, "source system name") but do not include a filter condition on that partition column, Hive may need to scan all partitions of the table to retrieve the relevant data. This makes queries less efficient, as unnecessary data is read from multiple partitions.

In your scenario, where you perform frequent insert overwrites to keep only the current data, the table may not grow drastically in total data volume. However, if users frequently query the table without a condition on the partition column, it can still result in increased query processing time and resource utilization.

To improve query efficiency in this situation, you have a few options:

- Partition pruning: Encourage users to include the partition column condition in their queries. Hive's built-in partition pruning optimization skips unnecessary partitions when the partition column condition is provided (a sketch follows below).
- Materialized views: If certain common query patterns exist, consider creating materialized views that pre-aggregate or pre-filter data based on those patterns. This can significantly speed up queries that align with the materialized views.
- Optimize data layout: Ensure that the data is stored efficiently, and consider columnar storage formats like ORC or Parquet, which can improve query performance.

Ultimately, the choice of optimization strategy depends on the specific usage patterns and requirements of your users. It's essential to monitor query performance and understand your users' query behavior to determine which optimization approaches are most effective.
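To illustrate the first option, here is a minimal sketch of the difference (the table and column names are assumptions, not your actual schema):

-- Table partitioned on source_system_name (hypothetical).
-- Without a filter on the partition column, Hive scans every partition:
SELECT * FROM txn_data WHERE txn_date = '2023-10-01';

-- With the filter, partition pruning reads only the matching partition:
SELECT *
FROM txn_data
WHERE source_system_name = 'sap'   -- partition column condition
  AND txn_date = '2023-10-01';

Running EXPLAIN DEPENDENCY on both versions should show the difference in the input_partitions list.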
10-09-2023
06:27 AM
Could you kindly provide the DDL and a sample dataset to facilitate a more in-depth explanation?
10-03-2023
02:07 PM
@yucai Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
10-03-2023
02:15 AM
It seems that the query involves dynamic partitioning, but the dynamic partition column is not included in either the select statement or the Common Table Expression (CTE). Please add the dynamic partition column 'date' to the select statement and validate it in Beeline.
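For reference, a minimal sketch of the expected shape (the table and column names are assumptions, and `date` is backtick-quoted because it is a reserved word in Hive):

-- Allow fully dynamic partitioning for this session (illustrative):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target_tbl PARTITION (`date`)
SELECT
  col_a,
  col_b,
  `date`   -- the dynamic partition column must be the last column selected
FROM source_tbl;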
10-03-2023
02:07 AM
It appears that you're currently following a two-step process: writing data to a Parquet table and then using that Parquet table to write to an ORC table. You can streamline this by writing the data directly into the ORC table, eliminating the need to write the same data to Parquet and read it back.

Ref -
https://spark.apache.org/docs/2.4.0/sql-data-sources-hive-tables.html
https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/developing-spark-applications/topics/spark-loading-orc-data-predicate-push-down.html
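As a minimal sketch, assuming the incoming DataFrame is registered as a temporary view named staging_view and orc_tbl is a Hive table stored as ORC:

-- Executed via spark.sql(...): write straight into the ORC table,
-- skipping the intermediate Parquet table entirely:
INSERT OVERWRITE TABLE orc_tbl
SELECT * FROM staging_view;

This reads the source data only once instead of writing it to Parquet and reading it back.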
10-03-2023
01:58 AM
You can use the following HQL (SQL) query to find the sum of new cases for each continent and identify the country with the highest number of cases in each continent, along with the corresponding case count:

WITH ContinentSum AS (
  SELECT
    continent,
    SUM(new_cases) AS total_new_cases
  FROM
    sample_table_test
  GROUP BY
    continent
),
CountryTotals AS (
  SELECT
    continent,
    location AS country,
    MAX(total_case) AS max_cases
  FROM
    sample_table_test
  GROUP BY
    continent, location
),
CountryMaxCases AS (
  SELECT
    continent,
    country,
    max_cases,
    ROW_NUMBER() OVER (
      PARTITION BY continent
      ORDER BY max_cases DESC
    ) AS rn
  FROM
    CountryTotals
)
SELECT
  cs.continent,
  cs.total_new_cases,
  cm.country,
  cm.max_cases
FROM
  ContinentSum cs
JOIN
  CountryMaxCases cm
ON
  cs.continent = cm.continent
WHERE
  cm.rn = 1;

This query first calculates the sum of new cases for each continent in the ContinentSum CTE (Common Table Expression). The CountryTotals CTE then computes each country's case count, and CountryMaxCases ranks the countries within each continent so that rn = 1 marks the country with the highest case count. Finally, the two results are joined on continent to provide the desired output.

Sample result set for the shared data:

+---------------+---------------------+--------------+---------------+
| cs.continent | cs.total_new_cases | cm.country | cm.max_cases |
+---------------+---------------------+--------------+---------------+
| Asia | 25.0 | Afghanistan | 25.0 |
+---------------+---------------------+--------------+---------------+
09-30-2023
11:39 AM
To pinpoint the root cause, kindly provide a few samples of the data.
09-29-2023
05:27 AM
To gain a better understanding of the issue, kindly provide HS2 jstacks at 30-second intervals until the query completes.
09-29-2023
05:25 AM
The stack traces for Error 1 and Error 3 are incomplete. To gain a better understanding of the issue, please provide the complete stack traces; sharing the complete appLogs will give a comprehensive view of the situation.

Regarding Error 2, it appears that the job is attempting to create over 2000 dynamic partitions on a single node, which is unusual behavior. Please review the partition column values for correctness. If everything appears to be in order, you can consider adjusting the following configurations (an illustrative example follows the list):

- hive.exec.max.dynamic.partitions
- hive.exec.max.dynamic.partitions.pernode
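A minimal sketch of raising those limits for the session (the values are illustrative; size them to the number of partitions the job actually creates):

SET hive.exec.max.dynamic.partitions=5000;          -- default is 1000
SET hive.exec.max.dynamic.partitions.pernode=3000;  -- default is 100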
09-29-2023
05:17 AM
It appears that the Hive Metastore (HMS) is unable to establish a connection with the BackendDB, possibly due to an incorrect hostname or BackendDB configuration within the Hive service; the SocketTimeoutException in the trace below points to a network-level failure (host, port, or firewall) rather than an authentication problem. Please validate the BackendDB configurations and attempt to start the service again.

Exception in thread "main" java.lang.RuntimeException: org.postgresql.util.PSQLException: The connection attempt failed.
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:203)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.printTableCount(HiveMetastoreDbUtil.java:284)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.main(HiveMetastoreDbUtil.java:354)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:297)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:217)
at org.postgresql.Driver.makeConnection(Driver.java:458)
at org.postgresql.Driver.connect(Driver.java:260)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at com.cloudera.enterprise.dbutil.SqlRunner.open(SqlRunner.java:193)
at com.cloudera.enterprise.dbutil.SqlRunner.getDatabaseName(SqlRunner.java:264)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:197)
... 2 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.postgresql.core.PGStream.<init>(PGStream.java:81)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:93)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:197)