Member since: 09-16-2021
Posts: 421
Kudos Received: 55
Solutions: 39

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 234 | 10-22-2025 05:48 AM |
|  | 312 | 09-05-2025 07:19 AM |
|  | 795 | 07-15-2025 02:22 AM |
|  | 1347 | 06-02-2025 06:55 AM |
|  | 1627 | 05-22-2025 03:00 AM |
09-30-2023
11:39 AM
To pinpoint the root cause, kindly provide a few samples of data
09-30-2023
11:36 AM
It seems that the failure occurred within the Tez job's child tasks. To identify the root cause, please share the YARN logs of the failed application
09-29-2023
05:27 AM
To gain a better understanding of the issue, kindly provide the HS2 jstacks at 30-second intervals until the query completes
09-29-2023
05:25 AM
The stack traces for Error 1 and Error 3 are incomplete. To gain a better understanding of the issue, please provide the complete stack traces; sharing the complete appLogs will give a comprehensive view of the situation.

Regarding Error 2, the job appears to be attempting to create over 2000 dynamic partitions on a single node, which is unusual. Please review the partition column values for correctness. If everything appears to be in order, you can consider adjusting the following configurations, for example as shown below:

hive.exec.max.dynamic.partitions
hive.exec.max.dynamic.partitions.pernode
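A minimal sketch of how those limits could be raised at the session level before re-running the insert; the table names and values here are illustrative only, not recommendations:

-- Enable dynamic partitioning and raise the limits for this session only;
-- choose values slightly above the number of partitions the job actually creates.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=5000;         -- total across the whole job
SET hive.exec.max.dynamic.partitions.pernode=3000; -- limit per node

-- Hypothetical insert that triggers the dynamic partition creation
INSERT OVERWRITE TABLE sales PARTITION (sale_date)
SELECT id, amount, sale_date FROM sales_staging;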
09-29-2023
05:17 AM
It appears that the Hive Metastore (HMS) is unable to establish a connection with the BackendDB, possibly due to an incorrect hostname or BackendDB configuration within the Hive service. Please validate the BackendDB configurations and attempt to start the service again.

Exception in thread "main" java.lang.RuntimeException: org.postgresql.util.PSQLException: The connection attempt failed.
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:203)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.printTableCount(HiveMetastoreDbUtil.java:284)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.main(HiveMetastoreDbUtil.java:354)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:297)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:217)
at org.postgresql.Driver.makeConnection(Driver.java:458)
at org.postgresql.Driver.connect(Driver.java:260)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at com.cloudera.enterprise.dbutil.SqlRunner.open(SqlRunner.java:193)
at com.cloudera.enterprise.dbutil.SqlRunner.getDatabaseName(SqlRunner.java:264)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:197)
... 2 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.postgresql.core.PGStream.<init>(PGStream.java:81)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:93)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:197)
09-27-2023
07:43 AM
If the partition data exists in a layout like the one below:

<s3:bucket>/<some_location>/<part_column>=<part_value>/<filename>

you can create an external table specifying the above location and run 'msck repair table <table_name> sync partitions' to sync the partitions. Validate the data by running some sample SELECT statements. Once that is done, you can create a new external table on the other bucket and run an INSERT statement with dynamic partitioning, for example as sketched below. Ref - https://cwiki.apache.org/confluence/display/hive/dynamicpartitions
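A minimal sketch of that flow, assuming hypothetical table names, columns, bucket names, and a Parquet file format (adjust all of these to your actual data):

-- 1. External table over the existing partitioned data in the source bucket
CREATE EXTERNAL TABLE sales_src (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET
LOCATION 's3a://source-bucket/some_location/';

-- Register the partition directories already present under that location
MSCK REPAIR TABLE sales_src SYNC PARTITIONS;

-- Validate with a sample query
SELECT * FROM sales_src LIMIT 10;

-- 2. New external table on the other bucket, loaded via dynamic partitioning
CREATE EXTERNAL TABLE sales_tgt (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET
LOCATION 's3a://target-bucket/sales/';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE sales_tgt PARTITION (sale_date)
SELECT id, amount, sale_date FROM sales_src;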
09-27-2023
07:12 AM
1 Kudo
In CDP, when the HiveProtoLoggingHook is configured, query information is automatically captured and stored in the 'query_data' folder, which is typically located under the path set in 'hive.hook.proto.base-directory'. These details are saved as protobuf files, and in Hive you can use the ProtobufMessageSerDe to read them. To read this captured data, you can create a table as shown below.

CREATE EXTERNAL TABLE `query_data`(
`eventtype` string COMMENT 'from deserializer',
`hivequeryid` string COMMENT 'from deserializer',
`timestamp` bigint COMMENT 'from deserializer',
`executionmode` string COMMENT 'from deserializer',
`requestuser` string COMMENT 'from deserializer',
`queue` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`operationid` string COMMENT 'from deserializer',
`tableswritten` array<string> COMMENT 'from deserializer',
`tablesread` array<string> COMMENT 'from deserializer',
`otherinfo` map<string,string> COMMENT 'from deserializer')
PARTITIONED BY (
`date` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageSerDe'
WITH SERDEPROPERTIES (
'proto.class'='org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto',
'proto.maptypes'='org.apache.hadoop.hive.ql.hooks.proto.MapFieldEntry')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION
'<query_datalocation>'
TBLPROPERTIES (
'bucketing_version'='2',
'proto.class'='org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto')

After creating the table, execute 'msck repair table query_data sync partitions' to synchronize the partitions, and then you can retrieve and analyze the data using Beeline, for example as shown below.
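A minimal sketch of what that could look like from Beeline; the partition date is an illustrative value, and the columns come from the DDL above:

-- Register the date partitions the hook has already written
MSCK REPAIR TABLE query_data SYNC PARTITIONS;

-- Inspect recent query events
SELECT hivequeryid, requestuser, executionmode, tableswritten
FROM query_data
WHERE `date` = '2023-09-27'
LIMIT 20;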
08-09-2023
05:23 AM
1 Kudo
Hadoop itself does not inherently provide real-time estimation of job completion time out of the box. However, Hadoop does have some features and tools that can help you monitor and estimate the progress and completion time of jobs:

JobTracker/ResourceManager Web UI: Hadoop's JobTracker (in Hadoop 1.x) or ResourceManager Web UI (in Hadoop 2.x and later) provides information about the status and progress of running jobs. While it doesn't give you an exact completion time estimate, it does show the map and reduce progress, the number of tasks completed, and other relevant details that can help you gauge the progress.

MapReduce Counters: Hadoop MapReduce jobs expose counters that provide insight into the progress of the various phases of the job. You can use these counters to estimate how much work has been completed and how much is remaining.

Hadoop Job History Logs: Hadoop maintains detailed logs of job executions. By analyzing these logs, you can gain insights into the historical performance of jobs and potentially use this information to estimate completion times for similar jobs in the future.

Custom Scripting: You can also write custom scripts or applications that monitor the progress of jobs by querying Hadoop's APIs and estimating completion times based on historical data and current progress.

Remember that estimating job completion time in distributed systems like Hadoop can be challenging due to the dynamic nature of the environment and the potential variability in task execution times. These estimates might not always be accurate and can be affected by various factors such as cluster load, data distribution, and hardware performance.
07-20-2023
03:10 AM
We verified the same in a CDP environment, as we are uncertain about the Databricks Spark environment. Since we have a mix of managed and external tables, we extracted the necessary information through HWC (Hive Warehouse Connector).

>>> database=spark.sql("show tables in default").collect()
23/07/20 10:04:45 INFO rule.HWCSwitchRule: Registering Listeners
23/07/20 10:04:47 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
Hive Session ID = e6f70006-0c2e-4237-9a9e-e1d19901af54
>>> desiredColumn="name"
>>> tablenames = []
>>> for row in database:
... cols = spark.table(row.tableName).columns
... listColumns= spark.table(row.tableName).columns
... if desiredColumn in listColumns:
... tablenames.append(row.tableName)
...
>>>
>>> print("\n".join(tablenames))
movies
tv_series_abc
cdp1
tv_series
spark_array_string_example
>>>
07-15-2023
12:42 AM
@Sunanna Validate the job status using the commands below.

hadoop job -status <hadoop_job_id>
yarn application -status <hadoop_application_id>

Depending on the status, validate the logs using the command below; if needed, also review jstacks of the child tasks for a better understanding.

yarn logs -applicationId <applicationId>