Member since: 09-16-2021
Posts: 421
Kudos Received: 55
Solutions: 39

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 234 | 10-22-2025 05:48 AM |
|  | 312 | 09-05-2025 07:19 AM |
|  | 795 | 07-15-2025 02:22 AM |
|  | 1347 | 06-02-2025 06:55 AM |
|  | 1627 | 05-22-2025 03:00 AM |
09-30-2023
11:39 AM
To pinpoint the root cause, kindly provide a few samples of data
09-30-2023
11:36 AM
It seems that the failure occurred within the Tez job's child tasks. To identify the root cause, please share the YARN logs of the failed application
09-29-2023
05:27 AM
To gain a better understanding of the issue, kindly provide the HS2 jstacks at 30-second intervals until the query completes
09-29-2023
05:25 AM
The stack traces for Error 1 and Error 3 are incomplete. To gain a better understanding of the issue, please provide the complete stack traces; sharing the complete appLogs will give a comprehensive view of the situation.

Regarding Error 2, the job appears to be attempting to create over 2000 dynamic partitions on a single node, which is unusual. Please review the partition column values for correctness. If everything appears to be in order, you can consider adjusting the following configurations, for example as shown below:

hive.exec.max.dynamic.partitions
hive.exec.max.dynamic.partitions.pernode
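A minimal sketch of how those limits could be raised at the session level before re-running the insert; the table names and values here are illustrative only, not recommendations:

-- Enable dynamic partitioning and raise the limits for this session only;
-- choose values slightly above the number of partitions the job actually creates.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=5000;         -- total across the whole job
SET hive.exec.max.dynamic.partitions.pernode=3000; -- limit per node

-- Hypothetical insert that triggers the dynamic partition creation
INSERT OVERWRITE TABLE sales PARTITION (sale_date)
SELECT id, amount, sale_date FROM sales_staging;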
09-29-2023
05:17 AM
It appears that the Hive Metastore (HMS) is unable to establish a connection with the BackendDB, possibly due to an incorrect hostname or BackendDB configuration within the Hive service. Please validate the BackendDB configurations and attempt to start the service again.

Exception in thread "main" java.lang.RuntimeException: org.postgresql.util.PSQLException: The connection attempt failed.
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:203)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.printTableCount(HiveMetastoreDbUtil.java:284)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.main(HiveMetastoreDbUtil.java:354)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:297)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:217)
at org.postgresql.Driver.makeConnection(Driver.java:458)
at org.postgresql.Driver.connect(Driver.java:260)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at com.cloudera.enterprise.dbutil.SqlRunner.open(SqlRunner.java:193)
at com.cloudera.enterprise.dbutil.SqlRunner.getDatabaseName(SqlRunner.java:264)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:197)
... 2 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.postgresql.core.PGStream.<init>(PGStream.java:81)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:93)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:197)
09-27-2023
07:43 AM
If the partition data exists in a layout like the one below:

<s3:bucket>/<some_location>/<part_column>=<part_value>/<filename>

you can create an external table specifying the above location and run 'msck repair table <table_name> sync partitions' to sync the partitions. Validate the data by running some sample SELECT statements. Once that is done, you can create a new external table on the other bucket and run an INSERT statement with dynamic partitioning, for example as sketched below. Ref - https://cwiki.apache.org/confluence/display/hive/dynamicpartitions
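A minimal sketch of that flow, assuming hypothetical table names, columns, bucket names, and a Parquet file format (adjust all of these to your actual data):

-- 1. External table over the existing partitioned data in the source bucket
CREATE EXTERNAL TABLE sales_src (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET
LOCATION 's3a://source-bucket/some_location/';

-- Register the partition directories already present under that location
MSCK REPAIR TABLE sales_src SYNC PARTITIONS;

-- Validate with a sample query
SELECT * FROM sales_src LIMIT 10;

-- 2. New external table on the other bucket, loaded via dynamic partitioning
CREATE EXTERNAL TABLE sales_tgt (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)
STORED AS PARQUET
LOCATION 's3a://target-bucket/sales/';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE sales_tgt PARTITION (sale_date)
SELECT id, amount, sale_date FROM sales_src;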
09-27-2023
07:12 AM
1 Kudo
In CDP, when the HiveProtoLoggingHook is configured, query information is automatically captured and stored in the 'query_data' folder, which is typically located under the path set in 'hive.hook.proto.base-directory'. These details are saved as protobuf files, and in Hive you can use the ProtobufMessageSerDe to read them. To read this captured data, you can create a table as shown below.

CREATE EXTERNAL TABLE `query_data`(
`eventtype` string COMMENT 'from deserializer',
`hivequeryid` string COMMENT 'from deserializer',
`timestamp` bigint COMMENT 'from deserializer',
`executionmode` string COMMENT 'from deserializer',
`requestuser` string COMMENT 'from deserializer',
`queue` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`operationid` string COMMENT 'from deserializer',
`tableswritten` array<string> COMMENT 'from deserializer',
`tablesread` array<string> COMMENT 'from deserializer',
`otherinfo` map<string,string> COMMENT 'from deserializer')
PARTITIONED BY (
`date` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageSerDe'
WITH SERDEPROPERTIES (
'proto.class'='org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto',
'proto.maptypes'='org.apache.hadoop.hive.ql.hooks.proto.MapFieldEntry')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION
'<query_datalocation>'
TBLPROPERTIES (
'bucketing_version'='2',
'proto.class'='org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto')

After creating the table, execute 'msck repair table query_data sync partitions' to synchronize the partitions, and then you can retrieve and analyze the data using Beeline, for example as shown below.
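A minimal sketch of what that could look like from Beeline; the partition date is an illustrative value, and the columns come from the DDL above:

-- Register the date partitions the hook has already written
MSCK REPAIR TABLE query_data SYNC PARTITIONS;

-- Inspect recent query events
SELECT hivequeryid, requestuser, executionmode, tableswritten
FROM query_data
WHERE `date` = '2023-09-27'
LIMIT 20;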
08-09-2023
05:23 AM
1 Kudo
Hadoop itself does not inherently provide real-time estimation of job completion time out of the box. However, Hadoop does have some features and tools that can help you monitor and estimate the progress and completion time of jobs:

JobTracker/ResourceManager Web UI: Hadoop's JobTracker (in Hadoop 1.x) or ResourceManager Web UI (in Hadoop 2.x and later) provides information about the status and progress of running jobs. While it doesn't give you an exact completion time estimate, it does show the map and reduce progress, the number of tasks completed, and other relevant details that can help you gauge the progress.

MapReduce Counters: Hadoop MapReduce jobs expose counters that provide insight into the progress of the various phases of the job. You can use these counters to estimate how much work has been completed and how much is remaining.

Hadoop Job History Logs: Hadoop maintains detailed logs of job executions. By analyzing these logs, you can gain insights into the historical performance of jobs and potentially use this information to estimate completion times for similar jobs in the future.

Custom Scripting: You can also write custom scripts or applications that monitor the progress of jobs by querying Hadoop's APIs and estimating completion times based on historical data and current progress.

Remember that estimating job completion time in distributed systems like Hadoop can be challenging due to the dynamic nature of the environment and the potential variability in task execution times. These estimates might not always be accurate and can be affected by various factors such as cluster load, data distribution, and hardware performance.
07-20-2023
03:10 AM
We verified the same in a CDP environment, as we are uncertain about the Databricks Spark environment. Since we have a mix of managed and external tables, we extracted the necessary information through HWC (Hive Warehouse Connector).

>>> database=spark.sql("show tables in default").collect()
23/07/20 10:04:45 INFO rule.HWCSwitchRule: Registering Listeners
23/07/20 10:04:47 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
Hive Session ID = e6f70006-0c2e-4237-9a9e-e1d19901af54
>>> desiredColumn="name"
>>> tablenames = []
>>> for row in database:
... cols = spark.table(row.tableName).columns
... listColumns= spark.table(row.tableName).columns
... if desiredColumn in listColumns:
... tablenames.append(row.tableName)
...
>>>
>>> print("\n".join(tablenames))
movies
tv_series_abc
cdp1
tv_series
spark_array_string_example
>>>
07-15-2023
12:42 AM
@Sunanna Validate the job status using the commands below.

hadoop job -status <hadoop_job_id>
yarn application -status <hadoop_application_id>

Depending on the status, validate the logs using the command below; if needed, also review jstacks of the child tasks for a better understanding.

yarn logs -applicationId <applicationId>