About ggangadharan

ggangadharan · ‎09-30-2023

It seems that the failure occurred within the Tez job's child tasks. To identify the root cause, please share the YARN logs of the failed application.

ggangadharan · ‎09-29-2023

To gain a better understanding of the issue, kindly provide HS2 (HiveServer2) jstacks captured at 30-second intervals until the query completes.

ggangadharan · ‎09-29-2023

The stack traces for Error 1 and Error 3 are incomplete. To gain a better understanding of the issue, please provide the complete stack traces. Sharing the complete application logs (appLogs) will give a comprehensive view of the situation.

Regarding Error 2, it appears that the job is attempting to create more than 2000 dynamic partitions on a single node, which is unusual. Please review the partition column values for correctness. If everything appears to be in order, you can consider raising the following configurations:

hive.exec.max.dynamic.partitions
hive.exec.max.dynamic.partitions.pernode
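One way to review the partition column values before touching the limits is to count the distinct partition-value combinations the insert would produce. The snippet below is a minimal sketch of that check. The sample rows, the column names (`country`, `dt`) and the limit of 2000 are hypothetical and only illustrate the idea; in practice the values would come from the source table.

```python
from collections import Counter

def count_dynamic_partitions(rows, partition_cols):
    """Map each distinct partition-value tuple to the number of rows it holds."""
    return Counter(tuple(row[c] for c in partition_cols) for row in rows)

# Hypothetical sample rows standing in for the source data.
rows = [
    {"id": 1, "country": "US", "dt": "2023-09-29"},
    {"id": 2, "country": "US", "dt": "2023-09-29"},
    # A timestamp leaking into a date column creates an unexpected extra partition.
    {"id": 3, "country": "IN", "dt": "2023-09-29 10:15:00"},
]

partitions = count_dynamic_partitions(rows, ["country", "dt"])
per_node_limit = 2000  # assumed value of hive.exec.max.dynamic.partitions.pernode

print(len(partitions), "distinct partitions; over limit:", len(partitions) > per_node_limit)
```

A partition count far above what the data model suggests usually points to a wrong column being used as the partition key, or to unnormalized values such as full timestamps.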

ggangadharan · ‎09-29-2023

It appears that the Hive Metastore (HMS) is unable to establish a connection with the backend database, possibly due to an incorrect hostname or backend database configuration within the Hive service. Please validate the backend database configuration and attempt to start the service again.

Exception in thread "main" java.lang.RuntimeException: org.postgresql.util.PSQLException: The connection attempt failed.
    at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:203)
    at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.printTableCount(HiveMetastoreDbUtil.java:284)
    at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.main(HiveMetastoreDbUtil.java:354)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:297)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
    at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:217)
    at org.postgresql.Driver.makeConnection(Driver.java:458)
    at org.postgresql.Driver.connect(Driver.java:260)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:247)
    at com.cloudera.enterprise.dbutil.SqlRunner.open(SqlRunner.java:193)
    at com.cloudera.enterprise.dbutil.SqlRunner.getDatabaseName(SqlRunner.java:264)
    at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:197)
    ... 2 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at org.postgresql.core.PGStream.<init>(PGStream.java:81)
    at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:93)
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:197)
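The innermost cause is a SocketTimeoutException with "connect timed out", which means the TCP connection to the database was never established. That points at the hostname, port or a firewall rather than at credentials. A quick reachability check from the host where the command ran can confirm this. The snippet below is a minimal sketch; the host `localhost` and port 5432 (the PostgreSQL default) are placeholders to replace with the values configured for the Hive service.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

if __name__ == "__main__":
    # Placeholder values: use the metastore database host and port from the Hive configuration.
    print(can_connect("localhost", 5432))
```

If this returns False from the HMS host but True from the database host itself, the problem is most likely network reachability or the database's listen address, not the Hive configuration.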

ggangadharan · ‎09-27-2023

If the partition data exists in a layout like the one below:

<s3:bucket>/<some_location>/<part_column>=<part_value>/<filename>

you can create an external table by specifying the above location and run 'msck repair table <table_name> sync partitions' to sync the partitions. Validate the data by running some sample select statements. Once that is done, you can create a new external table on another bucket and run an insert statement with dynamic partitioning.

Ref - https://cwiki.apache.org/confluence/display/hive/dynamicpartitions
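MSCK REPAIR only picks up directories that follow the <part_column>=<part_value> naming convention, so it can help to check a few object keys before creating the table. The snippet below is a minimal sketch of such a check. The function name and the sample keys are hypothetical, and it works on plain strings rather than calling any S3 API.

```python
def parse_partition_spec(key, table_root):
    """Extract {column: value} pairs from a Hive-style partitioned object key.

    Returns None if the key is not under table_root, has no partition
    directories, or any directory level is not of the form 'column=value'.
    """
    root = table_root.rstrip("/") + "/"
    if not key.startswith(root):
        return None
    # Everything between the table root and the file name is a partition directory.
    *dirs, _filename = key[len(root):].split("/")
    spec = {}
    for directory in dirs:
        column, sep, value = directory.partition("=")
        if not sep or not column or not value:
            return None
        spec[column] = value
    return spec or None

# Hypothetical keys for illustration.
print(parse_partition_spec("warehouse/sales/dt=2023-09-27/part-0000.parquet", "warehouse/sales"))
print(parse_partition_spec("warehouse/sales/2023-09-27/part-0000.parquet", "warehouse/sales"))
```

Keys for which this returns None would not be discovered as partitions and would need to be renamed or added with explicit ALTER TABLE ... ADD PARTITION statements.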

ggangadharan · ‎09-27-2023

In CDP, when the HiveProtoLoggingHook is configured, query information is automatically captured and stored in the 'query_data' folder, which is typically located under the directory set by 'hive.hook.proto.base-directory'. These details are saved as protobuf files, and in Hive you can use the ProtobufMessageSerDe to read them. To read this captured data, you can create a table as shown below.

CREATE EXTERNAL TABLE `query_data`(
  `eventtype` string COMMENT 'from deserializer',
  `hivequeryid` string COMMENT 'from deserializer',
  `timestamp` bigint COMMENT 'from deserializer',
  `executionmode` string COMMENT 'from deserializer',
  `requestuser` string COMMENT 'from deserializer',
  `queue` string COMMENT 'from deserializer',
  `user` string COMMENT 'from deserializer',
  `operationid` string COMMENT 'from deserializer',
  `tableswritten` array<string> COMMENT 'from deserializer',
  `tablesread` array<string> COMMENT 'from deserializer',
  `otherinfo` map<string,string> COMMENT 'from deserializer')
PARTITIONED BY (
  `date` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageSerDe'
WITH SERDEPROPERTIES (
  'proto.class'='org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto',
  'proto.maptypes'='org.apache.hadoop.hive.ql.hooks.proto.MapFieldEntry')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION
  '<query_datalocation>'
TBLPROPERTIES (
  'bucketing_version'='2',
  'proto.class'='org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto')

After creating the table, execute 'msck repair table query_data sync partitions' to synchronize the partitions. You can then retrieve and analyze the data using Beeline.

ggangadharan · ‎08-09-2023

Hadoop itself does not provide real-time estimation of job completion time out of the box. However, it does have some features and tools that can help you monitor progress and estimate the completion time of jobs:

JobTracker/ResourceManager Web UI: Hadoop's JobTracker (in Hadoop 1.x) or ResourceManager Web UI (in Hadoop 2.x and later) provides information about the status and progress of running jobs. While it doesn't give you an exact completion time estimate, it does show the map and reduce progress, the number of tasks completed, and other relevant details that can help you gauge progress.

MapReduce Counters: Hadoop MapReduce jobs expose counters that provide insight into the progress of the various phases of the job. You can use these counters to estimate how much work has been completed and how much remains.

Hadoop Job History Logs: Hadoop maintains detailed logs of job executions. By analyzing these logs, you can gain insight into the historical performance of jobs and potentially use this information to estimate completion times for similar jobs in the future.

Custom Scripting: You can also write custom scripts or applications that monitor the progress of jobs by querying Hadoop's APIs and estimate completion times based on historical data and current progress.

Remember that estimating job completion time in distributed systems like Hadoop is challenging because of the dynamic nature of the environment and the variability in task execution times. These estimates will not always be accurate and can be affected by factors such as cluster load, data distribution, and hardware performance.
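As an illustration of the custom scripting approach, the simplest estimate is a linear extrapolation from the current progress and the elapsed time. The snippet below is a minimal sketch under that assumption. The function name is hypothetical, and the inputs are assumed to come from something like the `progress` (percent) and `elapsedTime` (milliseconds) fields that the ResourceManager REST API reports for an application.

```python
def estimate_remaining_seconds(progress_percent, elapsed_ms):
    """Linearly extrapolate the remaining run time of a job.

    Assumes work proceeds at a constant rate, which is rarely exact:
    reducers often start late and skewed tasks can dominate the tail.
    Returns None when no progress has been reported yet.
    """
    if not 0 < progress_percent <= 100:
        return None
    estimated_total_ms = elapsed_ms * 100.0 / progress_percent
    return (estimated_total_ms - elapsed_ms) / 1000.0

# A job that is 25% done after 60 seconds is projected to need 180 more seconds.
print(estimate_remaining_seconds(25.0, 60_000))
```

Combining this with run times of earlier executions of the same job, for example from the job history logs, generally gives a more stable estimate than the linear projection alone.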

ggangadharan · ‎07-20-2023

We verified this in a CDP environment, since we are uncertain about the Databricks Spark environment. Because we have a mix of managed and external tables, we extracted the necessary information through HWC (Hive Warehouse Connector).

>>> database=spark.sql("show tables in default").collect()
23/07/20 10:04:45 INFO rule.HWCSwitchRule: Registering Listeners
23/07/20 10:04:47 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
Hive Session ID = e6f70006-0c2e-4237-9a9e-e1d19901af54
>>> desiredColumn="name"
>>> tablenames = []
>>> for row in database:
...     listColumns= spark.table(row.tableName).columns
...     if desiredColumn in listColumns:
...         tablenames.append(row.tableName)
...
>>>
>>> print("\n".join(tablenames))
movies
tv_series_abc
cdp1
tv_series
spark_array_string_example
>>>

ggangadharan · ‎07-15-2023

@Sunanna Validate the job status using the commands below.

hadoop job -status <hadoop_job_id>
yarn application -status <hadoop_application_id>

Depending on the status, review the logs using the command below. If needed, also review the jstacks of the child tasks for a better understanding.

yarn logs -applicationId <applicationId>

ggangadharan · ‎07-14-2023

The Spark JDBC reader is capable of reading data in parallel by splitting it into several partitions. There are four options that control this.

partitionColumn is the name of the column used for partitioning. An important condition is that the column must be of numeric (integer or decimal), date or timestamp type. If the partitionColumn parameter is not specified, Spark will use a single executor and create one non-empty partition. Reading data will not be distributed or parallelized.

numPartitions is the maximum number of partitions that can be used for simultaneous table reading and writing.

lowerBound and upperBound are used to define the partition width. These boundaries determine which range of partition column values falls within a single partition.

For example (partitionColumn, lowerBound and upperBound must be specified together, along with numPartitions; replace <partition_column> with the actual column name):

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:postgres") \
    .option("dbtable", "db.table") \
    .option("user", "user") \
    .option("password", "pass") \
    .option("partitionColumn", "<partition_column>") \
    .option("numPartitions", "10") \
    .option("lowerBound", "100") \
    .option("upperBound", "1100") \
    .load()

This method uses the upper and lower bounds and the number of partitions to create where clauses. For example, if the lower bound is set to 100, the upper bound is 1,100, and the number of partitions is 10, then the width of the value range read by each task, called the stride in the reference documentation, will be:

(upper bound - lower bound) / number of partitions
(1100 - 100) / 10 = 100

And the series of filters applied to the tasks will be:

where partitionColumn < 200 or partitionColumn is null
where partitionColumn >= 200 and partitionColumn < 300
where partitionColumn >= 300 and partitionColumn < 400
...
where partitionColumn >= 1000

The lowerBound and upperBound define partitioning boundaries, but they DO NOT participate in filtering rows of the table. Therefore, Spark partitions and returns ALL the rows of the table. It is important to note that all data will be read whether partitioning is used or not.

For example, suppose the partitionColumn data range is [0, 10000] and we set numPartitions=10, lowerBound=4000 and upperBound=5000. The first and last partitions will then contain all the data outside of the corresponding lower or upper boundary, so those two tasks carry most of the rows.

As another example, suppose the partitionColumn data range is [2000, 4000] and we set numPartitions=10, lowerBound=0 and upperBound=10000. In this case, only 2 of the 10 queries (one for each partition) will do all the work, which is not ideal. In this scenario, the best configuration would be numPartitions=10, lowerBound=2000, upperBound=4000.
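To see which filter each task would receive for a given set of bounds, the predicate generation can be reproduced in a few lines of plain Python. The snippet below is a simplified sketch of the approach Spark takes for a numeric partition column, not Spark's actual code: the function name is hypothetical, and Spark's own stride arithmetic may differ slightly between versions. The first partition is open at the bottom and also collects NULLs, and the last partition is open at the top, which is why rows outside the bounds are still read.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause predicate per partition for a numeric column."""
    if num_partitions < 2 or lower_bound >= upper_bound:
        return []  # a single unfiltered read, nothing to split

    # Width of the value range covered by each partition.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if lower is None:
            # First partition: unbounded below, and it also picks up NULLs.
            predicates.append(f"{upper} or {column} is null")
        elif upper is None:
            # Last partition: unbounded above.
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates

for predicate in jdbc_partition_predicates("partitionColumn", 100, 1100, 10):
    print("where", predicate)
```

Running the same function with lowerBound=4000 and upperBound=5000 shows the skew described above: the first predicate covers everything below 4100 and the last covers everything from 4900 upward.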

Last Visited	‎01-08-2025 12:21 AM

Member Since	‎09-16-2021 02:45 AM
Posts	337
Kudos received	53

Cloudera Community

Re: Hive Job - OutOfMemoryError: Java heap space

Re: Insert into table test values('a', 'b'); not w...

Re: how to drop partition table using date_add fun...

Re: Issue with Hive HQL insert query - KryoExcepti...

Re: Error when do an alter table change column on ...

Re: Hortonworks Hive 35 Error

Re: Is it possible to disable column level securit...

Re: CDP - Facing issues with Hive

Re: Create Hive Metastore database tables has fail...

Re: How to load existing partitoned parquet data i...

Re: Hive on tez cannot execute custom hook program...

Re: Is Cloudera have estimation time for jobs comp...

Re: Databricks Error Inquiry: org.apache.spark.Spa...

Re: mapreduce wordcount program got stuck

Re: reading data from oracle in parallel