Member since
09-16-2021
330
Posts
52
Kudos Received
23
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 238 | 11-10-2024 11:19 PM |
 | 370 | 10-25-2024 05:02 AM |
 | 1940 | 09-10-2024 07:50 AM |
 | 697 | 09-04-2024 05:35 AM |
 | 1552 | 08-28-2024 12:40 AM |
09-29-2023
05:27 AM
To gain a better understanding of the issue, kindly provide HS2 jstacks captured at 30-second intervals until the query completes.
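One way to collect the dumps is a small loop that runs jstack against the HiveServer2 process on a timer. The sketch below is a minimal Python example, not an official tool: the function name `capture_dumps`, the output file naming, and the defaults are assumptions made for illustration. Note that jstack usually has to run as the same OS user that owns the HS2 process.

```python
import subprocess
import time
from pathlib import Path


def capture_dumps(command, out_dir, interval_s=30, count=10):
    """Run `command` `count` times, `interval_s` seconds apart, saving each output.

    `command` is an argument list such as ["jstack", "-l", "<hs2_pid>"].
    Returns the list of files written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        result = subprocess.run(command, capture_output=True, text=True)
        target = out_path / f"dump_{i:03d}_{int(time.time())}.txt"
        # Keep stderr as well: attach failures are reported there.
        target.write_text(result.stdout + result.stderr)
        written.append(target)
        if i < count - 1:
            time.sleep(interval_s)
    return written
```

For example, `capture_dumps(["jstack", "-l", "<hs2_pid>"], "/tmp/hs2_jstacks", count=20)` would cover roughly ten minutes; pick a count large enough to span the query, or interrupt the loop once the query finishes.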
09-29-2023
05:25 AM
The stack traces for Error 1 and Error 3 are incomplete. To gain a better understanding of the issue, please provide the complete stack traces; sharing the complete appLogs will give a comprehensive view of the situation.
Regarding Error 2, it appears that the job is attempting to create over 2000 dynamic partitions on a single node, which is unusual. Please review the partition column values for correctness. If everything appears to be in order, you can consider adjusting the following configurations:
hive.exec.max.dynamic.partitions
hive.exec.max.dynamic.partitions.pernode
09-29-2023
05:17 AM
It appears that the Hive Metastore (HMS) is unable to establish a connection with the backend database, possibly due to an incorrect hostname or backend database configuration within the Hive service. Please validate the backend database configuration and attempt to start the service again.
Exception in thread "main" java.lang.RuntimeException: org.postgresql.util.PSQLException: The connection attempt failed.
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:203)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.printTableCount(HiveMetastoreDbUtil.java:284)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.main(HiveMetastoreDbUtil.java:354)
Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:297)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:217)
at org.postgresql.Driver.makeConnection(Driver.java:458)
at org.postgresql.Driver.connect(Driver.java:260)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at com.cloudera.enterprise.dbutil.SqlRunner.open(SqlRunner.java:193)
at com.cloudera.enterprise.dbutil.SqlRunner.getDatabaseName(SqlRunner.java:264)
at com.cloudera.cmf.service.hive.HiveMetastoreDbUtil.countTables(HiveMetastoreDbUtil.java:197)
... 2 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.postgresql.core.PGStream.<init>(PGStream.java:81)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:93)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:197)
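Because the root cause in the trace is a socket connect timeout, a quick reachability test from the HMS host can help separate network or hostname problems from database-level ones. The following is a minimal sketch using only the Python standard library; the function name `can_connect` and the default timeout are assumptions for illustration, and the host and port must come from your own Hive configuration.

```python
import socket


def can_connect(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS lookup failures.
        return False
```

For example, `can_connect("<db_host>", 5432)` checks the default PostgreSQL port. A False result points at the hostname, firewall, or listener rather than at credentials or the schema.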
09-27-2023
07:43 AM
If the partition data is laid out as follows:
<s3:bucket>/<some_location>/<part_column>=<part_value>/<filename>
you can create an external table that points at this location and run 'msck repair table <table_name> sync partitions' to sync the partitions. Validate the data by running a few sample SELECT statements. Once that is done, you can create a new external table on the other bucket and run an INSERT statement with dynamic partitioning. Ref - https://cwiki.apache.org/confluence/display/hive/dynamicpartitions
09-27-2023
07:12 AM
1 Kudo
In CDP, when the HiveProtoLoggingHook is configured, query information is automatically captured and stored in the 'query_data' folder, which is typically located under the directory set by 'hive.hook.proto.base-directory'. These details are saved as protobuf files, and in Hive you can use the ProtobufMessageSerDe to read them. To read this captured data, you can create a table as shown below.
CREATE EXTERNAL TABLE `query_data`(
`eventtype` string COMMENT 'from deserializer',
`hivequeryid` string COMMENT 'from deserializer',
`timestamp` bigint COMMENT 'from deserializer',
`executionmode` string COMMENT 'from deserializer',
`requestuser` string COMMENT 'from deserializer',
`queue` string COMMENT 'from deserializer',
`user` string COMMENT 'from deserializer',
`operationid` string COMMENT 'from deserializer',
`tableswritten` array<string> COMMENT 'from deserializer',
`tablesread` array<string> COMMENT 'from deserializer',
`otherinfo` map<string,string> COMMENT 'from deserializer')
PARTITIONED BY (
`date` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageSerDe'
WITH SERDEPROPERTIES (
'proto.class'='org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto',
'proto.maptypes'='org.apache.hadoop.hive.ql.hooks.proto.MapFieldEntry')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION
'<query_datalocation>'
TBLPROPERTIES (
'bucketing_version'='2',
'proto.class'='org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto')
After creating the table, execute 'msck repair table query_data sync partitions' to synchronize the partitions. You can then retrieve and analyze the data using Beeline.
08-09-2023
05:23 AM
1 Kudo
Hadoop itself does not provide real-time estimation of job completion time out of the box. However, it does have some features and tools that can help you monitor progress and estimate the completion time of jobs:

JobTracker/ResourceManager Web UI: Hadoop's JobTracker (in Hadoop 1.x) or ResourceManager Web UI (in Hadoop 2.x and later) provides information about the status and progress of running jobs. While it doesn't give you an exact completion time estimate, it does show the map and reduce progress, the number of tasks completed, and other relevant details that can help you gauge the progress.

MapReduce Counters: Hadoop MapReduce jobs expose counters that provide insight into the progress of various phases of the job. You can use these counters to estimate how much work has been completed and how much is remaining.

Hadoop Job History Logs: Hadoop maintains detailed logs of job executions. By analyzing these logs, you can gain insights into the historical performance of jobs and potentially use this information to estimate completion times for similar jobs in the future.

Custom Scripting: You can also write custom scripts or applications that monitor the progress of jobs by querying Hadoop's APIs and estimating completion times based on historical data and current progress.

Remember that estimating job completion time in distributed systems like Hadoop can be challenging due to the dynamic nature of the environment and the variability in task execution times. These estimates might not always be accurate and can be affected by factors such as cluster load, data distribution, and hardware performance.
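As a rough illustration of the custom scripting approach, the simplest estimate is a linear extrapolation from the elapsed time and the reported progress. The sketch below is an assumption-laden example, not a Hadoop API: the function name `estimate_remaining_seconds` is hypothetical, and the inputs are whatever elapsed time and progress fraction you read from the ResourceManager UI or your own monitoring.

```python
def estimate_remaining_seconds(elapsed_s, progress):
    """Linearly extrapolate remaining time from elapsed time and progress (0..1).

    Returns None when no progress has been reported yet.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    if progress == 0.0:
        return None
    # If `progress` of the work took `elapsed_s`, the whole job takes elapsed_s / progress.
    total_s = elapsed_s / progress
    return total_s - elapsed_s
```

This assumes the job proceeds at a constant rate, which is often not true (for example, reducers may start late or a few tasks may straggle), so treat the result as a coarse guide only.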
07-20-2023
03:10 AM
We verified this in a CDP environment, as we are uncertain about the Databricks Spark environment. Because we have a mix of managed and external tables, we extracted the necessary information through HWC.
>>> database=spark.sql("show tables in default").collect()
23/07/20 10:04:45 INFO rule.HWCSwitchRule: Registering Listeners
23/07/20 10:04:47 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
Hive Session ID = e6f70006-0c2e-4237-9a9e-e1d19901af54
>>> desiredColumn="name"
>>> tablenames = []
>>> for row in database:
...     listColumns = spark.table(row.tableName).columns
...     if desiredColumn in listColumns:
...         tablenames.append(row.tableName)
...
>>>
>>> print("\n".join(tablenames))
movies
tv_series_abc
cdp1
tv_series
spark_array_string_example
>>>
07-15-2023
12:42 AM
@Sunanna Validate the job status using the commands below:
hadoop job -status <hadoop_job_id>
yarn application -status <hadoop_application_id>
Depending on the status, review the logs using the command below. If needed, also check the jstack output of the child tasks for a better understanding.
yarn logs -applicationId <applicationId>
07-14-2023
02:53 AM
Spark's JDBC reader can read data in parallel by splitting it into several partitions. Four options control this:

partitionColumn is the name of the column used for partitioning. The column must be of numeric (integer or decimal), date, or timestamp type. If partitionColumn is not specified, Spark uses a single executor and creates one non-empty partition, so the read is not distributed or parallelized.

numPartitions is the maximum number of partitions that can be used for parallel table reads and writes.

lowerBound and upperBound define the partition width. Together with numPartitions, they determine which range of partition column values falls into each partition.

partitionColumn, lowerBound, and upperBound must be specified together. For example:

# "id" is a placeholder; use a numeric, date, or timestamp column from your table.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:postgres") \
    .option("dbtable", "db.table") \
    .option("user", "user") \
    .option("password", "pass") \
    .option("partitionColumn", "id") \
    .option("numPartitions", "10") \
    .option("lowerBound", "100") \
    .option("upperBound", "1100") \
    .load()

Spark uses the lower bound, upper bound, and number of partitions to generate WHERE clauses. For example, if the lower bound is 100, the upper bound is 1,100, and the number of partitions is 10, then the range of column values covered by each task, called the stride in the reference documentation, is:

(upper bound - lower bound) / number of partitions = (1100 - 100) / 10 = 100

The series of filters applied to the tasks is:

where partitionColumn < 200 or partitionColumn is null
where partitionColumn >= 200 and partitionColumn < 300
where partitionColumn >= 300 and partitionColumn < 400
...
where partitionColumn >= 1000

The lowerBound and upperBound define partitioning boundaries, but they DO NOT filter rows of the table. Spark therefore partitions and returns ALL the rows of the table; all data is read whether partitioning is used or not.

For example, suppose the partitionColumn values range over [0, 10000] and we set numPartitions=10, lowerBound=4000 and upperBound=5000. The first and last partitions will then contain all the data below the lower boundary and above the upper boundary, respectively, so they will be far larger than the others.

As another example, suppose the partitionColumn values range over [2000, 4000] and we set numPartitions=10, lowerBound=0 and upperBound=10000. In this case only about 2 of the 10 queries (one per partition) do all the work, which is not ideal. In this scenario, the best configuration would be numPartitions=10, lowerBound=2000, upperBound=4000.
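To see how the bounds translate into per-partition filters, the following pure-Python sketch reproduces the basic stride logic for an integer column. It is an illustration under stated assumptions, not Spark's own code: the function name `jdbc_partition_predicates` is hypothetical, it only handles non-negative integer bounds, and it leaves out the date/timestamp handling and rounding details of the real implementation. In this sketch the first partition also picks up NULL values and the last partition is open-ended, which is why rows outside the bounds are still read.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one WHERE predicate per partition for an integer partition column."""
    if num_partitions < 2:
        return [None]  # a single partition reads the whole table unfiltered
    if lower_bound < 0 or upper_bound - lower_bound < num_partitions:
        raise ValueError("need 0 <= lower_bound and a range of at least num_partitions")
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit; the last has no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for predicate in jdbc_partition_predicates("id", 100, 1100, 10):
    print(predicate)
```

Printing the predicates for your own bounds before running the job is a cheap way to spot skew, such as most of the data landing in the first or last partition.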
07-14-2023
02:39 AM
If my understanding is correct, the schema changes between input files, which implies that the data itself lacks a fixed schema. Given the frequent schema changes, it is advisable to store the data in a column-oriented store such as HBase. The same HBase data can then be accessed from Spark using the HBase-Spark connector. Ref - https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/accessing-hbase/topics/hbase-example-using-hbase-spark-connector.html