Member since: 09-16-2021
Posts: 421
Kudos Received: 55
Solutions: 39
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 453 | 10-22-2025 05:48 AM |
| | 442 | 09-05-2025 07:19 AM |
| | 1027 | 07-15-2025 02:22 AM |
| | 1621 | 06-02-2025 06:55 AM |
| | 1779 | 05-22-2025 03:00 AM |
06-16-2023
12:04 AM
Check the possibility of using a Hive managed table. With Hive managed tables you won't require a separate merge job, as Hive compaction takes care of this by default when compaction is enabled. You can access managed tables from Spark through HWC (Hive Warehouse Connector).
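As a hedged illustration of the point above (the table name and columns are hypothetical, not from the original thread), a managed ACID table only needs `transactional=true`; the compactor then merges delta files automatically, and a compaction can also be requested manually:

```sql
-- hypothetical table; ORC + transactional=true makes it a managed ACID table
CREATE TABLE events (id INT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- compaction normally runs automatically in the background;
-- it can also be triggered on demand
ALTER TABLE events COMPACT 'major';

-- monitor compaction progress
SHOW COMPACTIONS;
```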
06-15-2023
11:47 PM
@Abdul_ As of now, Hive does not support a row delimiter other than the newline character. See the corresponding Jira for reference: HIVE-11996. As a workaround, I recommend fixing the input file with an external tool such as awk and then uploading the corrected file to the corresponding filesystem location before reading it. E.g., through awk:
[root@c2757-node2 ~]# awk -F "\",\"" 'NF < 3 {getline nextline; $0 = $0 nextline} 1' sample_case.txt
"IM43163","SOUTH,OFC","10-Jan-23"
"IM41763","John:comment added","12-Jan-23"
[root@c2757-node2 ~]# awk -F "\",\"" 'NF < 3 {getline nextline; $0 = $0 nextline} 1' sample_case.txt > sample_text.csv
Reading from the Hive table:
0: jdbc:hive2://c2757-node2.coelab.cloudera.c> select * from table1;
.
.
.
INFO : Executing command(queryId=hive_20230616064136_333ff98d-636b-43b1-898d-fca66031fe7f): select * from table1
INFO : Completed executing command(queryId=hive_20230616064136_333ff98d-636b-43b1-898d-fca66031fe7f); Time taken: 0.023 seconds
INFO : OK
+---------------+---------------------+---------------+
| table1.col_1 | table1.col_2 | table1.col_3 |
+---------------+---------------------+---------------+
| IM43163 | SOUTH,OFC | 10-Jan-23 |
| IM41763 | John:comment added | 12-Jan-23 |
+---------------+---------------------+---------------+
2 rows selected (1.864 seconds)
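The awk one-liner above can also be sketched in plain Python, which may be easier to adapt. This is a hypothetical equivalent, not from the original post: it accumulates physical lines until a record contains the expected number of quoted fields (here assumed to be 3, matching the sample file), then emits it as one logical row.

```python
EXPECTED_FIELDS = 3  # assumption: the sample file has 3 quoted columns

def merge_broken_rows(lines, expected_fields=EXPECTED_FIELDS):
    """Join physical lines into logical CSV records.

    A complete record contains expected_fields quoted values, i.e. at
    least expected_fields - 1 occurrences of the '","' separator.
    """
    merged, buffer = [], ""
    for line in lines:
        buffer += line.rstrip("\n")
        if buffer.count('","') >= expected_fields - 1:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # flush any trailing partial record
    return merged

# hypothetical broken input: the first record is split across two lines
sample = [
    '"IM43163","SOUTH,OFC",\n',
    '"10-Jan-23"\n',
    '"IM41763","John:comment added","12-Jan-23"\n',
]
for row in merge_broken_rows(sample):
    print(row)
```

Unlike the awk version, which pulls in exactly one extra line per short record, this sketch keeps appending lines until the field count is satisfied, so it also handles records split across more than two lines.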
06-15-2023
04:26 AM
Thank you for your advice. We will investigate the proposed solution with spark-xml. Best regards
06-13-2023
12:12 AM
Please share the output of the commands below to identify the exact output record details:
explain formatted <query>
explain extended <query>
explain analyze <query>
06-09-2023
11:59 AM
@Josh2023 Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
06-06-2023
09:37 PM
Once the data has been read from the database, you don't need to write the same data to a file (i.e. CSV). Instead, you can write it directly into a Hive table using the DataFrame API, and once the data has been loaded you can query it from Hive:
df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")
Ref - https://spark.apache.org/docs/2.4.7/sql-data-sources-hive-tables.html
Sample code snippet:
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://<server name>:5432/<DBNAME>") \
.option("dbtable", "\"<SourceTableName>\"") \
.option("user", "<Username>") \
.option("password", "<Password>") \
.option("driver", "org.postgresql.Driver") \
.load()
df.write.mode('overwrite').saveAsTable("<TargetTableName>")
From Hive:
INFO : Compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10); Time taken: 0.591 seconds
INFO : Executing command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
.
.
.
+------+
| _c0 |
+------+
| 122 |
+------+
04-20-2023
10:47 PM
It's working as expected. Please find the code snippet below: >>> columns = ["language","users_count"]
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data).toDF(*columns)
>>> df.write.csv("/tmp/test")
>>> df2=spark.read.csv("/tmp/test/*.csv")
>>> df2.show()
+------+------+
| _c0| _c1|
+------+------+
|Python|100000|
| Scala| 3000|
| Java| 20000|
+------+------+
04-20-2023
05:31 AM
From the error, we can see that the query failed in the MoveTask. Since the LOAD statement targets a partitioned table, the MoveTask may be loading the partitions as well. Along with the HS2 logs, the HMS logs for the corresponding time period will give a better idea of the root cause of the failure. If it's just a timeout issue, increase the client socket timeout value.
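For the timeout case, the relevant knob is typically the metastore client socket timeout. A minimal sketch of the client-side setting, assuming a standard hive-site.xml layout (the value shown is illustrative, not a recommendation):

```xml
<!-- hive-site.xml (client side); the default is commonly 600s -->
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800s</value>
</property>
```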
10-13-2022
02:46 AM
@Sunil1359 Compilation time might be higher if the table has a large number of partitions or if the HMS process is slow when the query runs. Please check the following for the corresponding time period to find the root cause:
- HS2 log
- HMS log
- HMS jstack
With the Tez engine, queries run in the form of a DAG. In the compilation phase, once semantic analysis is complete, a plan is generated based on the query you submitted; explain <your query> shows that plan. Once the plan is generated, the DAG is submitted to YARN and runs according to the plan. As part of the DAG, split generation, input file reads, shuffle fetches, etc. are taken care of, and the end result is returned to the client.