Member since: 09-16-2021
Posts: 421
Kudos Received: 55
Solutions: 39
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 453 | 10-22-2025 05:48 AM |
| | 442 | 09-05-2025 07:19 AM |
| | 1027 | 07-15-2025 02:22 AM |
| | 1621 | 06-02-2025 06:55 AM |
| | 1779 | 05-22-2025 03:00 AM |
06-16-2023
12:04 AM
Check the possibility of using a Hive managed table. With Hive managed tables you won't require a separate merge job, as Hive compaction takes care of this by default when compaction is enabled. You can access managed tables from Spark through HWC (Hive Warehouse Connector).
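As a hedged illustration of the point above (the table name and columns are hypothetical, not from the original thread), a managed ACID table only needs `transactional=true`; the compactor then merges delta files automatically, and a compaction can also be requested manually:

```sql
-- hypothetical table; ORC + transactional=true makes it a managed ACID table
CREATE TABLE events (id INT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- compaction normally runs automatically in the background;
-- it can also be triggered on demand
ALTER TABLE events COMPACT 'major';

-- monitor compaction progress
SHOW COMPACTIONS;
```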
06-15-2023
11:47 PM
@Abdul_ As of now, Hive does not support a row delimiter other than the newline character. See the corresponding Jira for reference: HIVE-11996. As a workaround, I recommend fixing the input file with an external tool such as awk and then uploading the corrected file to the corresponding filesystem location before reading it. E.g., through awk:
[root@c2757-node2 ~]# awk -F "\",\"" 'NF < 3 {getline nextline; $0 = $0 nextline} 1' sample_case.txt
"IM43163","SOUTH,OFC","10-Jan-23"
"IM41763","John:comment added","12-Jan-23"
[root@c2757-node2 ~]# awk -F "\",\"" 'NF < 3 {getline nextline; $0 = $0 nextline} 1' sample_case.txt > sample_text.csv
Reading from the Hive table:
0: jdbc:hive2://c2757-node2.coelab.cloudera.c> select * from table1;
.
.
.
INFO : Executing command(queryId=hive_20230616064136_333ff98d-636b-43b1-898d-fca66031fe7f): select * from table1
INFO : Completed executing command(queryId=hive_20230616064136_333ff98d-636b-43b1-898d-fca66031fe7f); Time taken: 0.023 seconds
INFO : OK
+---------------+---------------------+---------------+
| table1.col_1 | table1.col_2 | table1.col_3 |
+---------------+---------------------+---------------+
| IM43163 | SOUTH,OFC | 10-Jan-23 |
| IM41763 | John:comment added | 12-Jan-23 |
+---------------+---------------------+---------------+
2 rows selected (1.864 seconds)
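The awk one-liner above can also be sketched in plain Python, which may be easier to adapt. This is a hypothetical equivalent, not from the original post: it accumulates physical lines until a record contains the expected number of quoted fields (here assumed to be 3, matching the sample file), then emits it as one logical row.

```python
EXPECTED_FIELDS = 3  # assumption: the sample file has 3 quoted columns

def merge_broken_rows(lines, expected_fields=EXPECTED_FIELDS):
    """Join physical lines into logical CSV records.

    A complete record contains expected_fields quoted values, i.e. at
    least expected_fields - 1 occurrences of the '","' separator.
    """
    merged, buffer = [], ""
    for line in lines:
        buffer += line.rstrip("\n")
        if buffer.count('","') >= expected_fields - 1:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # flush any trailing partial record
    return merged

# hypothetical broken input: the first record is split across two lines
sample = [
    '"IM43163","SOUTH,OFC",\n',
    '"10-Jan-23"\n',
    '"IM41763","John:comment added","12-Jan-23"\n',
]
for row in merge_broken_rows(sample):
    print(row)
```

Unlike the awk version, which pulls in exactly one extra line per short record, this sketch keeps appending lines until the field count is satisfied, so it also handles records split across more than two lines.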
06-15-2023
04:26 AM
Thank you for your advice. We will investigate the proposed solution with spark-xml. Best regards
06-13-2023
12:12 AM
Please share the output of the commands below to identify the exact output record details:
explain formatted <query>
explain extended <query>
explain analyze <query>
06-09-2023
11:59 AM
@Josh2023 Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
06-06-2023
09:37 PM
Once the data has been read from the database, you don't need to write the same data to a file (i.e. CSV). Instead, you can write it directly into a Hive table using the DataFrame API, and once the data has been loaded you can query it from Hive:
df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")
Ref - https://spark.apache.org/docs/2.4.7/sql-data-sources-hive-tables.html
Sample code snippet:
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://<server name>:5432/<DBNAME>") \
.option("dbtable", "\"<SourceTableName>\"") \
.option("user", "<Username>") \
.option("password", "<Password>") \
.option("driver", "org.postgresql.Driver") \
.load()
df.write.mode('overwrite').saveAsTable("<TargetTableName>")
From Hive:
INFO : Compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10); Time taken: 0.591 seconds
INFO : Executing command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
.
.
.
+------+
| _c0 |
+------+
| 122 |
+------+
04-20-2023
10:47 PM
It's working as expected. Please find the code snippet below: >>> columns = ["language","users_count"]
>>> data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
>>> df = spark.createDataFrame(data).toDF(*columns)
>>> df.write.csv("/tmp/test")
>>> df2=spark.read.csv("/tmp/test/*.csv")
>>> df2.show()
+------+------+
| _c0| _c1|
+------+------+
|Python|100000|
| Scala| 3000|
| Java| 20000|
+------+------+
04-20-2023
05:31 AM
From the error, we can see that the query failed in the MoveTask. Since the LOAD statement targets a partitioned table, the MoveTask may be loading the partitions as well. Along with the HS2 logs, the HMS logs for the corresponding time period will give a better idea of the root cause of the failure. If it's just a timeout issue, increase the client socket timeout value.
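For the timeout case, the relevant knob is typically the metastore client socket timeout. A minimal sketch of the client-side setting, assuming a standard hive-site.xml layout (the value shown is illustrative, not a recommendation):

```xml
<!-- hive-site.xml (client side); the default is commonly 600s -->
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800s</value>
</property>
```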
10-13-2022
02:46 AM
@Sunil1359 Compilation time might be higher if the table has a large number of partitions or if the HMS process is slow when the query runs. Please check the following for the corresponding time period to find the root cause:
- HS2 log
- HMS log
- HMS jstack
With the Tez engine, queries run in the form of a DAG. In the compilation phase, once semantic analysis is complete, a plan is generated based on the query you submitted; explain <your query> shows that plan. Once the plan is generated, the DAG is submitted to YARN and runs according to the plan. As part of the DAG, split generation, input file reads, shuffle fetches, etc. are taken care of, and the end result is returned to the client.