Created 05-08-2023 12:16 AM
I have issue in importing the data from dataframe converted to csv then uploading it into hive but its not loading properly .
My procedure:
1st I took a Data frame from database and converted into a csv ,which has 343 columns and 24 lakhs rows
2nd I took the csv file to hive and I loaded the data to hive using load data code to table which i created directly by connect the hive to same database .
this is what ,I am doing.
In this case , my issue is for some rows it taking proper values but for some is null or 0.
then i took a sample of 5 rows and I checked manually then i find out in csv file some rows there are some extra comma .so I manually removed and tried ,it worked but this cant be happening in real-time .
so pls help me on this by giving some suggestion.
Created 05-08-2023 02:23 AM
@jijy, Welcome to our community! To help you get the best possible answer, I have tagged our Hive experts @smruti, @asish, @Asok, @tjangid who may be able to assist you further.
Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur,Created 05-08-2023 02:27 AM
Created on 06-06-2023 09:37 PM - edited 06-06-2023 09:38 PM
Once the data has been read from database, you don't need to write the same data to file (i.e. CSV ) . Instead you can write directly into hive table using DataFrame API's. Once the Data has been loaded you query the same from hive.
df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")
Ref - https://spark.apache.org/docs/2.4.7/sql-data-sources-hive-tables.html
Sample Code Snippet
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://<server name>:5432/<DBNAME>") \
.option("dbtable", "\"<SourceTableName>\"") \
.option("user", "<Username>") \
.option("password", "<Password>") \
.option("driver", "org.postgresql.Driver") \
.load()
df.write.mode('overwrite').saveAsTable("<TargetTableName>")
From hive
INFO : Compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10); Time taken: 0.591 seconds
INFO : Executing command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
.
.
.
+------+
| _c0 |
+------+
| 122 |
+------+