Support Questions

Find answers, ask questions, and share your expertise

Regarding data import into hive from csv

avatar
New Contributor

I have issue in importing the data from dataframe converted to csv then uploading it into hive but its not loading properly .

My procedure:

1st I took a Data frame from database and converted into a csv ,which has 343 columns and 24 lakhs rows 

2nd I took the csv file to hive and I loaded the data to hive using load data code to table which i created directly by connect the hive to same database .

this is what ,I am doing.

In this case , my issue is for some rows it taking proper values but for some is null  or 0.

then i took a sample of 5 rows and I checked manually then i find out in csv file some rows there are some extra comma .so I manually removed and tried ,it worked but this cant be happening in real-time .

so pls help me on this by giving some suggestion.

3 REPLIES 3

avatar
Community Manager

@jijy, Welcome to our community! To help you get the best possible answer, I have tagged our Hive experts @smruti@asish@Asok@tjangid  who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Super Collaborator

Hello @jijy,

 

could you please share your create table statement and some sample data?

 

Regards

avatar
Super Collaborator

Once the data has been read from database, you don't need to write the same data to  file (i.e. CSV ) .  Instead you can write directly into hive table using DataFrame API's.  Once the Data has been loaded you query the same from hive. 

 

df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")

 

 
Ref - https://spark.apache.org/docs/2.4.7/sql-data-sources-hive-tables.html


Sample Code Snippet 

 

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://<server name>:5432/<DBNAME>") \
    .option("dbtable", "\"<SourceTableName>\"") \
    .option("user", "<Username>") \
    .option("password", "<Password>") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.write.mode('overwrite').saveAsTable("<TargetTableName>")


From hive 

INFO  : Compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10); Time taken: 0.591 seconds
INFO  : Executing command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
.
.
.
+------+
| _c0  |
+------+
| 122  |
+------+