Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Unable to map the data properly from a CSV file to a Hive table on HDFS

avatar
Explorer

I am trying to load a dataframe into a Hive table by following the below steps:

  • Read the source table and save the dataframe as a CSV file on HDFS
    val yearDF = spark.read.format("jdbc").option("url", connectionUrl).option("dbtable", s"(${execQuery}) as year2016").option("user", devUserName).option("password", devPassword).option("partitionColumn","header_id").option("lowerBound", 199199).option("upperBound", 284058).option("numPartitions",10).load()
  • Order the columns as per my Hive table columns My hive table columns are present in a string in the format of:
    val hiveCols          = "col1:coldatatype|col2:coldatatype|col3:coldatatype|col4:coldatatype...col200:datatype"
    val schemaList        = hiveCols.split("\\|")
    val hiveColumnOrder   = schemaList.map(e => e.split("\\:")).map(e => e(0)).toSeq
    val finalDF           = yearDF.selectExpr(hiveColumnOrder:_*)
    The order of columns that I read in "execQuery" are same as "hiveColumnOrder" and just to make sure of the order, I select the columns in yearDF once again using selectExpr
  • Saving the dataframe as a CSV file on HDFS:

    newDF.write.format("CSV").save("hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/")

     

Once I save the dataframe, I take the same columns from "hiveCols", prepare a DDL to create a hive table on the same location with values being comma separated as given below:

    create table if not exists schema.tablename(col1 coldatatype,col2 coldatatype,col3 coldatatype,col4 coldatatype...col200 datatype)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/';

After I load the dataframe into the table created, the problem I am facing here is when I query the table, I am getting improper output in the query. For ex: If I apply the below query on the dataframe before saving it as a file:

finalDF.createOrReplaceTempView("tmpTable")
select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from tmpTable where header_id=19924598 and line_num=2

I get the output properly. All the values are properly aligned to the columns:

[19924598,2,null,null,381761.40000000000000000000,381761.4,-381761.40000000000000000000,-381761.4,0.01489610000000000000,0.014896100000000,5686.76000000000000000000,5686.76]

But after saving the dataframe in a CSV file, create a table on top of it (step4) and apply the same query on the created table I see the data is jumbled and improperly mapped with the columns:

select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from schema.tablename where header_id=19924598 and line_num=2

+---------------+--------------+-------------+------------------+-------------+------------------+--------------------------+-------------------------------+------------------------+-----------------------------+--------------------+-------------------------+--+
| header_id     | line_num     | debit_rate  | debit_rate_text  | credit_rate  | credit_rate_text  | activity_amount  | activity_amount_text  | exchange_rate  | exchange_rate_text  | amount_cr  | amount_cr_text  |
+---------------+--------------+-------------+------------------+-------------+------------------+--------------------------+-------------------------------+------------------------+-----------------------------+--------------------+-------------------------+--+
| 19924598      | 2            | NULL        |                  | 381761.4    |                    | 5686.76          | 5686.76               | NULL           | -5686.76            | NULL       |                 |


So I tried use a different approach where I created the hive table upfront and insert data into it from:

  • dataframe:Running the DDL in step4 above
  • finalDF.createOrReplaceTempView("tmpTable")
  • spark.sql("insert into schema.table select * from tmpTable")

And even this way fails if I run the aforementioned select query once the job is completed. I tried to refresh the table using

refresh table schema.table and msckrepair table schema.table

just to see if there is any problem with the metadata but nothing seems to workout.

 

Could anyone let me know what is causing this phenomenon, is there is any problem with the way I operating the data here ?

1 ACCEPTED SOLUTION

avatar

There is something very unusual happening here. Based on your outputs, values are not only ending up in the wrong columns, but you are even getting different values!

 

In the 'correct' record, you have 5686.76, and in the 'wrong' record you have -5686.76.

 

My first guess was that there is a mistake in how you send data to the appropriate columns, but I don't see how that can explain a minus sign changing position.

 

To troubleshoot something like this, it is really important to dig into the details. I would therefore recommend you to bring your question down to a 'Minimal reproducible example'. Eliminating any complexity that is not causing unexpected results.

 

For example: You show a load command to get data into spark, consider replacing it with an actual string (and make sure to check whether the string allows you to reproduce the problem). 

 

You also show 2 writes, but if we have the exact input and code to reproduce the problem the correct answer is probably not relevant.

 

Also, you use some code to list columns, consider hardcoding it first.

 

As mentioned, really try to take out all complexity untill we land on a minimal amount that still reproduces the problem.

 

Hopefully you will already see the answer once you have eliminated all the distractions, and if not you will have a fully trimmed down version, which you can use to update your question here!


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.

View solution in original post

1 REPLY 1

avatar

There is something very unusual happening here. Based on your outputs, values are not only ending up in the wrong columns, but you are even getting different values!

 

In the 'correct' record, you have 5686.76, and in the 'wrong' record you have -5686.76.

 

My first guess was that there is a mistake in how you send data to the appropriate columns, but I don't see how that can explain a minus sign changing position.

 

To troubleshoot something like this, it is really important to dig into the details. I would therefore recommend you to bring your question down to a 'Minimal reproducible example'. Eliminating any complexity that is not causing unexpected results.

 

For example: You show a load command to get data into spark, consider replacing it with an actual string (and make sure to check whether the string allows you to reproduce the problem). 

 

You also show 2 writes, but if we have the exact input and code to reproduce the problem the correct answer is probably not relevant.

 

Also, you use some code to list columns, consider hardcoding it first.

 

As mentioned, really try to take out all complexity untill we land on a minimal amount that still reproduces the problem.

 

Hopefully you will already see the answer once you have eliminated all the distractions, and if not you will have a fully trimmed down version, which you can use to update your question here!


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.