<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Unable to map the data properly from a CSV file to a Hive table on HDFS in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Unable-to-map-the-data-properly-from-a-CSV-file-to-a-Hive/m-p/88880#M45524</link>
    <description>&lt;P&gt;There is something very unusual happening here. Based on your outputs, values are not only ending up in the wrong columns, but you are even getting different values!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the 'correct' record you have&amp;nbsp;5686.76, and in the 'wrong' record you have -5686.76.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My first guess was that there is a mistake in how you send data to the appropriate columns, but I don't see how that could explain a minus sign changing position.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To troubleshoot something like this, it is really important to dig into the details. I would therefore recommend boiling your question down to a minimal reproducible example, eliminating any complexity that is not causing the unexpected results.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example: you show a load command to get data into Spark; consider replacing it with an actual string (and make sure to check whether that string still reproduces the problem).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You also show two writes, but once we have the exact input and code that reproduce the problem, the write that behaves correctly is probably not relevant.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, you use some code to build the column list; consider hardcoding it first.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As mentioned, really try to take out all complexity until you land on the minimal amount that still reproduces the problem.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hopefully you will already see the answer once you have eliminated all the distractions, and if not, you will have a fully trimmed-down version which you can use to update your question here!&lt;/P&gt;</description>
    <pubDate>Tue, 09 Apr 2019 14:41:09 GMT</pubDate>
    <dc:creator>DennisJaheruddi</dc:creator>
    <dc:date>2019-04-09T14:41:09Z</dc:date>
    <item>
      <title>Unable to map the data properly from a CSV file to a Hive table on HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Unable-to-map-the-data-properly-from-a-CSV-file-to-a-Hive/m-p/86506#M45523</link>
      <description>&lt;P&gt;I am trying to load a dataframe into a Hive table by following the below steps:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Read the source table into a dataframe:&lt;BR /&gt;
&lt;PRE&gt;val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2016")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "header_id")
  .option("lowerBound", 199199)
  .option("upperBound", 284058)
  .option("numPartitions", 10)
  .load()&lt;/PRE&gt;
&lt;/LI&gt;
&lt;LI&gt;Order the columns to match my Hive table. My Hive table columns are present in a string in the format:&lt;BR /&gt;
&lt;PRE&gt;val hiveCols          = "col1:coldatatype|col2:coldatatype|col3:coldatatype|col4:coldatatype...col200:datatype"
val schemaList        = hiveCols.split("\\|")
val hiveColumnOrder   = schemaList.map(e =&amp;gt; e.split("\\:")).map(e =&amp;gt; e(0)).toSeq
val finalDF           = yearDF.selectExpr(hiveColumnOrder:_*)&lt;/PRE&gt;
The order of the columns I read in "execQuery" is the same as "hiveColumnOrder", and just to be sure of the order, I select the columns in yearDF once again using selectExpr.&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Saving the dataframe as a CSV file on HDFS:&lt;/P&gt;
&lt;PRE&gt;finalDF.write.format("csv").save("hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/")&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
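&lt;P&gt;For clarity, the column-ordering step above can be sketched in plain Python (a hypothetical translation of the Scala; the column names are made up, and the real string has about 200 entries):&lt;/P&gt;

```python
# Hive columns as "name:type" pairs separated by '|'
# (hypothetical names; the real string has ~200 entries).
hive_cols = "col1:int|col2:string|col3:double"

# Split on '|' to get the pairs, then keep only the name part.
schema_list = hive_cols.split("|")
hive_column_order = [entry.split(":")[0] for entry in schema_list]

print(hive_column_order)  # ['col1', 'col2', 'col3']
```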
&lt;P&gt;Once I save the dataframe, I take the same columns from "hiveCols" and prepare a DDL to create a Hive table at the same location, with comma-separated values, as given below:&lt;/P&gt;
&lt;PRE&gt;    create table if not exists schema.tablename(col1 coldatatype,col2 coldatatype,col3 coldatatype,col4 coldatatype...col200 datatype)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs://username/apps/hive/warehouse/database.db/lines_test_data56/';&lt;/PRE&gt;
&lt;P&gt;After loading the dataframe into the created table, the problem I am facing is that I get improper output when I query the table. For example, if I apply the below query on the dataframe before saving it as a file:&lt;/P&gt;
&lt;PRE&gt;finalDF.createOrReplaceTempView("tmpTable")
select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from tmpTable where header_id=19924598 and line_num=2&lt;/PRE&gt;
&lt;P&gt;I get the output properly. All the values are properly aligned to the columns:&lt;/P&gt;
&lt;PRE&gt;[19924598,2,null,null,381761.40000000000000000000,381761.4,-381761.40000000000000000000,-381761.4,0.01489610000000000000,0.014896100000000,5686.76000000000000000000,5686.76]&lt;/PRE&gt;
&lt;P&gt;But after saving the dataframe to a CSV file, creating a table on top of it (step 4), and applying the same query on the created table, I see the data is jumbled and improperly mapped to the columns:&lt;/P&gt;
&lt;PRE&gt;select header_id,line_num,debit_rate,debit_rate_text,credit_rate,credit_rate_text,activity_amount,activity_amount_text,exchange_rate,exchange_rate_text,amount_cr,amount_cr_text from schema.tablename where header_id=19924598 and line_num=2

+-----------+----------+------------+-----------------+-------------+------------------+-----------------+----------------------+---------------+--------------------+-----------+----------------+
| header_id | line_num | debit_rate | debit_rate_text | credit_rate | credit_rate_text | activity_amount | activity_amount_text | exchange_rate | exchange_rate_text | amount_cr | amount_cr_text |
+-----------+----------+------------+-----------------+-------------+------------------+-----------------+----------------------+---------------+--------------------+-----------+----------------+
| 19924598  | 2        | NULL       |                 | 381761.4    |                  | 5686.76         | 5686.76              | NULL          | -5686.76           | NULL      |                |&lt;/PRE&gt;
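&lt;P&gt;One common way CSV values can end up shifted like this is when a text field itself contains the field delimiter: Spark's CSV writer quotes such fields, but a Hive TEXTFILE table with FIELDS TERMINATED BY ',' splits on every comma and does not honor the quotes. A minimal, hypothetical Python illustration of that mechanism:&lt;/P&gt;

```python
import csv
import io

# A hypothetical row where one text field contains a comma.
row = ["19924598", "2", "contains, a comma", "-5686.76"]

# A standard CSV writer quotes the embedded comma, so a CSV-aware
# reader recovers the original four fields intact.
buf = io.StringIO()
csv.writer(buf).writerow(row)
recovered = next(csv.reader(io.StringIO(buf.getvalue())))
print(len(recovered))  # 4

# But a reader that splits naively on ',' (as a plain-delimited
# TEXTFILE table does) sees an extra column, shifting every value
# to the right of the embedded comma.
naive = buf.getvalue().strip().split(",")
print(len(naive))  # 5
```

&lt;P&gt;If this is the cause, inspecting a few raw lines of the written files should show quoted fields containing commas.&lt;/P&gt;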
&lt;P&gt;&lt;BR /&gt;So I tried a different approach, where I create the Hive table upfront and insert data into it from the dataframe:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Running the DDL in step 4 above&lt;/LI&gt;
&lt;LI&gt;finalDF.createOrReplaceTempView("tmpTable")&lt;/LI&gt;
&lt;LI&gt;spark.sql("insert into schema.table select * from tmpTable")&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Even this approach fails when I run the aforementioned select query once the job is completed. I tried refreshing the table using&lt;/P&gt;
&lt;PRE&gt;refresh table schema.table
msck repair table schema.table&lt;/PRE&gt;
&lt;P&gt;just to see if there was any problem with the metadata, but nothing seems to work.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Could anyone let me know what is causing this behavior, and whether there is any problem with the way I am handling the data here?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 14:09:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Unable-to-map-the-data-properly-from-a-CSV-file-to-a-Hive/m-p/86506#M45523</guid>
      <dc:creator>Sidhartha</dc:creator>
      <dc:date>2022-09-16T14:09:47Z</dc:date>
    </item>
    <item>
      <title>Re: Unable to map the data properly from a CSV file to a Hive table on HDFS</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Unable-to-map-the-data-properly-from-a-CSV-file-to-a-Hive/m-p/88880#M45524</link>
      <description>&lt;P&gt;There is something very unusual happening here. Based on your outputs, values are not only ending up in the wrong columns, but you are even getting different values!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the 'correct' record you have&amp;nbsp;5686.76, and in the 'wrong' record you have -5686.76.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My first guess was that there is a mistake in how you send data to the appropriate columns, but I don't see how that could explain a minus sign changing position.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To troubleshoot something like this, it is really important to dig into the details. I would therefore recommend boiling your question down to a minimal reproducible example, eliminating any complexity that is not causing the unexpected results.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example: you show a load command to get data into Spark; consider replacing it with an actual string (and make sure to check whether that string still reproduces the problem).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You also show two writes, but once we have the exact input and code that reproduce the problem, the write that behaves correctly is probably not relevant.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Also, you use some code to build the column list; consider hardcoding it first.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As mentioned, really try to take out all complexity until you land on the minimal amount that still reproduces the problem.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hopefully you will already see the answer once you have eliminated all the distractions, and if not, you will have a fully trimmed-down version which you can use to update your question here!&lt;/P&gt;</description>
      <pubDate>Tue, 09 Apr 2019 14:41:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Unable-to-map-the-data-properly-from-a-CSV-file-to-a-Hive/m-p/88880#M45524</guid>
      <dc:creator>DennisJaheruddi</dc:creator>
      <dc:date>2019-04-09T14:41:09Z</dc:date>
    </item>
  </channel>
</rss>

