
Is there any workaround to map CSV columns to Hive columns?

Contributor
Consider the following scenario:

Hive table with 5 columns (col1, col2, col3, col4, col5)
CSV file with 3 columns (col1, col3, col5)

Now I want to load the CSV data into the Hive table with the exact CSV-to-Hive column mapping shown below.

hive        csv data
col1  <->   col1
col2        empty
col3  <->   col3
col5  <->   col5
col4        empty

Any kind of help would be greatly appreciated.
1 ACCEPTED SOLUTION

Expert Contributor

My instinct is that the default Hive SerDe would be used and would not automatically skip over the col2 value as you've shown in your example. A few options for you:

  1. Ingest the raw CSV data into a 3-column temporary Hive table, then perform an "Insert ... Select * from temp_hive_table" to push those three column values into your destination Hive table (see the sketch after this list).
  2. Write a brief Pig script to parse the CSV data and push it to your destination Hive table.
  3. Write your own Hive SerDe - https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inandCustomSerDes
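
A minimal sketch of option 1, assuming hypothetical table names, column types, and file path; col2 and col4 are filled in explicitly with NULLs:

-- Stage the 3-column CSV in a temporary table (hypothetical names).
CREATE TABLE temp_hive_table (col1 STRING, col3 STRING, col5 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/tmp/data.csv' INTO TABLE temp_hive_table;

-- Map the three staged columns into the 5-column destination table,
-- casting NULL so col2 and col4 get an explicit type.
INSERT INTO TABLE dest_table
SELECT col1, CAST(NULL AS STRING), col3, CAST(NULL AS STRING), col5
FROM temp_hive_table;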

Cheers!

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,Stor...


7 REPLIES

Guru

You can consider using a Hive external table with the same column names and data types as the CSV, and then map the columns while loading from the source (the Hive external table) into the target (the Hive table).

You can find examples of external table creation here:

https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables
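
For instance, a minimal sketch of such an external table, assuming the CSV files sit in an HDFS directory (names and location are hypothetical):

-- External table whose columns mirror the CSV layout.
CREATE EXTERNAL TABLE csv_external (col1 STRING, col3 STRING, col5 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/csv_data';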

Guru

Your statement might look something like: insert into csvinternal (col2) select col1 from csvexternal;


Super Guru

When creating the table, use the following:

TBLPROPERTIES ('serialization.null.format'='')

Then do INSERT INTO table_name (col1, col3, col5) SELECT * FROM csvtable. Check the following link; this should just work.

https://community.hortonworks.com/questions/1216/techniques-for-dealing-with-malformed-data-hive.htm...
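
Putting the two pieces together in a minimal sketch (names are hypothetical; note that INSERT with an explicit column list needs a reasonably recent Hive, and the unlisted columns are filled with NULL):

-- Target table that reads empty strings back as NULL.
CREATE TABLE target_table (
  col1 STRING, col2 STRING, col3 STRING, col4 STRING, col5 STRING
)
TBLPROPERTIES ('serialization.null.format'='');

-- Columns omitted from the list (col2, col4) are set to NULL.
INSERT INTO target_table (col1, col3, col5)
SELECT * FROM csvtable;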

Contributor
@Wes Floyd @srai @mqureshi Thank you so much for your quick responses to my question. Since I already have a staging-table-to-ORC-table structure in place, I will try to use the CSV column headers to create the staging tables and then load the staging data into the actual table.

Super Collaborator

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

Check out section 3.4 of the tutorial, which uses a regex SerDe (see the sketch at the end of this reply). That should help you load the table with the columns you need.

Also, another way, as @srai said, is to create an external table mapped to the CSV file, then create a managed table and insert the data using insert into managed_table select ... from external_table, explicitly stating the columns you want to load in the insert statement.
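
As mentioned above, a minimal sketch of the regex approach, with a hypothetical pattern, names, and location; each capture group in input.regex maps to one table column:

-- Parse each CSV line with Hive's built-in RegexSerDe.
CREATE EXTERNAL TABLE csv_regex (col1 STRING, col3 STRING, col5 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^,]*),([^,]*),([^,]*)")
LOCATION '/user/hive/csv_data';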

Guru

Even though the main requirement is addressed by the selected answer, I thought I should log this for future reference:

0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from src;
+----------+------------+--+
| src.key  | src.value  |
+----------+------------+--+
| 1        | Value1     |
| 2        | Value2     |
+----------+------------+--+
2 rows selected (0.187 seconds)
0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from tgt;
+----------+------------+--+
| tgt.key  | tgt.value  |
+----------+------------+--+
+----------+------------+--+
No rows selected (0.154 seconds)
0: jdbc:hive2://node1.hortonworks.com:10000/d> from (from src select transform(src.key,src.value) using '/bin/cat' as (tkey,tvalue) )tmap insert overwrite table tgt select tkey,tvalue;
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: from (from src select transfor...tkey,tvalue(Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1471888656011_0009)
INFO  : Map 1: -/-	
INFO  : Map 1: 0/1	
INFO  : Map 1: 0/1	
INFO  : Map 1: 0(+1)/1	
INFO  : Map 1: 1/1	
INFO  : Loading data to table default.tgt from hdfs://node1.hortonworks.com:8020/apps/hive/warehouse/tgt/.hive-staging_hive_2016-08-25_21-51-10_715_1000932141605500109-1/-ext-10000
INFO  : Table default.tgt stats: [numFiles=1, numRows=2, totalSize=18, rawDataSize=16]
No rows affected (19.992 seconds)
0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from tgt;
+----------+------------+--+
| tgt.key  | tgt.value  |
+----------+------------+--+
| 1        | Value1     |
| 2        | Value2     |
+----------+------------+--+
2 rows selected (0.197 seconds)