Support Questions

girish02c · ‎08-24-2016

Consider the following scenario

Hive table with 5 columns (col1, col2, col3, col4, col5)
CSV file with 3 columns (col1, col3, col5)

Now I want to load CSV file data into hive table with exact csv to hive column mapping as follows.

hive 		csv data
col1     <-> 	col1
col2 		empty
col3     <-> 	col3
col5     <-> 	col5
col4 		empty

Any kind of help would be greatly appreciate.

wfloyd · ‎08-24-2016

My instict is that the default Hive SerDe would be used and would not automatically skip over the col2 value as you've shown in your example. A few options for you:

Ingest the raw CSV data into a 3 column temp Hive table. Perform an "Insert ... Select * from temp_hive_table" to push those three column values into your destination Hive table.
Write a brief Pig script to parse the CSV table and push to your destination Hive table
Write your own Hive SerDe - https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inandCustomSerDes

Cheers!

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,Stor...

View solution in original post

srai1 · ‎08-24-2016

You can consider using hive external table with the same column name and data types and then map the column names while loading from Source (hive external table) to Target (hive table).

You can get the examples here for creation of external table.

https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables

srai1 · ‎08-24-2016

Your statement might look something like insert into csvinternal (col2) select col1 from cvsexternal;

wfloyd · ‎08-24-2016

My instict is that the default Hive SerDe would be used and would not automatically skip over the col2 value as you've shown in your example. A few options for you:

Ingest the raw CSV data into a 3 column temp Hive table. Perform an "Insert ... Select * from temp_hive_table" to push those three column values into your destination Hive table.
Write a brief Pig script to parse the CSV table and push to your destination Hive table
Write your own Hive SerDe - https://cwiki.apache.org/confluence/display/Hive/SerDe#SerDe-Built-inandCustomSerDes

Cheers!

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormat,Stor...

mqureshi · ‎08-24-2016

When creating table use the following:

TBLPROPERTIES ('serialization.null.format'='')

Then do INSERT INTO table_name (col1,col3, col5) select * from csvtable . Check the following. This should just work.

https://community.hortonworks.com/questions/1216/techniques-for-dealing-with-malformed-data-hive.htm...

girish02c · ‎08-24-2016

@Wes Floyd @srai @mqureshi Thank you so much for your quick responses to my question. As I already have staging table to orc table implementation structure I will try to use csv column headers to create staging tables and then I will load staging data to actual table.

sbomma · ‎08-24-2016

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

Check out section 3.4 from the tutorial using the reg-ex. This would help you load the table with columns you need.

Also, Another way is like @srai said, create an external table, mapped it to the the csv file. Create a managed table and insert the data using insert into managed table select from external table, explicitly state the columns you want to load with the insert statement.

srai1 · ‎08-25-2016

Even though the main requirement is addressed based on the choice of selected answer, thought I should log this for reference in the future:

0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from src;
+----------+------------+--+
| src.key  | src.value  |
+----------+------------+--+
| 1        | Value1     |
| 2        | Value2     |
+----------+------------+--+
2 rows selected (0.187 seconds)
0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from tgt;
+----------+------------+--+
| tgt.key  | tgt.value  |
+----------+------------+--+
+----------+------------+--+
No rows selected (0.154 seconds)
0: jdbc:hive2://node1.hortonworks.com:10000/d> from (from src select transform(src.key,src.value) using '/bin/cat' as (tkey,tvalue) )tmap insert overwrite table tgt select tkey,tvalue;
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: from (from src select transfor...tkey,tvalue(Stage-1)
INFO  : 


INFO  : Status: Running (Executing on YARN cluster with App id application_1471888656011_0009)


INFO  : Map 1: -/-	
INFO  : Map 1: 0/1	
INFO  : Map 1: 0/1	
INFO  : Map 1: 0(+1)/1	
INFO  : Map 1: 1/1	
INFO  : Loading data to table default.tgt from hdfs://node1.hortonworks.com:8020/apps/hive/warehouse/tgt/.hive-staging_hive_2016-08-25_21-51-10_715_1000932141605500109-1/-ext-10000
INFO  : Table default.tgt stats: [numFiles=1, numRows=2, totalSize=18, rawDataSize=16]
No rows affected (19.992 seconds)
0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from tgt;
+----------+------------+--+
| tgt.key  | tgt.value  |
+----------+------------+--+
| 1        | Value1     |
| 2        | Value2     |
+----------+------------+--+
2 rows selected (0.197 seconds)

Cloudera Community

Support Questions

Is there is any workaround to map csv columns to hive columns?

Extract a column from csv

Using Ranger Mask Hive Columns for Users

NIFI processor for converting a numeric type to st...

Hive ARRAY column INSERT INTO SELECT FROM exceptio...

Support Video: How to mask Hive columns using Atla...

How to extract csv column record and used it for ...

How to use Column Masking and Row Filtering

Dynamic Column insertion in hbase

Map Hive jobs to YARN queues

Workaround for SmartSense capture bundle timeout