Created 08-24-2016 08:55 PM
Consider the following scenario: a Hive table with 5 columns (col1, col2, col3, col4, col5) and a CSV file with 3 columns (col1, col3, col5). Now I want to load the CSV file data into the Hive table with an exact CSV-to-Hive column mapping, as follows:

hive   csv data
col1   <-> col1
col2   empty
col3   <-> col3
col5   <-> col5
col4   empty

Any kind of help would be greatly appreciated.
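For reference, the scenario above could be sketched like this (table name, column types, and the delimiter are illustrative assumptions):

```sql
-- Target Hive table with 5 columns; col2 and col4 have no CSV counterpart
CREATE TABLE target_table (
  col1 STRING,
  col2 STRING,
  col3 STRING,
  col4 STRING,
  col5 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- The CSV file supplies only three fields per line, e.g.:
--   a,b,c     -> col1=a, col3=b, col5=c
```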
Created 08-24-2016 09:13 PM
My instinct is that the default Hive SerDe would be used and would not automatically skip over the col2 value as you've shown in your example. A few options for you:
Cheers!
Created 08-24-2016 09:09 PM
You can consider using a Hive external table with the same column names and data types as the CSV, and then map the column names while loading from the source (Hive external table) into the target (Hive table).
You can find examples of external table creation here:
https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables
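A minimal sketch of the external-table step described above (the table name, column types, and HDFS location are assumptions for illustration):

```sql
-- External table matching the 3-column layout of the CSV file;
-- the data stays in place at the given HDFS location
CREATE EXTERNAL TABLE csv_external (
  col1 STRING,
  col3 STRING,
  col5 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/csv_data';
```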
Created 08-24-2016 09:19 PM
Your statement might look something like: insert into csvinternal (col2) select col1 from csvexternal;
Created 08-24-2016 09:14 PM
When creating the table, use the following:
TBLPROPERTIES ('serialization.null.format'='')
Then do INSERT INTO table_name (col1, col3, col5) SELECT * FROM csvtable. This should just work.
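Putting that suggestion together, the DDL and load might look like this (table name and column types are illustrative; the `serialization.null.format` property makes the unmapped columns serialize as empty strings rather than `\N`):

```sql
-- Target table whose NULLs render as empty strings
CREATE TABLE target_table (
  col1 STRING,
  col2 STRING,
  col3 STRING,
  col4 STRING,
  col5 STRING
)
TBLPROPERTIES ('serialization.null.format' = '');

-- Name only the mapped columns; col2 and col4 are left NULL.
-- Specifying a column list in INSERT requires Hive 1.2.0 or later.
INSERT INTO TABLE target_table (col1, col3, col5)
SELECT * FROM csvtable;
```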
Created 08-24-2016 10:43 PM
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
Check out section 3.4 of the tutorial, which uses a regex. That should help you load the table with the columns you need.
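A sketch along the lines of that tutorial section, using Hive's RegexSerDe (the pattern below is an assumption for a simple 3-field CSV with no quoted or embedded commas; the table name and location are also illustrative):

```sql
-- Each capture group in input.regex maps to one column, in order
CREATE EXTERNAL TABLE csv_regex (
  col1 STRING,
  col3 STRING,
  col5 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^,]*),([^,]*),([^,]*)'
)
STORED AS TEXTFILE
LOCATION '/user/hive/csv_data';
```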
Another way, as @srai said, is to create an external table mapped to the CSV file, then create a managed table and load it with an INSERT INTO the managed table ... SELECT FROM the external table, explicitly stating the columns you want to load in the INSERT statement.
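The external-to-managed load described above might look like this (table names are assumptions carried over from the earlier replies):

```sql
-- Copy from the 3-column external table into the 5-column managed table,
-- explicitly listing the target columns; col2 and col4 remain NULL
INSERT INTO TABLE target_table (col1, col3, col5)
SELECT col1, col3, col5
FROM csv_external;
```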
Created 08-25-2016 09:50 PM
Even though the main requirement is addressed by the selected answer, I thought I should log this for future reference:
0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from src;
+----------+------------+--+
| src.key  | src.value  |
+----------+------------+--+
| 1        | Value1     |
| 2        | Value2     |
+----------+------------+--+
2 rows selected (0.187 seconds)

0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from tgt;
+----------+------------+--+
| tgt.key  | tgt.value  |
+----------+------------+--+
+----------+------------+--+
No rows selected (0.154 seconds)

0: jdbc:hive2://node1.hortonworks.com:10000/d> from (from src select transform(src.key,src.value) using '/bin/cat' as (tkey,tvalue) )tmap insert overwrite table tgt select tkey,tvalue;
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: from (from src select transfor...tkey,tvalue(Stage-1)
INFO  :
INFO  : Status: Running (Executing on YARN cluster with App id application_1471888656011_0009)
INFO  : Map 1: -/-
INFO  : Map 1: 0/1
INFO  : Map 1: 0/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 1/1
INFO  : Loading data to table default.tgt from hdfs://node1.hortonworks.com:8020/apps/hive/warehouse/tgt/.hive-staging_hive_2016-08-25_21-51-10_715_1000932141605500109-1/-ext-10000
INFO  : Table default.tgt stats: [numFiles=1, numRows=2, totalSize=18, rawDataSize=16]
No rows affected (19.992 seconds)

0: jdbc:hive2://node1.hortonworks.com:10000/d> select * from tgt;
+----------+------------+--+
| tgt.key  | tgt.value  |
+----------+------------+--+
| 1        | Value1     |
| 2        | Value2     |
+----------+------------+--+
2 rows selected (0.197 seconds)