Created on 09-30-2021 07:33 AM - edited 09-30-2021 07:35 AM
I need to load the CSV data into hive table but i am facing issues with embedded double quotes in few column values as well embedded commas in other columns .
Because of this, wherever embedded double quotes and embedded commas are occured , the data from there not loading properly and filled with nulls.
I have tried using below openCSV serde options and all. But no luck!
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
theCSV data looks like this
id,name,description,role
"1","peter","He is Data enginer"", "Senior Engineer"
"2","Anee","Hadoop Engineer"","Lead"
"3","James","Data, Architect"","Sr Architect"
hive data should look like this after loading
Id | Name | Description | Role |
1 | peter | He is Data Engineer | Senior Engineer |
2 | anee | Hadoop Engineer | Lead |
3 | james | Data, Architect | Sr Architect |
Created on 09-30-2021 10:30 AM - edited 09-30-2021 10:31 AM
I am able to acheive,please find my testcase
abc.txt
====
"1","peter","He is Data enginer", "Senior Engineer"
"2","Anee","Hadoop Engineer","Lead"
"3","James","Data, Architect","Sr Architect"
hdfs dfs -put abc.txt /user/hive/
create external table test_csv(num string ,name string ,work string ,designation string )
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties
(
"separatorChar" = ','
,"quoteChar" = '"'
)
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/hive/abc.txt' INTO TABLE test_csv;
select * from test_csv
+---------------+----------------+---------------------+-----------------------+
| test_csv.num | test_csv.name | test_csv.work | test_csv.designation |
+---------------+----------------+---------------------+-----------------------+
| 1 | peter | He is Data enginer | Senior Engineer |
| 2 | Anee | Hadoop Engineer | Lead |
| 3 | James | Data, Architect | Sr Architect |
| 1 | peter | He is Data enginer | Senior Engineer |
| 2 | Anee | Hadoop Engineer | Lead |
| 3 | James | Data, Architect | Sr Architect |
+---------------+----------------+---------------------+-----------------------+
Please accept it as solution,if your queries are answered and the testcase works in your scenario.
Created on 09-30-2021 10:30 AM - edited 09-30-2021 10:31 AM
I am able to acheive,please find my testcase
abc.txt
====
"1","peter","He is Data enginer", "Senior Engineer"
"2","Anee","Hadoop Engineer","Lead"
"3","James","Data, Architect","Sr Architect"
hdfs dfs -put abc.txt /user/hive/
create external table test_csv(num string ,name string ,work string ,designation string )
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties
(
"separatorChar" = ','
,"quoteChar" = '"'
)
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/hive/abc.txt' INTO TABLE test_csv;
select * from test_csv
+---------------+----------------+---------------------+-----------------------+
| test_csv.num | test_csv.name | test_csv.work | test_csv.designation |
+---------------+----------------+---------------------+-----------------------+
| 1 | peter | He is Data enginer | Senior Engineer |
| 2 | Anee | Hadoop Engineer | Lead |
| 3 | James | Data, Architect | Sr Architect |
| 1 | peter | He is Data enginer | Senior Engineer |
| 2 | Anee | Hadoop Engineer | Lead |
| 3 | James | Data, Architect | Sr Architect |
+---------------+----------------+---------------------+-----------------------+
Please accept it as solution,if your queries are answered and the testcase works in your scenario.
Created 09-30-2021 10:39 AM
Thanks for your response Ashish,
If you see my test data , its having two double quotes (extra double quote ) in column values . thats where its creating problems.
"1","peter","He is Data enginer"", "Senior Engineer"
"2","Anee","Hadoop Engineer"","Lead"
"3","James","Data, Architect"","Sr Architect"
Created 09-30-2021 10:48 AM
yes that was indeed a problem 🙂
I was about to comment on that