Created on 04-17-2023 03:03 AM - edited 04-17-2023 03:08 AM
My data set is as below.
"IM43163","SOUTH,OFC","10-Jan-23"
"IM41763","John:
comment added","12-Jan-23"
CREATE EXTERNAL TABLE database.table1( `col_1` string, `col_2` string, `col_3` string)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar"="," , "quoteChar"="\"")
Since the record is splitted amont two rows therefore it is not loading properly and gives null values. The output I am getting is mentioned below.
IM43163 | SOUTH,OFC | 10-Jan-23 |
IM41763 | ? | ? |
? | ? | ? |
Created 04-17-2023 09:33 PM
Hi @Abdul_ ,
It looks like the issue is that the data contains a newline character (\n) within a field value, which causes the record to be split into two rows, causing the problem.
Can you modify the data to remove "\n" from the sample data? In that case, the create statement that you are using is correct.
However, if data modification is impossible, you may use "LazySimpleSerDe". However, it may not be as performant as the OpenCSVSerde for large datasets.
Hope this helps,
Tarun
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs-up button.
Created 04-17-2023 11:07 PM
Hi @tj2007, It is not possible to modify the data. I have tried the "LazySimpleSerDe" but it didn't give the correct output (Mentioned below).
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties ( "separatorChar" = ",", "quoteChar" = '\""') STORED AS TEXTFILE
"IM43163","SOUTH,OFC","10-Jan-23" | ? | ? |
"IM41763","John: | ? | ? |
comment added","12-Jan-23" | ? | ? |
However, I need output like this.
IM43163 | SOUTH,OFC | 10-Jan-23 |
IM41763 | John:comment added | 12-Jan-23 |
Please also note that the input file is a CSV file which I am successfully able to open it in Excel. Your support will be highly appreciated.
Created 06-15-2023 11:47 PM
@Abdul_
As of now hive won't support row delimiter other new line character . Attaching the corresponding Jira for reference HIVE-11996
As a workaround, Recommend to update the input file using external libraries like awk,...etc and upload the input file in the corresponding FileSystem location to read.
Eg -
Through AWK
[root@c2757-node2 ~]# awk -F "\",\"" 'NF < 3 {getline nextline; $0 = $0 nextline} 1' sample_case.txt
"IM43163","SOUTH,OFC","10-Jan-23"
"IM41763","John:comment added","12-Jan-23"
[root@c2757-node2 ~]# awk -F "\",\"" 'NF < 3 {getline nextline; $0 = $0 nextline} 1' sample_case.txt > sample_text.csv
Reading from Hive Table
0: jdbc:hive2://c2757-node2.coelab.cloudera.c> select * from table1;
.
.
.
INFO : Executing command(queryId=hive_20230616064136_333ff98d-636b-43b1-898d-fca66031fe7f): select * from table1
INFO : Completed executing command(queryId=hive_20230616064136_333ff98d-636b-43b1-898d-fca66031fe7f); Time taken: 0.023 seconds
INFO : OK
+---------------+---------------------+---------------+
| table1.col_1 | table1.col_2 | table1.col_3 |
+---------------+---------------------+---------------+
| IM43163 | SOUTH,OFC | 10-Jan-23 |
| IM41763 | John:comment added | 12-Jan-23 |
+---------------+---------------------+---------------+
2 rows selected (1.864 seconds)