Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Remove double quotation in impala output

avatar
New Contributor

I have a csv data which I have to load in impala/hive. does anyone knows how to remove the double quotation mark in the output? Here is my sample create table scripts.

 

CREATE EXTERNAL TABLE schema.table (   id int,   name STRING,   desc STRING,   desc1 STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' WITH SERDEPROPERTIES ('field.delim'='|', 'serialization.format'='|') STORED AS TEXTFILE LOCATION '/user/folder/schema/table'

6 REPLIES 6

avatar
Super Collaborator

Hi,

 

Can you provide the sample output for the same where you are seeing the double quotation mark?

 

Regards,

Chethan YM

avatar
New Contributor

Hello,

 

This is the sample output:

 

Screenshot 2021-08-13 084332.png

avatar
Super Collaborator

Hi @lalala ,

 

This behavior was caused by the csv module when impala is using it to export the data.

 

    # csv.writer expects a file handle to the input.

    # cStringIO is used as the temporary buffer.

    temp_buffer = StringIO()

    writer = csv.writer(temp_buffer, delimiter=self.field_delim,

                        lineterminator='\n', quoting=csv.QUOTE_MINIMAL)

    writer.writerows(rows)

 

Seems to be we cannot change this since it is needs to modified at code level.

 

[1]. 

https://github.com/apache/impala/blob/014c455aaaa38010ae706228f7b439c080c0bc7d/shell/shell_output.py...

avatar
New Contributor

1. in Hive  we can use OpenCSVSerde to remove the double quotation marks, however this is not compatible with Impala. is there any equivalent OpenCSVSerde in Impala?

 

2.ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' WITH SERDEPROPERTIES ( "separatorChar" = "|" , "quoteChar" = "\"" ) is not also working.

avatar
Expert Contributor

Hello @lalala @ChethanYM did you find any solution to remove double-quote in impala output on a csv external file table.
Input external .gz file has row:
true,false,"US","Los Angeles","California","Cloudflare"

Output of select * from impala external table defined on .gz file above shows:
true false "US" "Los Angeles" "California" "Cloudflare"

 

FYI in AWS Athena doc it says: 

https://docs.aws.amazon.com/athena/latest/ug/lazy-simple-serde.html

LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files

 

Use this SerDe if your data does not have values enclosed in quotes.


How to remove the double-quotes.

avatar
Expert Contributor

UPDATE:

One possible workaround to suppress these quotes from displaying in select * is to create a view like below in Impala:

 

CREATE VIEW db1.view1 AS
SELECT replace(table1.quotedcol1, '"', '') quotedcol1, replace(table1.quotedcol2, '"', '') quotedcol2
FROM db1.table1;