Problem in reading CSV file using Apache Spark

Explorer

Hi,

I have a pipe-delimited CSV file with 3 columns. The value in the 2nd column contains newlines and a quoted string, as shown below:

 

"2020-02-23 11:15:39"|"

Hi Craig,

Please approve the standard pricing.

 

No further amendments made "Legal System."

 

Justification

-XXX is the sole owner in China

Thank you.

"|"Approved"

 

I am trying to read the file with the spark.read.csv API, but it does not parse the file correctly. I am using Spark 2.3.0. These are the commands I used:

 

val path = "/user/1234/abc.csv"
val inputDf = spark.read
  .option("delimiter", "|").option("wholeFile", true).option("multiline", true)
  .option("header", false).option("inferSchema", false).csv(path)

Could you please help me out?

Note: the second field contains a phrase within quotes ("Legal System.").

 

Thanks

3 REPLIES

Explorer

Any help on the above query?

Expert Contributor

Hello @ravisro,

 

I don't think there is a straightforward way to do this; in my view, we need to do some data cleansing before feeding the file to Spark. The input you shared contains newline characters (\n) inside the second field, which can lead Spark to treat them as record boundaries instead of as part of the data.

 

I did some data cleansing (removing the newlines), which gave me the result below

 

inputDf.show(false)
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|_c0                |_c1                                                                                                                                                |_c2     |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|2020-02-23 11:15:39|"Hi Craig, Please approve the standard pricing. No further amendments made "Legal System."Justification -XXX is the sole owner in China Thank you."|Approved|
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+

when running the same code from Spark.
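
For anyone who wants to try this kind of cleansing, here is a minimal sketch in plain Scala. It is not necessarily the exact code that produced the output above. It assumes that every double quote toggles an "inside a field" state, so embedded quotes must come in pairs on a single line (as in "Legal System."), and that the file is small enough to hold in memory as one string. The names NewlineCleanser and flattenQuotedNewlines are made up for illustration.

```scala
object NewlineCleanser {

  /** Replaces each run of line breaks that falls inside a quoted field with
    * one space. Line breaks outside quotes (the real record separators) are
    * kept unchanged.
    *
    * Assumption: quote parity decides whether we are inside a field, so an
    * unpaired embedded quote would throw the state off.
    */
  def flattenQuotedNewlines(text: String): String = {
    val out = new StringBuilder(text.length)
    var inQuotes = false      // true while between an opening and closing quote
    var prevWasBreak = false  // lets "\r\n" or "\n\n" collapse to one space

    for (ch <- text) {
      ch match {
        case '"' =>
          inQuotes = !inQuotes
          out.append(ch)
          prevWasBreak = false
        case '\n' | '\r' if inQuotes =>
          if (!prevWasBreak) out.append(' ')
          prevWasBreak = true
        case _ =>
          out.append(ch)
          prevWasBreak = false
      }
    }
    out.toString
  }
}

object NewlineCleanserDemo extends App {
  // Two records; the first has line breaks and a quoted word in field 2.
  val raw = "\"a\"|\"x\ny \"q\" z\n\"|\"ok\"\n\"b\"|\"c\"|\"d\"\n"
  print(NewlineCleanser.flattenQuotedNewlines(raw))
}
```

The cleaned text could then be written back to a file and read with the same pipe-delimiter options as in the question. How Spark treats the remaining embedded quotes may differ between versions, so check the parsed columns on your own data.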

Thanks,
Satz

New Contributor

@satz

I have a similar issue. Could you share the code snippet you used for the data cleansing (removing the newlines)?