Created on 03-04-2020 10:04 AM - last edited on 03-04-2020 02:06 PM by ask_bill_brooks
Hi,
I have a CSV file (pipe-delimited fields) with 3 columns. The value in the 2nd column contains newlines and a quoted string, as shown below:
"2020-02-23 11:15:39"|"
Hi Craig,
Please approve the standard pricing.
No further amendments made "Legal System."
Justification
-XXX is the sole owner in China
Thank you.
"|"Approved"
I am trying to read the file using the Spark read CSV API, but it is not able to read/parse the file correctly. I am using Spark 2.3.0. Below are the commands used:
val path = "/user/1234/abc.csv"
val inputDf = spark.read
  .option("delimiter", "|").option("wholeFile", true).option("multiline", true)
  .option("header", false).option("inferSchema", false).csv(path)
Could you please help me out?
Note: there is a quoted string within the second field ("Legal System.").
Thanks
Created 03-04-2020 08:34 PM
Any help on the above query?
Created 03-04-2020 09:30 PM
Hello @ravisro,
I don't think there is a straightforward way to do this; in my view we would need to perform some data cleansing before feeding the file to Spark. The input you shared contains newline characters (\n) inside the quoted field, which can confuse Spark about where one record ends and the next begins.
I did some data cleansing, i.e. removed the newlines from the input, and running the same read code from Spark then gave me the result below:
inputDf.show(false)
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|_c0                |_c1                                                                                                                                                |_c2     |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|2020-02-23 11:15:39|"Hi Craig, Please approve the standard pricing. No further amendments made "Legal System."Justification -XXX is the sole owner in China Thank you."|Approved|
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+
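One possible way to do that newline removal, as a minimal sketch: it assumes the code runs in spark-shell (so spark already exists), that every new logical record starts with a quoted timestamp like "2020-02-23 11:15:39"|, and that the file is small enough to collect to the driver. The cleaned output path and the record-start regex are illustrative assumptions, not the responder's actual snippet.
import spark.implicits._

val rawPath     = "/user/1234/abc.csv"            // path from the question
val cleanedPath = "/user/1234/abc_cleaned.csv"    // hypothetical output location

// Assumption: a new logical record begins with a quoted timestamp followed by the pipe delimiter.
val recordStart = """^"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"\|""".r

// Collect the physical lines and fold them into logical records: any line that does not
// look like the start of a record is treated as a continuation and appended to the
// previous record with a space. This is what removes the embedded newlines.
val lines = spark.sparkContext.textFile(rawPath).collect()
val records = lines.foldLeft(Vector.empty[String]) { (acc, line) =>
  if (acc.isEmpty || recordStart.findFirstIn(line).isDefined) acc :+ line
  else acc.init :+ (acc.last + " " + line.trim)
}

// Write the cleansed records back out as plain text, one record per line,
// then read the result as a normal pipe-delimited CSV.
records.toDF("value").coalesce(1).write.mode("overwrite").text(cleanedPath)

val inputDf = spark.read
  .option("delimiter", "|")
  .option("header", false)
  .option("inferSchema", false)
  .csv(cleanedPath)

inputDf.show(false)
Note that, as in the output above, the unescaped inner quotes around "Legal System." would still end up inside _c1; stripping or escaping those would need a further cleanup step.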
Created 04-01-2021 09:38 AM
I have a similar issue. Is it possible to share the data cleansing (newline removal) code snippet?