Problem in reading CSV file using Apache Spark

New Contributor

Hi,

I have a pipe-delimited CSV file with 3 columns. The value in the 2nd column contains newlines and a quoted string, as shown below:

 

"2020-02-23 11:15:39"|"

Hi Craig,

Please approve the standard pricing.

 

No further amendments made "Legal System."

 

Justification

-XXX is the sole owner in China

Thank you.

"|"Approved"

 

I am trying to read the file with the Spark CSV reader (spark.read.csv), but it does not parse the file correctly. I am using Spark 2.3.0. These are the commands I used:

 

val path = "/user/1234/abc.csv"
val inputDf = spark.read.option("delimiter","|").option("wholeFile",true).option("multiline",true).option("header",false).option("inferSchema",false).csv(path)

 

 

Could you please help me out?

Note: the second field contains a phrase in double quotes ("Legal System.").

 

Thanks

2 REPLIES

Re: Problem in reading CSV file using Apache Spark

New Contributor

Any help on the above query?

Re: Problem in reading CSV file using Apache Spark

Expert Contributor

Hello @ravisro ,

 

I don't think there is a straightforward way to do this; in my view, some data cleansing is needed before feeding the file to Spark. The shared input contains newline characters (\n) inside a field, which probably makes it hard for Spark to tell those newlines apart from record boundaries.

 

After some data cleansing (removing the newlines), I got the result below:

 

inputDf.show(false)
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|_c0                |_c1                                                                                                                                                |_c2     |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|2020-02-23 11:15:39|"Hi Craig, Please approve the standard pricing. No further amendments made "Legal System."Justification -XXX is the sole owner in China Thank you."|Approved|
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+--------+

This is the output of running the same code in Spark on the cleansed input.
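
For illustration, here is a minimal sketch of one way such a cleansing step might look, written in plain Scala with no Spark dependency. It is not the exact code used to produce the output above. It assumes that every field is wrapped in double quotes, that the exact sequence "|" (quote, pipe, quote) never occurs inside a field, and that each record has three fields. The names PipeRecordCleanser and joinRecords are hypothetical and are not part of any Spark API.

```scala
import java.util.regex.Pattern

object PipeRecordCleanser {
  // Assumption: fields are separated by the literal sequence "|"
  // (closing quote, pipe, opening quote) and it never appears inside a field.
  private val Separator = "\"|\""

  // Split on the literal separator; limit -1 keeps empty trailing fields.
  private def splitFields(text: String): Array[String] =
    text.split(Pattern.quote(Separator), -1)

  /** Joins physical lines into logical records and returns the fields of
    * each record. Newlines inside a field are replaced by a single space.
    * Embedded double quotes are kept as they are.
    */
  def joinRecords(lines: Seq[String], fieldCount: Int = 3): Vector[Vector[String]] = {
    val records = Vector.newBuilder[Vector[String]]
    val buffer = new StringBuilder

    for (line <- lines.map(_.trim) if line.nonEmpty) {
      if (buffer.nonEmpty) buffer.append(' ')
      buffer.append(line)

      val text = buffer.toString
      val quoted = text.length >= 2 && text.startsWith("\"") && text.endsWith("\"")
      if (quoted) {
        // Drop the outer quotes, then split on the separator.
        val parts = splitFields(text.substring(1, text.length - 1))
        // The record is complete only once all expected fields are present.
        if (parts.length == fieldCount) {
          records += parts.map(_.trim).toVector
          buffer.clear()
        }
      }
    }
    records.result()
  }
}
```

Applied to the lines of the sample record from the question, this should produce a single record with three fields, keeping the inner quotes around "Legal System." in the second field. The heuristic breaks if a field itself contains the "|" sequence, so it is worth checking against the real data first.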

Thanks,
Satz