
Reading CSV File Spark - Issue with Backslash

New Contributor

I'm facing a weird issue and I'm not sure why Spark is behaving like this.

samplefile.txt:

COL1|COL2|COL3|COL4 
"1st Data"|"2nd ""\\\\P"" data"|"3rd data"|"4th data"

 

This is my spark code to read data:

val df = spark.read.format("csv").option("header","true").option("inferSchema","true").option("delimiter","|").load("samplefile.txt")
df.show(false)

Somehow it is combining the data of two columns into one. Spark (Scala) version: 2.4

Any idea why Spark is behaving like this?

IMG-6762.JPG

1 ACCEPTED SOLUTION

Expert Contributor

Hi @ShobhitSingh 

 

You need to adjust the CSV file:

 

sample.csv
=========

 

COL1|COL2|COL3|COL4
1st Data|2nd|3rd data|4th data
1st Data|2nd \\P data|3rd data|4th data
"1st Data"|"2nd '\\P' data"|"3rd data"|"4th data"
"1st Data"|"2nd '\\\\P' data"|"3rd data"|"4th data"

 

Spark Code:

 

 

spark.read.format("csv").option("header","true").option("inferSchema","true").option("delimiter","|").load("/tmp/sample.csv").show(false)

 

Output:

+--------+--------------+----------+--------+
|COL1    |COL2          |COL3      |COL4    |
+--------+--------------+----------+--------+
|1st Data|2nd           |3rd data  |4th data|
|1st Data|2nd \\P data  |3rd data  |4th data|
|1st Data|2nd '\P' data |3rd data  |4th data|
|1st Data|2nd '\\P' data|3rd data  |4th data|
+--------+--------------+----------+--------+


4 REPLIES

Super Collaborator

@ShobhitSingh You need to handle the escape with another option:

 

.option("escape", "\\")

You may need to experiment with the actual string passed as the escape argument ("\\") to suit your needs. Be sure to check the Spark docs specific to your version. For example:

 

https://spark.apache.org/docs/latest/sql-data-sources-csv.html
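To make that experimentation easier, here is a minimal sketch that wraps the read from the question in a helper taking the escape character as a parameter. The helper name readPipeDelimited is hypothetical, and the snippet assumes a spark-shell session where `spark` is already defined (use :paste for the multi-line chains). Note that "\\" in Scala source is a single backslash character, which is also the documented default for the `escape` option, so passing it explicitly is not expected to change the result on its own.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not from the thread): read a pipe-delimited file
// with a caller-chosen escape character so different values can be compared.
def readPipeDelimited(spark: SparkSession, path: String, escapeChar: String): DataFrame =
  spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .option("escape", escapeChar) // must be a single character
    .load(path)

// "\\" is one backslash character in Scala source.
val withBackslash = readPipeDelimited(spark, "samplefile.txt", "\\")
withBackslash.show(false)
```

Calling the helper with other single-character values and comparing the show(false) output is a quick way to see which setting matches how the file was written.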

 

 

 

 

New Contributor

Hi Steven,

Even if my data looks like this, it still causes the issue:

"1st Data"|"2nd ""\P"" data"|"3rd data"|"4th data"

What is causing the issue? Any idea?

I know Spark's default escape character is the backslash, but why is it behaving like this?

Super Collaborator

Click into that doc and check out the other escape option.  I think you need to handle the quotes too.
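A setting that is often used for files that double their quotes inside quoted fields (as in "2nd ""\P"" data") is to make the quote character itself the escape character, so the backslash is no longer treated specially. The following is a sketch under that assumption, not a confirmed fix for this exact file. It assumes a spark-shell session where `spark` exists (use :paste), and the temporary file is only there to make the example reproducible.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Recreate the second sample row from the thread in a temporary local file.
val path = Files.createTempFile("samplefile", ".txt")
val content = Seq(
  "COL1|COL2|COL3|COL4",
  "\"1st Data\"|\"2nd \"\"\\P\"\" data\"|\"3rd data\"|\"4th data\""
).mkString("\n")
Files.write(path, content.getBytes(StandardCharsets.UTF_8))

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("quote", "\"")   // the default quote character, shown for clarity
  .option("escape", "\"")  // a doubled quote inside a quoted field escapes itself
  .load(path.toUri.toString) // file:// URI so the local file system is used

df.show(false)
```

If the columns still merge, the `charToEscapeQuoteEscaping` option described in the same documentation page is the next setting worth checking for your Spark version.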
