Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Escaping double quotes in spark dataframe

avatar

I am reading a csv file into a spark dataframe. i have the double quotes ("") in some of the fields and i want to escape it. can anyone let me know how can i do this?. since double quotes is used in the parameter list for options method, i dont know how to escape double quotes in the data

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|"). option("escape", -----

3 REPLIES 3

avatar

Assume you have a file /tmp/test.csv" like

Col1|Col2|Col3|Col4
12|34|"56|78"|9A
"AB"|"CD"|EF|"GH:"|:"IJ"

If I load it with Spark I get

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
                   .option("delimiter", "|").option("escape", ":").load("/tmp/test.csv")
df.show()
+----+----+-----+-------+
|Col1|Col2| Col3|   Col4|
+----+----+-----+-------+
|  12|  34|56|78|     9A|
|  AB|  CD|   EF|GH"|"IJ|
+----+----+-----+-------+

So the example contains delimiter in quotes and escaped quotes. I use ":" to escape quotes, you can many other characters (don't use e.g. "#")

Is this something you want to achieve?

avatar

@Bernhard Walter, my requirement is similar. File with data like

Col1|Col2|Col3|Col4

"AB"|"CD"|"DE"| "EF"

and i need an output like after loading into a dataframe

Col1|Col2|Col3|Col4

AB|CD|DE|EF

I dont see your suggestion working. How will escaping : escape doble quotes

avatar

There is an issue with the space in front of "EF":

Let's use (you don't need the "escape" option, it can be used to e.g. get quotes into the dataframe if needed)

val df = sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .option("delimiter", "|")
          .load("/tmp/test.csv")
df.show()

With space in front of "EF"

+----+----+----+-----+
|Col1|Col2|Col3| Col4|
+----+----+----+-----+
|  AB|  CD|  DE| "EF"|
+----+----+----+-----+

Without space in front of "EF":

+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  AB|  CD|  DE|  EF|
+----+----+----+----+

Can you remove the space before loading the csv into Spark?