Remove newlines in a quoted string with spark

Expert Contributor

I am trying to load a CSV into Spark but am having difficulty with newline characters inside quoted strings, e.g.:

"The csv 
file is about 
to be loaded into 
Phoenix"

How I want it:
"The csv file is about to be loaded into Phoenix"

How do I get around this?

1 REPLY

Super Collaborator

If your text always starts and ends with a ", then you can probably use the transformations below:

text.map(lambda x:(1,x)).reduceByKey(lambda x,y:' '.join([x,y])).map(lambda x:x[1][1:-1]).flatMap(lambda x:x.split('" "')).collect()

where text is an RDD read from the lines below

"The csv
file is about
to be loaded into
Phoenix"
"another line
to parse"

so that its elements look like:

['"The csv', 'file is about', 'to be loaded into', 'Phoenix"', '"another line', 'to parse"']

When the file is loaded, the lines are split on \n. The transformations join them back into a single line with spaces, strip the outermost quotes, and then split on '" "', so you get a list of the portions between successive pairs of quotes.
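
To make the three steps easier to follow, here is a minimal plain-Python sketch of the same join, strip, and split logic, run on the sample lines without Spark. The variable names (lines, joined, inner, records) are just for illustration. It slices with [1:-1], which removes exactly one quote character from each end of the joined string.

```python
from functools import reduce

# Physical lines as a line-based reader would return them (sample from this thread).
lines = [
    '"The csv', 'file is about', 'to be loaded into', 'Phoenix"',
    '"another line', 'to parse"',
]

# Step 1: join every line with a single space. This mirrors what the
# reduceByKey call does when all lines share the same key.
joined = reduce(lambda x, y: ' '.join([x, y]), lines)

# Step 2: drop the very first and very last quote character.
inner = joined[1:-1]

# Step 3: split on the closing-quote, space, opening-quote boundary
# that separates consecutive quoted records.
records = inner.split('" "')

print(records)
# ['The csv file is about to be loaded into Phoenix', 'another line to parse']
```

One caveat when carrying this over to Spark: reduceByKey expects a function that is commutative and associative, and joining strings is not commutative, so the order of the pieces is not guaranteed once the data spans several partitions. Treat the approach as suitable for small inputs, where everything ends up under one key on one machine anyway.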