Remove newlines in a quoted string with spark
- Labels: Apache Spark
Created 11-10-2017 07:18 AM
I am trying to load a CSV into Spark but am having difficulty with some newline characters inside quotes, e.g.
"The csv
file is about
to be loaded into
Phoenix"
How I want it:
"The csv file is about to be loaded into Phoenix"
How do I get around this?
Created 11-10-2017 09:13 AM
If your text always starts and ends with a ", you can probably use the transformations below:
text.map(lambda x: (1, x)).reduceByKey(lambda x, y: ' '.join([x, y])).map(lambda x: x[1][1:-1]).flatMap(lambda x: x.split('" "')).collect()
where text represents an RDD read from the lines below:
"The csv
file is about
to be loaded into
Phoenix"
"another line
to parse"
so that its elements look like:
['"The csv','file is about','to be loaded into','Phoenix"','"another line','to parse"']
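When the file is loaded, lines are split on \n. The transformations above join them back into a single line with spaces, strip the outermost quotes, and then split on " " (a closing quote, a space, and an opening quote), so you get a list with the portions between successive pairs of ".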
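To make the join-and-split logic easier to check without a cluster, here is a minimal plain-Python sketch of the same idea. The function name merge_quoted_lines is hypothetical and not part of Spark. It assumes every record starts and ends with a ", that records are separated only by line breaks, and that the lines arrive in file order.

```python
def merge_quoted_lines(lines):
    """Join physical lines and split them back into quoted records.

    Assumes each record is wrapped in double quotes and that `lines`
    is in the original file order (hypothetical helper, not a Spark API).
    """
    # Mirror the reduceByKey step: glue all physical lines together with spaces.
    joined = " ".join(lines)
    # Drop the opening quote of the first record and the closing quote of the last.
    stripped = joined[1:-1]
    # Each record boundary is now: closing quote, space, opening quote.
    return stripped.split('" "')


lines = [
    '"The csv', 'file is about', 'to be loaded into', 'Phoenix"',
    '"another line', 'to parse"',
]
records = merge_quoted_lines(lines)
print(records)
# ['The csv file is about to be loaded into Phoenix', 'another line to parse']
```

One caveat when moving this back to an RDD: reduceByKey expects a commutative and associative function, and string concatenation is not commutative, so the order of the joined lines may not be preserved once the data spans more than one partition. If you are on Spark 2.2 or later, the multiLine option of the CSV reader may be a simpler route for quoted fields that contain newlines.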
