
Spark CSV import getting failed due to presence of new line character in the cell

We tried to form a DataFrame in Spark by loading a CSV file. During the load we get the error below:

java.io.IOException: (startline 1) EOF reached before encapsulated token finished
    at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
    at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
    at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
    at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
    at com.databricks.spark.csv.CsvRelation$anonfun$com$databricks$spark$csv$CsvRelation$parseCSV$1.apply(CsvRelation.scala:206)
    at com.databricks.spark.csv.CsvRelation$anonfun$com$databricks$spark$csv$CsvRelation$parseCSV$1.apply(CsvRelation.scala:204)
    at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

We understand the issue is that one of the cell values (in the middle of a row) in the CSV contains a new line character (CRLF).

Note: the cell content is properly enclosed in double quotes, and we have also specified the double quote as the quote character when loading through the SQLContext. It still fails :)
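For reference, the load is done roughly like the sketch below (a minimal reproduction with spark-csv; the file path and header option are placeholders, the real file and schema differ):

import org.apache.spark.sql.SQLContext

// Minimal sketch of the failing load. "data/input.csv" is a placeholder path;
// the real file contains a double-quoted cell with an embedded CRLF.
val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // first line is the header
  .option("delimiter", ",")   // fields separated by commas
  .option("quote", "\"")      // cells are enclosed in double quotes
  .load("data/input.csv")

We expected the quote option to make the parser treat the embedded newline as part of the cell value, but the parse still fails with the error above.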

Any ideas or workarounds to solve this issue?

Thanks in advance.
