Spark CSV import fails when a cell contains a newline character

We tried to create a DataFrame in Spark by loading a CSV file. During the load, we get the following error:

(startline 1) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(
at org.apache.commons.csv.Lexer.nextToken(
at org.apache.commons.csv.CSVParser.nextRecord(
at org.apache.commons.csv.CSVParser.getRecords(
at com.databricks.spark.csv.CsvRelation$anonfun$com$databricks$spark$csv$CsvRelation$parseCSV$1.apply(CsvRelation.scala:206)
at com.databricks.spark.csv.CsvRelation$anonfun$com$databricks$spark$csv$CsvRelation$parseCSV$1.apply(CsvRelation.scala:204)
at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

As far as we can tell, the issue is that one cell in the CSV (in the middle of a row) contains a newline character (CRLF).

Note: The cell content is properly enclosed in double quotes, and we also specified the double quote as the delimiter when loading through the SQLContext. It still fails.
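To illustrate what we think is happening, here is a minimal sketch in plain Python (standard-library `csv` module, not Spark or spark-csv). The sample data and the function names `parse_whole` and `parse_line_by_line` are made up for illustration. It assumes the failure comes from the input being split into physical lines before each line is parsed on its own, which we have not confirmed in the spark-csv source.

```python
import csv
import io

# Hypothetical sample: the "comment" cell of the first data row contains a CRLF
# inside double quotes, similar to the cell in our file.
RAW = 'id,comment,score\r\n1,"first line\r\nsecond line",10\r\n2,"plain",20\r\n'


def parse_whole(text):
    """Parse the full text in one pass, so a quoted newline stays inside its cell."""
    return list(csv.reader(io.StringIO(text, newline=""), strict=True))


def parse_line_by_line(text):
    """Split on line breaks first, then parse each physical line independently."""
    rows, errors = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        try:
            rows.extend(csv.reader([line], strict=True))
        except csv.Error as exc:
            # The opening quote has no closing quote on this physical line.
            errors.append((number, str(exc)))
    return rows, errors


whole = parse_whole(RAW)
rows, errors = parse_line_by_line(RAW)

print("whole-file parse:", whole)
print("line-by-line rows:", rows)
print("line-by-line errors:", errors)
```

In this sketch the whole-file parse keeps the quoted cell intact, while the line-by-line parse hits an unterminated quoted field on the physical line where the cell starts. That looks like the same kind of failure as the "EOF reached before encapsulated token finished" message above, even though the quoting in the file itself is correct.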

Any ideas or workarounds to solve this issue?

Thanks in advance.