
Removing the first line from a CSV file and handling missing trailing delimiters when column values are empty, using Spark



I have a tab-separated file like this:


Copyright details 2021

ID \t NUMBER \t ADDRESS \t ZIPCODE

10 \t 9877777777 \t India \t 400322

13 \t 2983359852 \t AUS \t 84534

26 \t 7832724424

34 \t 8238444294 \t RSA \t 74363


Here the first row is a comment, and the row with ID 26 is missing its last two column values; it doesn't even have a trailing \t. So I need to read the file skipping the first line, and handle the missing delimiters at the end.

I tried this:

import org.apache.spark.sql.DataFrame
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val data = sc.textFile("FilePath")

// Drop the copyright comment (first line of the file)
val comments = data.first()
val fdata = data.filter(x => x != comments)

// Drop the header line (first line after the comment)
val c2 = fdata.first()
val fdata2 = fdata.filter(x => x != c2)

// Recover the header row to use as column names
val header = fdata.filter(x => x.split("\t")(1) == "NUMBER").collect().mkString

val df = fdata2
  .map(x => x.split("\t"))
  .map(x => (x(0), x(1), x(2), x(3)))   // fails when a row has fewer than 4 fields
  .toDF(header.split("\t"): _*)
df.show()


Since the \t delimiters are missing at the end of lines where values are empty, I get an ArrayIndexOutOfBoundsException, because when converting the RDD to a DataFrame some rows have fewer fields than the header.
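The failing step can be patched by padding short rows before building the tuple. A minimal sketch, assuming the 4-column layout from the sample: `split("\t", -1)` keeps trailing empty fields, and `padTo` fills the genuinely missing ones with null.

```scala
val df = fdata2
  .map(_.split("\t", -1).map(_.trim).padTo(4, null: String)) // pad short rows to 4 fields with nulls
  .map(a => (a(0), a(1), a(2), a(3)))
  .toDF(header.split("\t").map(_.trim): _*)
df.show()
```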


Please suggest a better solution, so that I can skip the first line and read the file correctly (even where there is no \t, the code should treat the missing fields at the end as NULL values, like below):


ID NUMBER ADDRESS ZIPCODE

10 9877777777 India 400322

13 2983359852 AUS 84534

26 7832724424 NULL NULL

34 8238444294 RSA 74363
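For reference, a cleaner route (assuming Spark 2.2+, where `spark.read.csv` accepts a `Dataset[String]`) is to drop the comment line as text and let the CSV reader handle the header and the short rows; in the reader's default PERMISSIVE mode, missing trailing columns come back as null:

```scala
import spark.implicits._

val lines = spark.read.textFile("FilePath")   // Dataset[String]
val firstLine = lines.first()
val noComment = lines.filter(_ != firstLine)  // drop the copyright comment

val df = spark.read
  .option("sep", "\t")                        // tab-separated
  .option("header", "true")                   // first remaining line is the header
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .csv(noComment)                             // short rows padded with nulls (PERMISSIVE mode)

df.show()
```

Filtering by value (`_ != firstLine`) mirrors the original attempt; filtering by index via `zipWithIndex` would be safer if the comment text could ever repeat in the data.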
