Here the first row is a comment, and the row with ID 26 is missing its trailing column values — it doesn't even have a \t at the end. So I need to read the file skipping the first line and handle the missing delimiters at the ends of lines.
I tried this:
```scala
import org.apache.spark.sql.DataFrame

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val data = sc.textFile("FilePath")

// drop the leading comment line
val comments = data.first()
val fdata = data.filter(x => x != comments)

// drop the header line as well (first line of the remaining data)
val c2 = fdata.first()
val fdata2 = fdata.filter(x => x != c2)

// recover the header names (the row whose second field is "NUMBER")
val header = fdata.filter(x => x.split("\t")(1) == "NUMBER").collect().mkString

val df = fdata2
  .map(x => x.split("\t"))
  .map(x => (x(0), x(1), x(2), x(3)))
  .toDF(header.split("\t"): _*)
df.show()
```
Since the lines with empty trailing fields are missing their final \t characters, `split` produces fewer elements for those rows, and I get an `ArrayIndexOutOfBoundsException` when converting the RDD to a DataFrame.
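For what it's worth, this is the default behaviour of `String.split` in Java/Scala: trailing empty strings are discarded unless a negative limit is passed. A small pure-Scala illustration (the row content here is made up):

```scala
// Java/Scala String.split drops trailing empty strings by default,
// which is why short rows yield short arrays.
val row = "26\tNUMBER\t\t"        // hypothetical row with empty trailing fields

val dropped = row.split("\t")     // trailing empties removed
val kept    = row.split("\t", -1) // limit = -1 keeps them

println(dropped.length)           // 2
println(kept.length)              // 4
```

So even rows that *do* end in \t lose their empty tail fields; rows with no trailing \t at all are shorter still.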
Please suggest a better approach so that I can skip the first line and read the file correctly — even when the trailing \t characters are absent, the code should treat the missing fields as NULL values, like below.
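One possible sketch (not a definitive answer — it assumes a fixed four-column layout and that `normalize` is a helper name I made up): split with a negative limit so trailing empty fields survive, pad rows that are still short, and map empty strings to null before building the DataFrame.

```scala
// Helper: turn one raw line into exactly n fields, using null for anything missing.
def normalize(line: String, n: Int): Array[String] =
  line.split("\t", -1)                       // limit -1 keeps trailing empty fields
      .padTo(n, "")                          // pad rows that lack the trailing \t entirely
      .map(s => if (s.isEmpty) null else s)  // empty field -> NULL

// A short row like the ID-26 one now comes back with 4 slots:
val fixed = normalize("26\tNUMBER", 4)
println(fixed.mkString("|"))                 // 26|NUMBER|null|null

// In the Spark job above, the mapping step would then become (hypothetical):
// val df = fdata2
//   .map(line => normalize(line, 4))
//   .map(a => (a(0), a(1), a(2), a(3)))
//   .toDF(header.split("\t"): _*)
```

The tuple fields are plain `String`s, so the nulls carry through `toDF` as SQL NULLs in the resulting DataFrame.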