Member since
03-31-2016
5
Posts
3
Kudos Received
0
Solutions
03-03-2021
09:51 PM
I have a tab separated file like this Copyright details 2021 ID \t NUMBER \t ADDRESS \t ZIPCODE 10 \t 9877777777 \t India \t 400322 13 \t 2983359852 \t AUS \t 84534 26 \t 7832724424 34 \t 8238444294 \t RSA \t 74363 Here the first row is a comment and the row with ID 26 doesn't have ending columns values. Even it doesn't have \t at the end . So I need to read file skipping first line and handle missing delimiters at end. I tried this import org.apache.spark.sql.DataFrame val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val data = sc.textFile("sample_badtsv.txt") val comments = data.first() val fdata = data.filter(x => x != comments) val header = fdata.filter(x => x.split("\t")(1) == "NUMBER").collect().mkString val df = fdata.filter(x => x.split("\t")(1) != "NUMBER") .map(x => x.split("\t")) .map(x => (x(0),x(1),x(2),x(3))) .toDF(header.split("\t"): _*) Since I have missing \t at the end of lines if empty, I am getting ArrayIndexoutofBoundsException. because when converting the rdd to dataframe we have less records for some rows. Please provide me a better solution so that I can skip first line and read the file correctly (even there are no \t the code needs to consider it as NULL values at the end like below) ID NUMBER ADDRESS ZIPCODE 10 9877777777 India 400322 13 2983359852 AUS 84534 26 7832724424 NULL NULL 34 8238444294 RSA 74363
... View more
04-26-2016
07:02 PM
The above 2 answers are very good. One caveat: keep in mind that when reading compressed file formats from disk, Spark partitioning depends on whether the format is splittable. For instance, these formats are splittable: bzip2, snappy, LZO (if indexed), while gzip is not splittable. Here is documentation about why: http://comphadoop.weebly.com/
... View more