Member since 
    
	
		
		
		03-03-2021
	
	
	
	
	
	
	
	
	
	
	
	
	
	
			
      
                3
            
            
                Posts
            
        
                0
            
            
                Kudos Received
            
        
                0
            
            
                Solutions
            
        
			
    
	
		
		
		03-03-2021
	
		
		09:51 PM
	
	
	
	
	
	
	
	
	
	
	
	
	
	
		
	
				
		
			
					
				
		
	
		
					
							 I have a tab separated file like this     Copyright details 2021  ID \t NUMBER \t ADDRESS \t ZIPCODE  10 \t 9877777777 \t India \t 400322  13 \t 2983359852 \t AUS \t 84534  26 \t 7832724424  34 \t 8238444294 \t RSA \t 74363    Here the first row is a comment and the row with ID 26 doesn't have ending columns values. Even it doesn't have \t at the end .  So I need to read file skipping first line and handle missing delimiters at end.  I tried this  import org.apache.spark.sql.DataFrame  val sqlContext = new org.apache.spark.sql.SQLContext(sc)  import sqlContext.implicits._  val data = sc.textFile("sample_badtsv.txt")  val comments = data.first()  val fdata = data.filter(x => x != comments)    val header = fdata.filter(x => x.split("\t")(1) == "NUMBER").collect().mkString  val df = fdata.filter(x => x.split("\t")(1) != "NUMBER")  .map(x => x.split("\t"))  .map(x => (x(0),x(1),x(2),x(3)))  .toDF(header.split("\t"): _*)        Since I have missing \t at the end of lines if empty, I am getting ArrayIndexoutofBoundsException.  because when converting the rdd to dataframe we have less records for some rows.     Please provide me a better solution so that I can skip first line and read the file correctly (even there are no \t the code needs to consider it as NULL values at the end like below)     ID NUMBER ADDRESS ZIPCODE  10 9877777777 India 400322  13 2983359852 AUS 84534  26 7832724424 NULL NULL  34 8238444294 RSA 74363    
						
					
					... View more
				
			
			
			
			
			
			
			
			
			
		 
        
