
Load CSV File With Date Column

I am facing issues while loading a CSV file that contains a date column: the whole column ends up null.
Question: How can I specify the date format so that the dates are loaded correctly?


I read the file (a text file on a Unix file system), apply the schema, and write it out as Parquet.
(The Parquet write step does not matter here, because the data frame already holds incorrect data right after loading.)
The problem is that the date column is not read correctly, and every value in it ends up null.

// This is how I read the file
val df = spark.read
  .option("inferSchema", "false")
  .option("delimiter", tstFile.delim)
  .option("dateFormat", tstFile.dateFormat)
  .option("timestampFormat", tstFile.dateTimeFormat)
  .option("header", "true") // use the headers from the file
  .csv(tstFile.unixPath)
  .toDF()

// Then I apply the schema
val dfWithTypes = tstFile.withSchema(df)
dfWithTypes.show()
dfWithTypes.write.parquet(targetPath)

// I use a simple case class to provide metadata about this type of file
case class TestFile(unixPath: String, delim: String, newName: String) extends TstFile {
  override val dateFormat: String = "MM/dd/yyyy"
  override val dateTimeFormat: String = "MM/dd/yyyy"
  override val headers = Seq.empty[String] // the file has headers

  // This is how I apply the schema
  override def withSchema(df: DataFrame): DataFrame = {
    df.select(
      df("Column1").cast(IntegerType),
      df("Column2").cast(StringType),
      df("Column3").cast(DateType))
  }
}
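The null can be reproduced without reading any file at all. Below is a minimal standalone sketch (the SparkSession setup, object name, and in-memory sample row are invented for the example): if I understand Spark's behavior correctly, `cast(DateType)` does not consult the reader's `dateFormat` option at all and expects the default `yyyy-MM-dd` pattern, whereas `to_date` from `org.apache.spark.sql.functions` with an explicit pattern parses the same string.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date
import org.apache.spark.sql.types.DateType

object DateCastRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("date-cast-repro")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // One row mirroring the sample file; the date is still a plain string.
    val df = Seq(("100", "One", "05/01/2019")).toDF("Column1", "Column2", "Column3")

    // cast(DateType) ignores the reader's dateFormat option and expects
    // Spark's default pattern (yyyy-MM-dd), so this column comes out null.
    df.select(df("Column3").cast(DateType).as("casted")).show()

    // to_date with an explicit pattern parses the same string correctly.
    df.select(to_date(df("Column3"), "MM/dd/yyyy").as("parsed")).show()

    spark.stop()
  }
}
```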

Input data

cat TestFile.txt

Column1 Column2 Column3
100     One     05/01/2019

This is how the data frame is printed by df.show():

+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
|    100|    One|   null|
+-------+-------+-------+
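For comparison, here is a sketch of supplying the schema at read time instead of casting afterwards, assuming the reader's `dateFormat` option only takes effect when the column is already typed `DateType` in the read schema. The temp file, tab delimiter, and object name are placeholders standing in for the real `tstFile` values, which are not shown above:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, IntegerType, StringType, StructField, StructType}

object ReadWithSchemaSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the real file: same header and row, tab-delimited here
    // because the actual tstFile.delim is not shown in the post.
    val path = Files.createTempFile("TestFile", ".txt")
    Files.write(path, "Column1\tColumn2\tColumn3\n100\tOne\t05/01/2019".getBytes("UTF-8"))

    val spark = SparkSession.builder()
      .appName("read-with-schema-sketch")
      .master("local[1]")
      .getOrCreate()

    // Typing Column3 as DateType in the read schema lets the CSV reader
    // apply the dateFormat option while parsing, instead of casting later.
    val schema = StructType(Seq(
      StructField("Column1", IntegerType),
      StructField("Column2", StringType),
      StructField("Column3", DateType)))

    val df = spark.read
      .option("header", "true")
      .option("delimiter", "\t")
      .option("dateFormat", "MM/dd/yyyy")
      .schema(schema)
      .csv(path.toString)

    df.show()
    spark.stop()
  }
}
```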