We are trying to figure out the Spark Scala commands to write a timestamp value to Parquet so that it doesn't change when Impala tries to read it from an external table. Spark is behaving like Hive, writing the timestamp value in the local time zone, which is what we are trying to avoid. Can you provide some assistance?
Here is what we tried:
scala> val df1= Seq("1", "2016-10-01").zipWithIndex.map(_.swap).toDF("id", "bday")
df1: org.apache.spark.sql.DataFrame = [id: int, bday: string]
scala> import org.apache.spark.sql.types._
Cast 2016-10-01 as timestamp
scala> val df2 = df1.withColumn("bday",df1("bday").cast(TimestampType))
df2: org.apache.spark.sql.DataFrame = [id: int, bday: timestamp]
scala> df2.show
+---+--------------------+
| id|                bday|
+---+--------------------+
|  0|                null|
|  1|2016-10-01 00:00:...|
+---+--------------------+
Write the dataframe to parquet
scala> df2.select("id", "bday").write.save("/user/hive/warehouse/edw.db/scalatable/test2.parquet")
When I define an external table over test2.parquet and select through Impala, I get 2016-10-01 04:00:00. How is that possible? The data on disk is clearly 2016-10-01 00:00.
Look at those two flags:
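They are not named in the text above; the pair usually cited for this Hive/Impala Parquet timestamp mismatch are the impalad startup options below. Treat this as a pointer only and verify against your Impala version's documentation:

# impalad startup options, set on every Impala daemon and then restart:
--convert_legacy_hive_parquet_utc_timestamps=true    # treat INT96 timestamps written by Hive as UTC and convert them to local time on read
--use_local_tz_for_unix_timestamp_conversions=true   # make from_unixtime()/unix_timestamp() use the local time zone instead of UTC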
Long story short, Impala and Hive don't treat dates the same way, even if the documentation says so. I don't remember exactly which one does what, but one of them assumes that all the dates are in "GMT" while the other looks at your server's time zone.
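To make that concrete, here is a hedged check (assuming the file written above and a Spark 2.x shell; use sqlContext.read in 1.6): casting the timestamp column to a long exposes the epoch second that is actually stored, independent of how either engine chooses to display it.

// Read back the file written above and look at the stored instant.
// A timestamp cast to long is seconds since 1970-01-01 UTC, so it is not
// affected by the display time zone of Spark or Impala.
val written = spark.read.parquet("/user/hive/warehouse/edw.db/scalatable/test2.parquet")
written.select(written("id"), written("bday"), written("bday").cast("long").as("epoch_seconds")).show()
// On a server in EDT (UTC-4), "2016-10-01 00:00" local is stored as
// 1475294400 = 2016-10-01 04:00:00 UTC, which is exactly what Impala prints.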
Yes, that is the tricky part. If you insert into a Parquet table with Hive, the timestamp value comes back different when you read it with Impala without these flags. The flags functionally work OK. However, when we tried them in our environment they caused a 2x-3x performance drop in queries, so we abandoned that idea.
That is why we are trying to find a way in Spark to write a timestamp field to Parquet as UTC (the default and only option for timestamps in Parquet).
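One candidate approach (a sketch only, not something verified in this thread) is to run the Spark JVM with UTC as its default time zone, so the cast from string to timestamp is interpreted against UTC instead of the server's local zone. Assumptions: spark-shell with a Spark 1.6/2.0-era build where casts follow the JVM default time zone, the output path is hypothetical, and on a real cluster the executors need -Duser.timezone=UTC as well (e.g. via spark.executor.extraJavaOptions).

import java.util.TimeZone
import org.apache.spark.sql.types._

// Force the driver JVM to interpret timestamp strings as UTC before casting.
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

val df1 = Seq("1", "2016-10-01").zipWithIndex.map(_.swap).toDF("id", "bday")
val df2 = df1.withColumn("bday", df1("bday").cast(TimestampType))

// "2016-10-01" is now stored as 2016-10-01 00:00:00 UTC, which is the wall-clock
// value Impala should display, since it reads Parquet timestamps as UTC.
df2.write.save("/user/hive/warehouse/edw.db/scalatable/test_utc.parquet")  // hypothetical path

Newer Spark releases (2.2 and later) also expose spark.sql.session.timeZone, which may be a less invasive knob to check for the same effect.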
In my tests, any date-related function in Impala has a huge cost (from_unixtime especially)...
If you find a better way of doing it properly, tell me :)