Explorer
Posts: 19
Registered: ‎10-15-2014

Scala to write timestamp to Parquet that Impala reads without changing the value

We are trying to figure out the Spark Scala commands to write a timestamp value to Parquet so that it doesn't change when Impala reads it from an external table.  Spark behaves like Hive: it writes the timestamp value in the local time zone, which is what we are trying to avoid.  Can you provide some assistance?

 

Here is what we tried:

 

scala> val df1 = Seq("1", "2016-10-01").zipWithIndex.map(_.swap).toDF("id", "bday")

df1: org.apache.spark.sql.DataFrame = [id: int, bday: string]


scala> import org.apache.spark.sql.types._

import org.apache.spark.sql.types._

 

Cast 2016-10-01 as timestamp

 

scala> val df2 = df1.withColumn("bday",df1("bday").cast(TimestampType))

df2: org.apache.spark.sql.DataFrame = [id: int, bday: timestamp]

 

scala> df2.show()

+---+--------------------+
| id|                bday|
+---+--------------------+
|  0|                null|
|  1|2016-10-01 00:00:...|
+---+--------------------+

 

Write the DataFrame to Parquet (the null in the first row is just the string "1" failing the timestamp cast; it is unrelated to the time-zone issue)

 

scala> df2.select("id", "bday").write.save("/user/hive/warehouse/edw.db/scalatable/test2.parquet")

 

When I define an external table over test2.parquet and select through Impala, I get 2016-10-01 04:00:00. How is that possible? The data on disk is clearly 2016-10-01 00:00.
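
For what it's worth, reading the file back through Spark round-trips cleanly, which suggests the difference is in how each reader interprets the stored value rather than in the bytes themselves (a sketch, assuming the pre-2.0 sqlContext from the session above; on Spark 2.x it would be spark.read):

scala> sqlContext.read.parquet("/user/hive/warehouse/edw.db/scalatable/test2.parquet").show()
// shows 2016-10-01 00:00:... again, unlike the Impala select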

Contributor
Posts: 35
Registered: ‎04-07-2016

Re: Scala to write timestamp to Parquet that Impala reads without changing the value


Hi,

have a look at these two flags:

-convert_legacy_hive_parquet_utc_timestamps=true
-use_local_tz_for_unix_timestamp_conversions=true

 

Long story short, Impala and Hive don't treat timestamps the same way, even if the documentation says they do. I don't remember exactly which one does what, but one of them assumes that all the timestamps are in "GMT" while the other looks at your server's time zone.
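
To make that concrete, here is a toy JVM-only illustration (no Spark or Impala involved) of the same stored instant rendered under those two interpretations; America/New_York is assumed here, matching the 4-hour shift in the original post:

import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

val epochMillis = 1475294400000L  // the instant 2016-10-01 04:00:00 UTC
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
println(fmt.format(new Date(epochMillis)))  // 2016-10-01 04:00:00 -- the "GMT" reading

fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"))
println(fmt.format(new Date(epochMillis)))  // 2016-10-01 00:00:00 -- the server-zone reading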

Explorer
Posts: 19
Registered: ‎10-15-2014

Re: Scala to write timestamp to Parquet that Impala reads without changing the value

Hi Maurin,

 

Yes, that is the tricky part.  If you insert into a Parquet table with Hive, the timestamp value is different when you read it back with Impala without these flags.  The flags work correctly from a functional standpoint.  However, when we tried them in our environment they caused a 2x - 3x performance drop in our queries, so we abandoned that idea.

 

That is why we are trying to find a way in Spark to write a timestamp field into Parquet as a UTC time zone (the default, and only, option for timestamps in Parquet).
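
A rough sketch of one direction that might work (not validated here; df2 is the frame from the first post): shift each timestamp by the server's UTC offset before writing, so the UTC instant Spark stores carries the same wall-clock value Impala will display.

import java.util.TimeZone

// Offset of the server's zone at the current instant; strictly, the offset
// at each timestamp's own date would be needed across DST transitions.
val offsetSeconds = TimeZone.getDefault.getOffset(System.currentTimeMillis()) / 1000

// timestamp -> epoch seconds, apply the shift, cast back to timestamp
val df3 = df2.withColumn("bday",
  (df2("bday").cast("long") + offsetSeconds).cast("timestamp"))

df3.write.save("/user/hive/warehouse/edw.db/scalatable/test3.parquet")

The obvious caveat: Hive and Spark would then read these values shifted the other way, so this only helps if Impala is the sole reader of the table.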

 

Thanks,

Brett

Contributor
Posts: 35
Registered: ‎04-07-2016

Re: Scala to write timestamp to Parquet that Impala reads without changing the value

Hi,

 

In my tests, any date-related function in Impala has huge costs (from_unixtime especially)...

If you find a better way of doing it properly, tell me :)

 

thanks
