
Scala to write timestamp to Parquet that Impala reads without changing the value


Explorer

We are trying to figure out the Spark Scala commands to write a timestamp value to Parquet that doesn't change when Impala tries to read it from an external table.  Spark is behaving like Hive in that it writes the timestamp value in the local time zone, which is what we are trying to avoid.  Can you provide some assistance?

 

Here is what we tried:

 

scala> val df1= Seq("1", "2016-10-01").zipWithIndex.map(_.swap).toDF("id", "bday")

df1: org.apache.spark.sql.DataFrame = [id: int, bday: string]

 

 

 

scala> import org.apache.spark.sql.types._

import org.apache.spark.sql.types._

 

Cast 2016-10-01 as timestamp

 

scala> val df2 = df1.withColumn("bday",df1("bday").cast(TimestampType))

df2: org.apache.spark.sql.DataFrame = [id: int, bday: timestamp]

 

scala> df2.show()

+---+--------------------+

| id|                bday|

+---+--------------------+

|  0|                null|

|  1|2016-10-01 00:00:...|

+---+--------------------+

 

Write the dataframe to parquet

 

scala> df2.select("id", "bday").write.save("/user/hive/warehouse/edw.db/scalatable/test2.parquet")

 

When I define an external table over test2.parquet and select through Impala, I get 2016-10-01 04:00:00. How is that possible? The data on disk is clearly 2016-10-01 00:00.
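Here is what I believe is happening, sketched with plain java.time (assuming the cluster's local zone is America/New_York, which matches the 4-hour shift observed): Spark, like Hive, interprets the cast string in the local zone and stores the corresponding UTC instant in Parquet, while Impala reads the stored value as-is and renders it without conversion.

```scala
import java.time._

// Assumption: the cluster's local zone is America/New_York (UTC-4 in October 2016),
// consistent with the 4-hour shift observed in the Impala query.
val zone = ZoneId.of("America/New_York")

// Spark interprets the cast string "2016-10-01" as a wall-clock time in the local zone...
val local = LocalDateTime.of(2016, 10, 1, 0, 0)

// ...and stores the corresponding instant (a UTC value) in the Parquet file.
val stored = local.atZone(zone).toInstant

// Impala reads the stored value without any conversion and renders it as-is,
// which is why midnight local time comes back as 04:00.
val impalaShows = LocalDateTime.ofInstant(stored, ZoneOffset.UTC)
println(impalaShows)
```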

3 REPLIES

Re: Scala to write timestamp to Parquet that Impala reads without changing the value

Contributor

Hi,

Take a look at these two flags:

-convert_legacy_hive_parquet_utc_timestamps=true
-use_local_tz_for_unix_timestamp_conversions=true

 

Long story short, Impala and Hive don't treat dates the same way, even if the documentation says so. I don't remember exactly which one does what, but one of them assumes that all dates are in GMT while the other looks at your server's time zone.
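For reference, these are impalad startup flags, so they have to be set at daemon start. A sketch of where they might go on a package-based install (the file path and variable name are assumptions and vary by install; with Cloudera Manager you would instead add them to the Impala Daemon command-line argument safety valve):

```shell
# /etc/default/impala (package install; restart impalad afterwards)
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
  -convert_legacy_hive_parquet_utc_timestamps=true \
  -use_local_tz_for_unix_timestamp_conversions=true"
```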

Re: Scala to write timestamp to Parquet that Impala reads without changing the value

Explorer

Hi Maurin,

 

Yes, that is the tricky part.  If you insert into a Parquet table with Hive, then the timestamp value is different when you read it back with Impala without these flags.  The flags do work functionally.  However, we tried them and in our environment they caused a 2x - 3x performance drop in queries, so we abandoned that idea.

 

That is why we are trying to find a way in Spark to write a timestamp field into Parquet as a UTC time zone (the default and only option for timestamps in Parquet).
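One possible workaround (a sketch, not tested against this setup): pre-shift the value before writing, so that the UTC instant Spark stores equals the wall-clock value you want Impala to display. In Spark this shift is what org.apache.spark.sql.functions.from_utc_timestamp(col, tz) does, e.g. df2.withColumn("bday", from_utc_timestamp(df2("bday"), "America/New_York")). The arithmetic behind it, in plain java.time (the America/New_York zone is an assumption matching the 4-hour shift above):

```scala
import java.time._

// Assumption: the cluster zone is America/New_York (matches the 4-hour shift seen above).
val zone = ZoneId.of("America/New_York")

// The wall-clock value we want Impala to display:
val wanted = LocalDateTime.of(2016, 10, 1, 0, 0)

// Pre-shift: reinterpret the naive value as UTC and move it into the local zone.
// This is the arithmetic from_utc_timestamp performs in Spark.
val preShifted = wanted.atOffset(ZoneOffset.UTC).atZoneSameInstant(zone).toLocalDateTime
println(preShifted)

// Spark then stores the instant this local value denotes: exactly midnight UTC.
val stored = preShifted.atZone(zone).toInstant
println(stored)

// Impala, reading the raw value, now renders midnight as desired:
val impalaShows = LocalDateTime.ofInstant(stored, ZoneOffset.UTC)
println(impalaShows)
```

The obvious caveat is that the same data read back through Spark or Hive will then display shifted the other way, so this only helps if Impala is the primary reader.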

 

Thanks,

Brett


Re: Scala to write timestamp to Parquet that Impala reads without changing the value

Contributor

Hi,

 

In my tests, any date-related function in Impala has a huge cost (from_unixtime especially)...

If you find a better way of doing it properly, let me know :)

 

Thanks