Writing Timestamp columns in Parquet Files through HIVE / SPARK

Explorer

Hi Cloudera people,

 

I am really surprised reading this documentation about Parquet files, the TIMESTAMP type, and Hive vs. Impala: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_timestamp.html. Above all because of these two things:

 

  1. When the table uses Parquet format, Impala expects any time zone adjustment to be applied prior to writing, while TIMESTAMP values written by Hive are adjusted to be in the UTC time zone. When Hive queries Parquet data files that it wrote, it adjusts the TIMESTAMP values back to the local time zone, while Impala does no conversion. Hive does no time zone conversion when it queries Impala-written Parquet files. 
  2. If you have data files written by Hive, those TIMESTAMP values represent the local timezone of the host where the data was written, potentially leading to inconsistent results when processed by Impala. To avoid compatibility problems or having to code workarounds, you can specify one or both of these impalad startup flags: -use_local_tz_for_unix_timestamp_conversions=true -convert_legacy_hive_parquet_utc_timestamps=true. Although -convert_legacy_hive_parquet_utc_timestamps is turned off by default to avoid performance overhead, Cloudera recommends turning it on when processing TIMESTAMP columns in Parquet files written by Hive, to avoid unexpected behavior.

I would like to ask your opinion, considering the previous link, about the best way to write a timestamp value in a Parquet file through HIVE and/or SPARK so that it can be queried with HIVE, IMPALA and SPARK?

 

  1. Writing the file using HIVE and/or SPARK and accepting the performance penalty derived from setting these two impalad properties (see the sketch after this list):
    1. -use_local_tz_for_unix_timestamp_conversions=true
    2. -convert_legacy_hive_parquet_utc_timestamps=true
  2. Writing the file using IMPALA (preparing the table beforehand with HIVE or SPARK to avoid complex queries in Impala), since:
    1. Hive does no time zone conversion when it queries Impala-written Parquet files.
  3. Another option?
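
To make option 1 concrete, here is a rough PySpark sketch of the kind of write I mean. The table and column names are just placeholders, and the two impalad flags above are a separate server-side setting that is not shown here:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical data with a TIMESTAMP column.
    df = (spark.createDataFrame([("evt-1", "2016-03-07 10:15:00")], ["id", "event_ts"])
          .withColumn("event_ts", F.col("event_ts").cast("timestamp")))

    # Write as Parquet so the same files can later be queried from HIVE, IMPALA and SPARK.
    df.write.format("parquet").mode("overwrite").saveAsTable("default.events_parquet")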

Regards,


4 REPLIES

Expert Contributor

It depends on whether you would prefer your timestamps to be converted to UTC (keeping in mind the HiveServer would need to reside in the time zone where the data was generated) or left alone.  I personally prefer not to modify data automatically and to control any time zone corrections within my ETL process.  I also prefer to include a time zone column that records what time zone the data originated in, so that any corrections can be made later.
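
For what it's worth, here is a rough PySpark sketch of that idea: leave the timestamp untouched in the ETL and carry the originating time zone next to it. The table names and the zone literal are just placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical source table with an event_ts timestamp column.
    events = spark.table("staging.events")

    # Leave event_ts exactly as it arrived and record where it came from,
    # so any time zone correction can still be applied later at query time.
    curated = events.withColumn("event_tz", F.lit("Europe/Madrid"))

    curated.write.format("parquet").mode("overwrite").saveAsTable("dw.events")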

 

Just a note: Hive now also has a setting to disable time zone conversion, hive.parquet.timestamp.skip.conversion, which was added in 1.2 as part of https://issues.apache.org/jira/browse/HIVE-9482.
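
In case it helps, this is one way such a session-level SET could be issued from Python, using PyHive purely as an example client (the host, port, user and table are placeholders, and PyHive itself is just my assumption, not something from this thread):

    from pyhive import hive

    # Connect to HiveServer2 (connection details are placeholders).
    conn = hive.connect(host="hs2.example.com", port=10000, username="etl")
    cur = conn.cursor()

    # Ask Hive to skip its Parquet timestamp conversion for this session (HIVE-9482, Hive 1.2+).
    cur.execute("SET hive.parquet.timestamp.skip.conversion=true")
    cur.execute("SELECT event_ts FROM dw.events LIMIT 10")  # hypothetical table
    print(cur.fetchall())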

 

Finally, I'll note that both Hive and Impala try to be smart about when to apply the conversion on read, because multiple engines handle it differently: Hive will detect whether a file was written by a Hive process or by Impala, and Impala will do the same.  It makes sense to play around with the settings, inserting and reading under different configurations, to make sure you fully understand how it works.

Explorer

Thanks!


Only one question comes to mind from your response:

do you store the dates as strings, with a new (time zone) column added, and perform the conversions at query time, or do you store dates & timestamps using the Parquet TIMESTAMP type?

 

--------

 

It depends on whether you would prefer your timestamps to be converted to UTC (keeping in mind the HiveServer would need to reside in the time zone where the data was generated) or left alone.  I personally prefer not to modify data automatically and to control any time zone corrections within my ETL process.  I also prefer to include a time zone column that records what time zone the data originated in, so that any corrections can be made later.

 

Expert Contributor

It would depend on your requirements somewhat, but Impala/Hive and Parquet don't store the time zone with the timestamp, so you may be losing data if you don't have a separate time zone column (i.e. what time zone the event took place in).  I'll typically leave the timestamp as is in Parquet, include the time zone, and allow conversions to be made at query time.  If the users of the table need all timestamps to be in the same time zone, you could make the change before saving to Parquet.  You could also consider doing both: storing the timestamp as is and also storing a manually converted one.
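
If it helps to see that concretely, here is a rough PySpark sketch of the approach: convert at query time using the stored zone, and optionally persist the converted column as well. Table and column names are placeholders, and passing a column as the zone to to_utc_timestamp needs Spark 2.4+ (older versions only accept a literal zone string):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical table: event_ts stored as written, event_tz holding its origin zone.
    events = spark.table("dw.events")

    # Query-time conversion: interpret each timestamp in its recorded zone and normalise to UTC.
    utc_view = events.withColumn("event_ts_utc",
                                 F.to_utc_timestamp(F.col("event_ts"), F.col("event_tz")))

    # "Doing both": persist the converted column alongside the original one.
    utc_view.write.format("parquet").mode("overwrite").saveAsTable("dw.events_with_utc")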

Explorer

We just learned a lesson: convert_legacy_hive_parquet_utc_timestamps=true may cause huge performance issues. Read the comments here:
https://issues.apache.org/jira/browse/IMPALA-3316