Created on 12-12-2016 10:11 AM - edited 09-16-2022 03:50 AM
Hi Cloudera people,
I am really surprised by this documentation about Parquet files, the TIMESTAMP type, and Hive vs. Impala: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_timestamp.html, above all by a couple of the points it makes.
Considering that link, I would like to ask your opinion: what is the best way to write a timestamp value to a Parquet file through Hive and/or Spark so that it can then be queried with Hive, Impala, and Spark?
Regards,
Created 12-21-2016 07:03 AM
It depends on whether you would prefer your timestamps to be converted to UTC (knowing that the HiveServer would need to reside in the time zone the data was generated in) or left alone. I personally prefer not to modify data automatically and to control any time zone corrections within my ETL process. I also prefer to include a time zone column that records which time zone the data originated in, so that any corrections can still be made later.
Just a note: Hive now also has a setting to disable time zone conversion, hive.parquet.timestamp.skip.conversion, which was added in Hive 1.2 as part of https://issues.apache.org/jira/browse/HIVE-9482.
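For reference, here is a minimal session-level example of that setting, assuming Hive 1.2 or later:

    -- Skip the UTC adjustment when reading Parquet timestamps
    -- written by other tools (HIVE-9482):
    SET hive.parquet.timestamp.skip.conversion=true;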
Finally, I'll note that both Hive and Impala try to be smart about when to apply the conversion on read, because the engines handle it differently: Hive detects whether a file was written by a Hive process or by Impala, and Impala does the same. It makes sense to experiment with the settings, inserting and reading under different configurations, to make sure you fully understand how it all works.
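If you want to see the behavior for yourself, a small round-trip test along these lines makes the differences visible (the ts_demo table name is made up for illustration):

    -- In Hive: write a timestamp into a Parquet table.
    CREATE TABLE ts_demo (ts TIMESTAMP) STORED AS PARQUET;
    INSERT INTO TABLE ts_demo VALUES ('2016-12-21 07:03:00');

    -- In Impala: read the same file. Depending on the conversion
    -- settings on each side, the value may come back shifted by the
    -- server's UTC offset.
    SELECT ts FROM ts_demo;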
Created 12-21-2016 07:51 AM
Thanks!
Just one question comes to mind after your response:
do you store the dates as STRING with a new time zone column added and then perform the conversions at query time, or do you store dates and timestamps using the Parquet TIMESTAMP type?
--------
It depends on whether you would prefer your timestamps to be converted to UTC (knowing that the HiveServer would need to reside in the time zone the data was generated in) or left alone. I personally prefer not to modify data automatically and to control any time zone corrections within my ETL process. I also prefer to include a time zone column that records which time zone the data originated in, so that any corrections can still be made later.
Created 12-22-2016 05:32 AM
It depends somewhat on your requirements, but Impala, Hive, and Parquet don't store the time zone with the timestamp, so you may be losing information if you don't have a separate time zone column (i.e., which time zone the event took place in). I'll typically leave the timestamp as-is in Parquet, include the time zone column, and allow conversions to be made at query time, as in the sketch below. If the users of the table need all timestamps to be in the same time zone, you could make the change before saving to Parquet. You could also consider doing both: storing the value as-is and also storing a converted timestamp.
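For illustration, a minimal Hive sketch of that layout; the events table and its event_ts / event_tz columns are made-up names:

    -- Store the wall-clock timestamp as-is, plus the zone it was recorded in.
    CREATE TABLE events (
      event_id BIGINT,
      event_ts TIMESTAMP,  -- local wall-clock time, no conversion applied on write
      event_tz STRING      -- e.g. 'America/New_York'
    )
    STORED AS PARQUET;

    -- Convert only at query time, when a common zone is needed,
    -- e.g. normalizing everything to UTC:
    SELECT event_id,
           to_utc_timestamp(event_ts, event_tz) AS event_ts_utc
    FROM events;

Both Hive and Impala provide to_utc_timestamp()/from_utc_timestamp(), so the same query-time conversion works from either engine.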
Created 07-14-2017 09:30 AM
We just learned a hard lesson: convert_legacy_hive_parquet_utc_timestamps=true may cause huge performance issues. Read the comments here:
https://issues.apache.org/jira/browse/IMPALA-3316
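For anyone hitting this, note that it is a startup flag on the impalad daemons, not a query option, so changing it requires a restart. A sketch for a manually started impalad (under Cloudera Manager it would go into the Impala command-line argument advanced configuration snippet instead):

    # Applies the UTC-to-local conversion to timestamps in Parquet files
    # written by Hive; see IMPALA-3316 for the performance cost.
    impalad --convert_legacy_hive_parquet_utc_timestamps=true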