So there is a flat text file whose attributes are comma-separated. One of the attributes (say "created") holds a timestamp as a string, and the value is in UTC.
The requirement is to convert from CSV to Parquet. This is achieved with two Hive tables, one backed by CSV and the other by Parquet, and the process runs on a server in the EST timezone. However, when the "created" timestamp is later read by a Spark process (running on a server in the UTC timezone), it comes back shifted forward by 4 hours.
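A minimal Python sketch of why the shift happens, assuming the writer interprets the UTC string as local server time (America/New_York, which observes EDT in August, hence the 4-hour offset):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The file stores a UTC wall-clock string such as:
created = "2017-08-22 15:05:03"

# If the writer misinterprets this string as the server's local zone
# (America/New_York, on EDT in August)...
naive = datetime.strptime(created, "%Y-%m-%d %H:%M:%S")
as_local = naive.replace(tzinfo=ZoneInfo("America/New_York"))

# ...then a reader on a UTC server sees the instant shifted by +4 hours:
as_utc = as_local.astimezone(ZoneInfo("UTC"))
print(as_utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-08-22 19:05:03
```

The string itself never changes; only the zone it is interpreted in does, which is exactly the class of bug described above.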
Can someone shed some light on how this can be prevented in Hive while writing into Parquet?
You have the option of adding the timezone as a separate column; when you retrieve the data, apply the following logic to extract it in the right form, and store the data in a unified format across all systems. Alternatively, just store the value as an epoch integer (which is inherently UTC) when writing the data.
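A quick sketch of the epoch alternative, in Python for illustration (the helper names `to_epoch_utc`/`from_epoch_utc` are hypothetical, not Hive functions): an epoch integer has no timezone to misinterpret, so any system reconstructs the same instant.

```python
from datetime import datetime, timezone

def to_epoch_utc(ts: str) -> int:
    # Parse the UTC wall-clock string and return epoch seconds.
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def from_epoch_utc(epoch: int) -> str:
    # Reconstruct the UTC wall-clock string from epoch seconds.
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

epoch = to_epoch_utc("2017-08-22 20:05:03")
print(epoch)                  # 1503432303
print(from_epoch_utc(epoch))  # 2017-08-22 20:05:03
```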
Let's say tz is your timezone column (e.g. EST):

hive> select * from test_timestamp;
2017-08-22 15:05:03    EST

hive> select to_utc_timestamp(ts, tz) from test_timestamp;
2017-08-22 20:05:03
Then, to get back whatever timezone you want, for example:

hive> select from_utc_timestamp(to_utc_timestamp(ts, tz), 'EST') from test_timestamp;
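A rough Python analogue of this round trip, assuming Hive's semantics of treating 'EST' as a fixed UTC-5 zone (the function names here mirror the Hive UDFs but are local sketches, not the real implementations):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc_timestamp(ts: str, tz: str) -> datetime:
    # Interpret the naive wall-clock string in zone `tz`, then convert to UTC.
    naive = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=ZoneInfo(tz)).astimezone(ZoneInfo("UTC"))

def from_utc_timestamp(dt: datetime, tz: str) -> datetime:
    # Convert a UTC datetime back into zone `tz`.
    return dt.astimezone(ZoneInfo(tz))

utc = to_utc_timestamp("2017-08-22 15:05:03", "EST")
print(utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-08-22 20:05:03

round_trip = from_utc_timestamp(utc, "EST")
print(round_trip.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-08-22 15:05:03
```

The round trip is lossless as long as the same zone label is used on both sides, which is why storing the zone alongside the value (or normalizing to UTC/epoch) matters.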
What we observe is that converting to UTC on the fly doesn't help: at write time, Hive always applies the system timezone (EST) when writing the timestamp into Parquet.