Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive:CSV to Parquet conversion - timestamp

Hive:CSV to Parquet conversion - timestamp

Rising Star

So there is a flat text file, all attributes are separated by comma. One of the attribute (say "created") has timestamp value in the string format and it is of UTC timezone value.

The requirement is to convert from CSV to Parquet. This is being achieved by having one table based on CSV the other one based on Parquet format and the process runs on a server with EST timezone. However, the attribute "created" timestamp when read from Spark process (running on a UTC timezone server) increments by +4 hrs.

Can someone shed somelight on how this can be prevented in Hive while writing into Parquet?


Re: Hive:CSV to Parquet conversion - timestamp

Super Collaborator

Hi @Kumar,

you have an option to add the Timezone as separate column and when you are retrieving the data apply the following logic to extract the right format data,and store the data in unified format across all the systems (or just store in EPOCH format into integer data in UTC format while writing the data).

lets say tz is your timezone (ex EST)

hive> select * from  test_timestamp;
2017-08-22 15:05:03	EST

select to_utc_timestamp(ts,tz) from test_timestamp;
2017-08-22 20:05:03

again to get back what ever the timezone you want (example

select from_utc_timestamp(to_utc_timestamp(ts,tz),'EST') from test_timestamp;

Re: Hive:CSV to Parquet conversion - timestamp

Rising Star

What we observe is that the conversion to UTC on the fly doesn't help, since at the time of writing it always uses the EST (system timezone) into Parquet.

Don't have an account?
Coming from Hortonworks? Activate your account here