This issue has been a struggle for us for a long time (4+ years), and I am very interested to hear what other Cloudera customers are doing. I understand that Hive is meant to be the batch ETL SQL engine and Impala is really good at end-user ad-hoc queries. Impala also excels with Parquet-backed tables.
My question is: how do you write a TIMESTAMP value using Hive so that Impala will read the same value?
Here is how to reproduce the issue.
1) In Impala, execute this command:
CREATE TABLE default.ts_hive_test ( ts_data TIMESTAMP) STORED AS PARQUET;
2) In Hive, execute this command:
insert into default.ts_hive_test select from_unixtime(unix_timestamp());
3) In Hive, execute this command:
select * from default.ts_hive_test;
4) In Impala, execute this command:
select * from default.ts_hive_test;
The values are different because of a UTC timezone conversion issue that affects only Parquet-backed tables.
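To make the difference concrete, here is roughly what happens. The exact offset depends on the server timezone; I am assuming US/Eastern (UTC-5) and an illustrative insert time for this sketch, so treat the values below as an example rather than real output.

-- Suppose step 2 ran in Hive at local time 2019-03-01 10:00:00 (US/Eastern assumed).
-- Hive normalizes the value to UTC when it writes the Parquet file,
-- then converts back to local time on read, so step 3 shows the original value:
select * from default.ts_hive_test;   -- returns 2019-03-01 10:00:00

-- Impala reads the stored value as-is with no conversion, so step 4 shows the UTC-shifted value:
select * from default.ts_hive_test;   -- returns 2019-03-01 15:00:00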
So my question is, how do Cloudera customers deal with this issue? How can you have TIMESTAMP columns in Impala if Hive cannot write to them?
There was an attempt to add a table property to address this (https://issues.apache.org/jira/browse/HIVE-12767), but it was closed as "Won't Fix". The final comment was that "Hive already has a workaround based on the writer metadata..." What is that workaround in the writer metadata?
This topic was briefly discussed here (http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/Writing-Timestamp-columns-in-Parquet-Files-th...), but those Impala switches cause a huge performance degradation.
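To be specific, the switch I mean is the impalad startup flag below (set on every impalad, e.g. through the Impala command line argument safety valve in Cloudera Manager). It tells Impala to convert timestamps in Hive-written Parquet files from UTC back to local time, but in our testing it noticeably slows down scans of tables with TIMESTAMP columns:

-convert_legacy_hive_parquet_utc_timestamps=true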
Has anyone else faced this issue and come up with a solution? It has been suggested that we change our hundreds of TIMESTAMP columns into STRING columns to avoid the issue, but is that really the only option?
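For reference, here is a rough sketch of what that STRING workaround would look like (ts_string_test is a hypothetical table name made up for illustration):

-- Define the column as STRING instead of TIMESTAMP:
CREATE TABLE default.ts_string_test ( ts_data STRING) STORED AS PARQUET;

-- Hive writes the formatted local time as plain text:
insert into default.ts_string_test select from_unixtime(unix_timestamp());

-- Hive and Impala now return the same text value, since no timezone interpretation
-- happens on a STRING, but every query that needs date arithmetic has to cast it back:
select cast(ts_data as timestamp) from default.ts_string_test;

That works, but it pushes the problem into every downstream query, which is why I am hoping there is a better answer.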