We are running CDH 5.9.0 (Impala 2.7.0, Hive 1.1.0).
We know that when querying timestamp fields (in Parquet tables generated by Hive) with Impala, we may get different results than in Hive because of timezone conversion. The startup args of our Impala are:
What confuses us is this: no matter whether we set hive.parquet.timestamp.skip.conversion to true or false while generating separate Parquet tables in Hive, we get the same timestamp results when querying both generated tables with Impala. We expected the results to differ when the value of hive.parquet.timestamp.skip.conversion differs, but it just doesn't behave that way.
We are really confused about this; any reply will be appreciated.
The steps below show a test:
CREATE TABLE test_timestamp (ts TIMESTAMP) STORED AS TEXTFILE;
CREATE TABLE test_ts_skip_conversion_true_parquet (ts TIMESTAMP) STORED AS PARQUET;
CREATE TABLE test_ts_skip_conversion_false_parquet (ts TIMESTAMP) STORED AS PARQUET;
Step 1: load data into test_timestamp and query it.
Step 2: with hive.parquet.timestamp.skip.conversion=true, insert the data into test_ts_skip_conversion_true_parquet.
Step 3: with hive.parquet.timestamp.skip.conversion=false, insert the data into test_ts_skip_conversion_false_parquet.
Step 4: query test_ts_skip_conversion_true_parquet and test_ts_skip_conversion_false_parquet with Impala; we get the same result, but we expected different results here!
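For reference, the Hive statements for steps 1-3 would look roughly like this (the sample timestamp value is illustrative, not from the original test):

```sql
-- Step 1: load a sample row and check it in Hive
INSERT INTO test_timestamp VALUES ('2016-01-01 00:00:00');
SELECT ts FROM test_timestamp;

-- Step 2: write a Parquet table with conversion-skipping enabled
SET hive.parquet.timestamp.skip.conversion=true;
INSERT OVERWRITE TABLE test_ts_skip_conversion_true_parquet
SELECT ts FROM test_timestamp;

-- Step 3: write a Parquet table with conversion-skipping disabled
SET hive.parquet.timestamp.skip.conversion=false;
INSERT OVERWRITE TABLE test_ts_skip_conversion_false_parquet
SELECT ts FROM test_timestamp;
```

As the answer below explains, identical output from the two tables is actually expected: the flag only affects reads, so both INSERTs write UTC-converted timestamps to the Parquet files.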
From the Hive documentation for hive.parquet.timestamp.skip.conversion:
"Current Hive implementation of Parquet stores timestamps in UTC on-file, this flag allows skipping of the conversion on reading Parquet files created from other tools that may not have done so."
Note this is only on reading Parquet files; conversion to UTC still occurs when writing Parquet files.
A workaround if you really want to skip conversion is to set the JVM timezone to UTC. Then Hive will think the local timezone is UTC. You can do this by adding "-Duser.timezone=UTC" to Java Configuration Options for HiveServer2 in Cloudera Manager.
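A quick way to confirm the JVM timezone change took effect is to run a session-timezone-dependent function from beeline after restarting HiveServer2 (the expected value assumes the JVM timezone is now UTC):

```sql
-- from_unixtime() renders epoch seconds in the JVM's timezone,
-- so epoch 0 should render as midnight on 1970-01-01 under UTC
SELECT from_unixtime(0);
-- expected: 1970-01-01 00:00:00
```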
When using this option, if you have users writing to the database from different timezones, those timezones won't be taken into account, resulting in incorrect timestamps (standardizing between timezones is the original point of converting to UTC). Essentially, you'll have fixed the Hive/Impala incompatibility at the cost of recreating the original timezone incompatibility.
Furthermore, the above change is on HiveServer2, so it won't affect users on the deprecated Hive CLI (which bypasses HS2) or running local Spark. There may also be other unforeseen environments which will bypass this setting.
Thus, if you want a magic-bullet solution to the Hive/Impala timezone incompatibility, your best bet is to set the Impala flags "--use_local_tz_for_unix_timestamp_conversions=true" and "--convert_legacy_hive_parquet_utc_timestamps=true" despite the performance hit (which is fixed in CDH 6.1).
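In Cloudera Manager, these flags go into the Impala daemon command-line arguments (the exact safety-valve field name varies slightly by CM version):

```
--use_local_tz_for_unix_timestamp_conversions=true
--convert_legacy_hive_parquet_utc_timestamps=true
```

All impalad instances need a restart for the flags to take effect.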
Alternatively, you can manually convert to UTC whenever timestamps are written in Impala. This may be viable if you have a small number of tables which use timestamps and performance is critical.
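A minimal sketch of that manual conversion using Impala's to_utc_timestamp() function (the table names and 'America/Los_Angeles' timezone are placeholders; substitute your cluster's local timezone):

```sql
-- Convert local timestamps to UTC explicitly at write time in Impala,
-- so the on-file values match what Hive would have written
INSERT INTO events_parquet
SELECT to_utc_timestamp(ts, 'America/Los_Angeles')
FROM events_staging;
```

The symmetric from_utc_timestamp() can then be applied at read time where local-time display is needed.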