12-20-2018 12:11 PM
From the Hive documentation for hive.parquet.timestamp.skip.conversion: "Current Hive implementation of Parquet stores timestamps in UTC on-file, this flag allows skipping of the conversion on reading Parquet files created from other tools that may not have done so." Note that this applies only when reading Parquet files; conversion to UTC still occurs when writing them.

A workaround, if you really want to skip conversion, is to set the JVM timezone to UTC, so that Hive treats the local timezone as UTC. You can do this by adding "-Duser.timezone=UTC" to the Java Configuration Options for HiveServer2 in Cloudera Manager (sketched at the end of this post).

WARNING: with this option, if you have users writing to a database from different timezones, those timezones won't be taken into account, resulting in incorrect timestamps. Standardizing between timezones is the original point of converting to UTC, so you will have fixed the Hive/Impala incompatibility at the cost of recreating the original timezone incompatibility. Furthermore, the change above is on HiveServer2, so it won't affect users on the deprecated Hive CLI (which bypasses HS2) or running local Spark, and there may be other unforeseen environments that bypass this setting.

Thus, if you want a magic-bullet solution to the Hive/Impala timezone incompatibility, your best bet is to set the Impala flags "--use_local_tz_for_unix_timestamp_conversions=true" and "--convert_legacy_hive_parquet_utc_timestamps=true" despite the performance hit (which is fixed in CDH 6.1).

Alternatively, you can manually convert to UTC whenever timestamps are written in Impala. This may be viable if you have a small number of tables that use timestamps and performance is critical (see the last sketch below).
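For what it's worth, hive.parquet.timestamp.skip.conversion can typically also be toggled per Hive session; a minimal illustration, assuming a hypothetical table my_parquet_table:

    -- Read Parquet timestamps as-is, skipping the UTC-to-local conversion:
    SET hive.parquet.timestamp.skip.conversion=true;
    SELECT event_ts FROM my_parquet_table;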
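For concreteness, the HiveServer2 workaround is a one-line addition (keep any options already in the field, and restart HiveServer2 afterward; the navigation path below is from Cloudera Manager 5.x, so verify it against your version):

    Cloudera Manager > Hive > Configuration > Java Configuration Options for HiveServer2:

    -Duser.timezone=UTC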
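Similarly, the two Impala flags are startup arguments for the Impala daemons; in Cloudera Manager they usually go into the Impala Command Line Argument Advanced Configuration Snippet (Safety Valve), though the exact field name may vary by release, followed by a restart of the Impala service:

    --use_local_tz_for_unix_timestamp_conversions=true
    --convert_legacy_hive_parquet_utc_timestamps=true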
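Finally, a sketch of the manual-conversion approach in Impala using the built-in to_utc_timestamp() and from_utc_timestamp() functions; the events/staging_events tables and the America/New_York zone are placeholders for your own schema and the writer's actual timezone:

    -- Writing: normalize the timestamp to UTC before it lands in Parquet.
    INSERT INTO events
    SELECT id, to_utc_timestamp(event_ts, 'America/New_York')
    FROM staging_events;

    -- Reading: convert back to the reader's local timezone.
    SELECT id, from_utc_timestamp(event_ts, 'America/New_York') AS local_ts
    FROM events;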
Sources:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.parquet.timestamp.skip.conversion
https://www.cloudera.com/documentation/enterprise/5-15-x/topics/impala_timestamp.html