12-20-2018
12:11 PM
From the Hive documentation for hive.parquet.timestamp.skip.conversion: "Current Hive implementation of Parquet stores timestamps in UTC on-file, this flag allows skipping of the conversion on reading Parquet files created from other tools that may not have done so." Note this only applies when reading Parquet files; conversion to UTC still occurs when writing them.

A workaround, if you really want to skip conversion, is to set the JVM timezone to UTC so that Hive thinks the local timezone is UTC. You can do this by adding "-Duser.timezone=UTC" to Java Configuration Options for HiveServer2 in Cloudera Manager.

WARNING: With this option, if you have users writing to a database from different timezones, that won't be taken into account, resulting in incorrect timestamps (standardizing between timezones is the original point of converting to UTC). Essentially, you'll have fixed the Hive/Impala incompatibility at the cost of recreating the original timezone incompatibility. Furthermore, the change above is on HiveServer2, so it won't affect users on the deprecated Hive CLI (which bypasses HS2) or running local Spark. There may also be other unforeseen environments that bypass this setting.

Thus, if you want a magic-bullet solution to the Hive/Impala timezone incompatibility, your best bet is to set the Impala flags "--use_local_tz_for_unix_timestamp_conversions=true" and "--convert_legacy_hive_parquet_utc_timestamps=true" despite the performance hit (which is fixed in CDH 6.1).

Alternatively, you can manually convert to UTC whenever timestamps are written in Impala. This may be viable if you have a small number of tables that use timestamps and performance is critical.

Sources:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.parquet.timestamp.skip.conversion
https://www.cloudera.com/documentation/enterprise/5-15-x/topics/impala_timestamp.html
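To illustrate the manual-conversion approach, here is a minimal sketch in Impala SQL. The table and column names are made up for the example; the conversion itself uses Impala's built-in to_utc_timestamp() and from_utc_timestamp() functions, and 'America/Los_Angeles' stands in for whatever the writer's local timezone actually is:

-- Hypothetical Parquet table; "events" and its columns are placeholders.
CREATE TABLE events (id BIGINT, ts TIMESTAMP) STORED AS PARQUET;

-- On write: convert the local-time value to UTC before storing,
-- mirroring what Hive does when it writes Parquet timestamps.
INSERT INTO events
SELECT 1, to_utc_timestamp(now(), 'America/Los_Angeles');

-- On read: convert back from UTC to the reader's local timezone.
SELECT id, from_utc_timestamp(ts, 'America/Los_Angeles') AS local_ts
FROM events;

The cost is that every query touching those timestamp columns must remember to apply the conversion, which is why this only scales to a small number of tables.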
07-02-2017
07:57 PM
I have re-run the test, and Kudu performs much better this time (though it's still a little slower than Parquet). Thanks for @mpercy's suggestion. I changed two things when re-running the test:
1. Increased the partitions for the fact table from 60 to 768 (affects all queries).
2. Changed the 'or' predicate in query3.sql into an 'in' predicate, so the predicate can be pushed down to Kudu (only affects query 3); see the sketch below.
Below are the re-run results. (Column 'kudu60' is the previous result, where the fact table has 60 partitions; column 'kudu768' is the new result, where it has 768.)
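For reference, the predicate rewrite in point 2 looks like this; the table name, column, and values are hypothetical placeholders, and only the OR-to-IN shape matters:

-- Before: an OR of equality predicates is evaluated in Impala,
-- so Kudu has to return every row to the scan.
SELECT count(*) FROM fact WHERE d_year = 1999 OR d_year = 2000;

-- After: the equivalent IN predicate can be pushed down to Kudu,
-- which filters rows server-side before returning them.
SELECT count(*) FROM fact WHERE d_year IN (1999, 2000);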
04-11-2017
06:36 PM
Got it, thanks a lot.