Created on 10-03-2022 09:57 PM
After upgrading from CDH 6.1.1 to CDP 7.1.7, we saw that Spark was handling Parquet files containing older timestamps incorrectly on a non-UTC cluster.
Symptoms:
When Parquet files that had been written under CDH were read back in CDP, the timestamps of older dates were shifted. Inspecting the files with parquet-tools showed the following extra metadata:
```
hadoop jar /opt/cloudera/parcels/CDH/jars/parquet-tools-1.10.99.7.1.7.0-551.jar meta

extra:  writer.date.proleptic = false
extra:  writer.time.zone = Australia/Sydney
```
In our case, dates from 1985-02-01 onwards were unchanged and appeared correct in all outputs.
Another observation was that the same Parquet data was read correctly from the PySpark shell, but returned incorrect values when the job was run through spark-submit.
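To confirm the symptom, it helps to run the same query from both the PySpark shell and spark-submit and compare the output. The sketch below assumes a hypothetical Hive table `events` with an INT96 timestamp column `event_ts`; substitute the affected table and column from your environment.

```python
# check_timestamps.py -- run this both interactively in the PySpark shell and
# via spark-submit, then compare the printed timestamps.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-timestamp-check")
    .enableHiveSupport()
    .getOrCreate()
)

# 'events' and 'event_ts' are placeholders for the affected Hive Parquet table
# and its timestamp column; pre-1985 rows showed the shift in our case.
spark.sql("SELECT event_ts FROM events WHERE event_ts < '1985-02-01'") \
    .show(20, truncate=False)
```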
Explanation:
CDP ships two kinds of Parquet readers and writers for Spark: the Hive-supplied ones and Spark's native ones. The Hive-supplied Parquet readers and writers are relatively slow compared to Spark's native implementations, so Spark introduced spark.sql.hive.convertMetastoreParquet. When this property is enabled (it is true by default), Spark switches to its native Parquet readers and writers for tables created with Hive table syntax. On the write path (i.e., executing INSERT INTO), Spark can only switch to the native writer if the table is un-partitioned; on the read path (i.e., SELECT * FROM), Spark can switch to the native reader regardless of whether the table is partitioned.
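As a rough illustration of that path selection, the PySpark sketch below creates an un-partitioned and a partitioned Hive-syntax Parquet table and exercises both paths. The table and column names are hypothetical, and the comments describe the expected behaviour with the default spark.sql.hive.convertMetastoreParquet=true.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-path-selection-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical Hive-syntax Parquet tables.
spark.sql("CREATE TABLE IF NOT EXISTS events_flat (id INT, event_ts TIMESTAMP) STORED AS PARQUET")
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_part (id INT, event_ts TIMESTAMP)
    PARTITIONED BY (dt STRING) STORED AS PARQUET
""")

# Write path: the un-partitioned table can be written by Spark's native
# Parquet writer, while the partitioned table falls back to the Hive writer.
spark.sql("INSERT INTO events_flat VALUES (1, CAST('1984-01-01 00:00:00' AS TIMESTAMP))")
spark.sql("INSERT INTO events_part PARTITION (dt='1984-01-01') VALUES (1, CAST('1984-01-01 00:00:00' AS TIMESTAMP))")

# Read path: both tables are read with Spark's native Parquet reader,
# regardless of partitioning -- this is where the timestamp shift described
# in the symptoms can surface.
spark.sql("SELECT * FROM events_flat").show(truncate=False)
spark.sql("SELECT * FROM events_part").show(truncate=False)
```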
This mixing of writer and reader is compatible in most cases, but not, as it turns out, for INT96 timestamps.
Hive performs a specific set of timestamp rebasing operations, and Spark does not necessarily perform exactly the same rebasing. The best solution is therefore to avoid mixing the Hive Parquet writer with the Spark Parquet reader and to use the same category of writer and reader on both sides.
Resolution:
1.) Set spark.sql.hive.convertMetastoreParquet=false. This may cause some relative performance degradation, but it requires neither an upgrade nor re-writing data that has already been processed. You can apply this in your current environment and it should return correct results; see the first sketch after this list.
2.) Upgrade to 7.1.7 SP1 and set spark.hadoop.hive.parquet.timestamp.write.legacy.conversion.enabled=true. The downside is that the old data has to be re-written with this configuration in effect to maintain compatibility between Spark and Hive. This parameter is available in 7.1.7 SP1 but is not present in 7.1.7; see the second sketch below.
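Resolution 1 can be applied per application. A minimal sketch, assuming a placeholder table name, is below; the same property can also be passed to spark-submit with --conf spark.sql.hive.convertMetastoreParquet=false.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-with-hive-serde")
    # Resolution 1: force the Hive SerDe read/write path for Hive Parquet tables.
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .enableHiveSupport()
    .getOrCreate()
)

# 'events_flat' is a placeholder for the affected Hive Parquet table.
spark.sql("SELECT * FROM events_flat WHERE event_ts < '1985-02-01'").show(truncate=False)
```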
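For Resolution 2, once the cluster is on 7.1.7 SP1, the old data has to be re-written while the flag is in effect. The sketch below is only a rough illustration of that re-write, using hypothetical table names; the actual strategy (overwrite in place, staging table and swap, etc.) depends on your environment.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rewrite-legacy-timestamps")
    # Resolution 2 (7.1.7 SP1 only): write timestamps with the legacy conversion
    # so that Hive and Spark remain compatible on the re-written data.
    .config("spark.hadoop.hive.parquet.timestamp.write.legacy.conversion.enabled", "true")
    .enableHiveSupport()
    .getOrCreate()
)

# Placeholder table names: copy the old data into a re-written table while the
# legacy conversion flag is in effect, then swap or rename as appropriate.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_rewritten
    STORED AS PARQUET
    AS SELECT * FROM events_flat
""")
```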