After upgrading from CDH 6.1.1 to CDP 7.1.7, we saw that Spark was handling Parquet files containing older timestamps incorrectly on a non-UTC cluster.

 

Symptoms:

 

When Parquet files that had originally been saved on CDH were read in CDP, the timestamps for older dates were changed, and the following extra properties appeared when checking the files with parquet-tools:

 

hadoop jar /opt/cloudera/parcels/CDH/jars/parquet-tools-1.10.99.7.1.7.0-551.jar meta 

 

extra:                         writer.date.proleptic = false
extra:                         writer.time.zone = Australia/Sydney

 

In our case, dates from 1985-02-01 onwards were not changed and appeared normal in all outputs.

 

Another observation was that when this Parquet data was read through pyspark the values were correct, but the same read returned incorrect values when run through spark-submit.

 

Explanation:

 

There are two types of Parquet readers and writers in CDP: one is Hive-based and the other is Spark-native. The Hive-supplied Parquet readers and writers are relatively slow compared to Spark's native ones, so Spark introduced spark.sql.hive.convertMetastoreParquet. When this property is enabled, Spark switches to its native Parquet readers and writers for tables created using Hive table syntax; it is true by default. On the write path (i.e. executing INSERT INTO), Spark can only switch to the native writer if the table is un-partitioned, while on the read path (i.e. SELECT * FROM), Spark can switch to the native reader irrespective of whether the table is partitioned or not.
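
To make that asymmetry concrete, here is a minimal pyspark sketch; the table name demo_events and the sample values are made up for illustration. With the default setting, the INSERT into a partitioned Hive-syntax table still goes through the Hive Parquet writer, while the subsequent SELECT can use Spark's native reader.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The conversion property is true by default.
print(spark.conf.get("spark.sql.hive.convertMetastoreParquet"))

# Partitioned table created with Hive syntax: the INSERT below cannot use
# the Spark-native writer, so the data is written by the Hive Parquet writer.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_events (
        id BIGINT,
        event_ts TIMESTAMP
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
""")
spark.sql("""
    INSERT INTO demo_events PARTITION (dt = '1980-01-01')
    VALUES (1, CAST('1980-01-01 10:00:00' AS TIMESTAMP))
""")

# The read path, however, can switch to the Spark-native Parquet reader,
# so the data ends up written and read by different implementations.
spark.sql("SELECT * FROM demo_events").show(truncate=False)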

 

This mix-and-match of writer and reader is compatible in most cases, but that is apparently not the case for INT96 timestamps.

  • Hive performs a specific set of timestamp rebasing operations, and Spark does not necessarily perform the exact same rebasing. The best solution here is therefore to avoid mixing the Hive Parquet writer with the Spark Parquet reader and to use the same category of writer and reader; a quick way to check which writer produced a given file is sketched below.
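
One way to check which writer produced a given file is to look for the extra footer keys shown in the parquet-tools output above. The sketch below assumes pyarrow is available on the node and uses a placeholder file path:

import pyarrow.parquet as pq

# Placeholder path; point this at one of the Parquet data files in question.
meta = pq.read_metadata("/tmp/part-00000.parquet")
kv = meta.metadata or {}

# Files written through the Hive Parquet writer carry the extra keys shown
# in the parquet-tools output above; Spark-native files typically do not.
print(kv.get(b"writer.time.zone"))
print(kv.get(b"writer.date.proleptic"))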

Resolution:

 

1.) Set spark.sql.hive.convertMetastoreParquet=false. This may cause some relative performance degradation, but it requires no upgrade and no re-writing of already processed data. You can apply this solution in your current environment, and it should give you correct results.
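
As a rough sketch of applying option 1 in a pyspark job (the application and table names below are placeholders); the same value can also be passed on the command line via spark-submit --conf spark.sql.hive.convertMetastoreParquet=false:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("convert-metastore-parquet-off")   # placeholder name
    .enableHiveSupport()
    # Keep Spark on the Hive Parquet reader/writer for Hive tables so that
    # the same timestamp rebasing is applied on both the read and write path.
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .getOrCreate()
)

# Reads of the CDH-written data should now return the original timestamps.
spark.sql("SELECT * FROM your_db.your_table").show(truncate=False)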

 

2.) Upgrade to 7.1.7 SP1 and set spark.hadoop.hive.parquet.timestamp.write.legacy.conversion.enabled=true. The downside is that the old data needs to be re-written with this config value set in order to maintain compatibility between Spark and Hive. This parameter is available in 7.1.7 SP1 but not in 7.1.7.
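
A similar sketch for option 2, assuming the cluster is already on 7.1.7 SP1 (table names are placeholders); the flag can equally be passed through spark-submit --conf:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("legacy-timestamp-rewrite")   # placeholder name
    .enableHiveSupport()
    # Only available from 7.1.7 SP1 onwards.
    .config("spark.hadoop.hive.parquet.timestamp.write.legacy.conversion.enabled", "true")
    .getOrCreate()
)

# The previously written data has to be re-written while this setting is in
# effect, for example by copying it from the old table into a new one.
spark.sql("""
    INSERT OVERWRITE TABLE your_db.your_table
    SELECT * FROM your_db.your_table_old
""")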
