Created on 10-03-2022 09:57 PM
After upgrading from CDH 6.1.1 to CDP 7.1.7, we saw that Spark was handling Parquet files containing older timestamps incorrectly on a non-UTC cluster.
Symptoms:
When Parquet files that had been written under CDH were read back in CDP, the timestamps of older dates were shifted. Inspecting the files with parquet-tools showed the following extra metadata:
```
hadoop jar /opt/cloudera/parcels/CDH/jars/parquet-tools-1.10.99.7.1.7.0-551.jar meta

extra:  writer.date.proleptic = false
extra:  writer.time.zone = Australia/Sydney
```
In our case, dates from 1985-02-01 onwards were unchanged and appeared correct in all outputs.
Another observation was that the same Parquet data was read correctly from the PySpark shell, but returned incorrect values when the job was run through spark-submit.
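To confirm the symptom, it helps to run the same query from both the PySpark shell and spark-submit and compare the output. The sketch below assumes a hypothetical Hive table `events` with an INT96 timestamp column `event_ts`; substitute the affected table and column from your environment.

```python
# check_timestamps.py -- run this both interactively in the PySpark shell and
# via spark-submit, then compare the printed timestamps.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-timestamp-check")
    .enableHiveSupport()
    .getOrCreate()
)

# 'events' and 'event_ts' are placeholders for the affected Hive Parquet table
# and its timestamp column; pre-1985 rows showed the shift in our case.
spark.sql("SELECT event_ts FROM events WHERE event_ts < '1985-02-01'") \
    .show(20, truncate=False)
```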
Explanation:
CDP ships two kinds of Parquet readers and writers for Spark: the Hive-supplied ones and Spark's native ones. The Hive-supplied Parquet readers and writers are relatively slow compared to Spark's native implementations, so Spark introduced spark.sql.hive.convertMetastoreParquet. When this property is enabled (it is true by default), Spark switches to its native Parquet readers and writers for tables created with Hive table syntax. On the write path (i.e., executing INSERT INTO), Spark can only switch to the native writer if the table is un-partitioned; on the read path (i.e., SELECT * FROM), Spark can switch to the native reader regardless of whether the table is partitioned.
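As a rough illustration of that path selection, the PySpark sketch below creates an un-partitioned and a partitioned Hive-syntax Parquet table and exercises both paths. The table and column names are hypothetical, and the comments describe the expected behaviour with the default spark.sql.hive.convertMetastoreParquet=true.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-path-selection-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical Hive-syntax Parquet tables.
spark.sql("CREATE TABLE IF NOT EXISTS events_flat (id INT, event_ts TIMESTAMP) STORED AS PARQUET")
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_part (id INT, event_ts TIMESTAMP)
    PARTITIONED BY (dt STRING) STORED AS PARQUET
""")

# Write path: the un-partitioned table can be written by Spark's native
# Parquet writer, while the partitioned table falls back to the Hive writer.
spark.sql("INSERT INTO events_flat VALUES (1, CAST('1984-01-01 00:00:00' AS TIMESTAMP))")
spark.sql("INSERT INTO events_part PARTITION (dt='1984-01-01') VALUES (1, CAST('1984-01-01 00:00:00' AS TIMESTAMP))")

# Read path: both tables are read with Spark's native Parquet reader,
# regardless of partitioning -- this is where the timestamp shift described
# in the symptoms can surface.
spark.sql("SELECT * FROM events_flat").show(truncate=False)
spark.sql("SELECT * FROM events_part").show(truncate=False)
```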
This mixing of writer and reader is compatible in most cases, but not, as it turns out, for INT96 timestamps.
Hive performs a specific set of timestamp rebasing operations, and Spark does not necessarily perform exactly the same rebasing. The best solution is therefore to avoid mixing the Hive Parquet writer with the Spark Parquet reader and to use the same category of writer and reader on both sides.
Resolution:
1.) Set spark.sql.hive.convertMetastoreParquet=false. This may cause some relative performance degradation, but it requires neither an upgrade nor re-writing data that has already been processed. You can apply this in your current environment and it should return correct results; see the first sketch after this list.
2.) Upgrade to 7.1.7 SP1 and set spark.hadoop.hive.parquet.timestamp.write.legacy.conversion.enabled=true. The downside is that the old data has to be re-written with this configuration in effect to maintain compatibility between Spark and Hive. This parameter is available in 7.1.7 SP1 but is not present in 7.1.7; see the second sketch below.
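Resolution 1 can be applied per application. A minimal sketch, assuming a placeholder table name, is below; the same property can also be passed to spark-submit with --conf spark.sql.hive.convertMetastoreParquet=false.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-with-hive-serde")
    # Resolution 1: force the Hive SerDe read/write path for Hive Parquet tables.
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .enableHiveSupport()
    .getOrCreate()
)

# 'events_flat' is a placeholder for the affected Hive Parquet table.
spark.sql("SELECT * FROM events_flat WHERE event_ts < '1985-02-01'").show(truncate=False)
```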
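For Resolution 2, once the cluster is on 7.1.7 SP1, the old data has to be re-written while the flag is in effect. The sketch below is only a rough illustration of that re-write, using hypothetical table names; the actual strategy (overwrite in place, staging table and swap, etc.) depends on your environment.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rewrite-legacy-timestamps")
    # Resolution 2 (7.1.7 SP1 only): write timestamps with the legacy conversion
    # so that Hive and Spark remain compatible on the re-written data.
    .config("spark.hadoop.hive.parquet.timestamp.write.legacy.conversion.enabled", "true")
    .enableHiveSupport()
    .getOrCreate()
)

# Placeholder table names: copy the old data into a re-written table while the
# legacy conversion flag is in effect, then swap or rename as appropriate.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_rewritten
    STORED AS PARQUET
    AS SELECT * FROM events_flat
""")
```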