spark.sql.sources.partitionOverwriteMode=dynamic not working in CDP 7.1.4
Labels: Cloudera Data Platform (CDP)
Created on 11-08-2020 05:55 PM - edited 11-09-2020 12:03 AM
Hi,
We are using spark.sql.sources.partitionOverwriteMode=dynamic in our PySpark scripts on our CDH 6.3.2 cluster with Spark 2.4.0, but when we try it on CDP 7.1.4, also with Spark 2.4.0, it does not work. Is there any way to make spark.sql.sources.partitionOverwriteMode=dynamic work in CDP, or is there an alternative?
Just to highlight: both our CDP 7.1.4 and CDH 6.3.2 clusters run the same Spark version, 2.4.0.
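For reference, here is a minimal sketch of the kind of write we run (the app name, paths, and partition column are placeholders):

from pyspark.sql import SparkSession

# Placeholder names and paths, for illustration only
spark = (SparkSession.builder
         .appName("dynamic-overwrite-example")
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

df = spark.read.parquet("/data/staging/sales")

# With dynamic mode, overwrite should replace only the partitions present in df;
# all other partitions under the target path are left intact.
(df.write
   .mode("overwrite")
   .partitionBy("ds")
   .parquet("/data/warehouse/sales"))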
Created on 07-26-2022 04:17 AM - edited 07-26-2022 04:24 AM
Hi Team,
CDP uses the "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol" output committer, which does not support dynamicPartitionOverwrite.
You can set the following parameters in your Spark job.
Code level:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("spark.sql.parquet.output.committer.class", "org.apache.parquet.hadoop.ParquetOutputCommitter")
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
spark-submit/spark-shell:
--conf spark.sql.sources.partitionOverwriteMode=dynamic
--conf spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter
--conf spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
Note: If you are using S3, you can disable the optimized S3 committer by setting the spark.cloudera.s3_committers.enabled parameter:
--conf spark.cloudera.s3_committers.enabled=false
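For example, a complete spark-submit invocation with these settings might look like the following (my_job.py is a placeholder script name):

spark-submit \
  --conf spark.sql.sources.partitionOverwriteMode=dynamic \
  --conf spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol \
  my_job.py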
Created 11-24-2020 09:18 AM
I'm having this same issue whether I set this config in spark-defaults.conf via Cloudera Manager for CDP 7.1.4 or inline in my write call with .option("partitionOverwriteMode", "dynamic").
Error message is:
java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
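For reference, a minimal sketch of the failing write (paths and partition column are placeholders):

# Minimal repro sketch; paths and partition column are placeholders
df = spark.read.parquet("/data/in")
(df.write
   .option("partitionOverwriteMode", "dynamic")
   .mode("overwrite")
   .partitionBy("dt")
   .parquet("/data/out"))
# Fails on CDP 7.1.4 with:
# java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite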
Created 05-09-2022 09:56 AM
I was getting the same error after a Cloudera upgrade when using INSERT OVERWRITE with spark.sql.sources.partitionOverwriteMode=dynamic. For me, the following config property resolved the issue:
spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
After the upgrade the default was spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol, and that was causing the issue.
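If you want to confirm what your session is actually using, a quick PySpark check (a sketch; assumes an existing SparkSession named spark):

# Inspect the effective commit protocol class, then override it for this session
print(spark.conf.get("spark.sql.sources.commitProtocolClass", "not set"))
spark.conf.set("spark.sql.sources.commitProtocolClass",
               "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")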
Created 11-24-2020 01:39 PM
I was able to fix this in our CDP 7.1.4 cluster today by disabling
Enable Optimized S3 Committers - spark.cloudera.s3_committers.enabled
in the Spark service configuration.
This works for me because we are using HDFS on premises. If you are using S3, I'm guessing this committer is in place because of S3's eventual-consistency issues.
I've then also added the spark.sql.sources.partitionOverwriteMode=dynamic setting to my spark-defaults.conf, also in the Spark service configuration, via the Safety Valve settings; see the sketch below.
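In other words, the resulting spark-defaults.conf entries end up looking something like this (a sketch; the committer can be disabled via the checkbox above instead of a config line):

spark.cloudera.s3_committers.enabled=false
spark.sql.sources.partitionOverwriteMode=dynamic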
Created 12-15-2021 03:24 AM
It also works for me on CDP 7.1.7.
Thank you
