Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

avatar
Contributor

Dears team, I have been using PySpark 3.4.2 with the following syntax:

 

 

sql_query = "
INSERT OVERWRITE TABLE table_1 PARTITION(partition_date = {YYYYMMDD})
SELECT
    table_1.a
    , table_1.b
    , table_2.c
FROM table_2 change_capture_view
FULL OUTER JOIN (
    SELECT * FROM table_1 WHERE WHERE partition_date = {YYYYMMDD_D_1}
) current_view
    ON change_capture_view.a <=> current_view.a
WHERE change_capture_view.a IS NULL
"

 

 

and use spark.sql(sql_query ). And encountered the error:
 

 

File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from.

 

In essence, I am trying to retrieve data from the partitions of previous dates to process and write into the partition of the current date on the same table.

 

Although previously, with the same syntax, I used it on PySpark 2.3.2.3.1.0.0-78 and it worked normally. Can someone help me with this issue? I've tried creating a temporary table from table 1, but still encountered a similar error.
1 ACCEPTED SOLUTION

avatar
Super Collaborator

Hi @sonnh 

The way Spark and Hive handle reading and writing data back to the same table differs. Spark typically clears the target path before writing new data, while Hive writes to a temporary directory first and then replaces the target path with the result data upon task completion.

When working with specific file formats like ORC or Parquet and interacting with Hive metastore, consider adjusting these Spark settings as needed:

  • --conf spark.sql.hive.convertMetastoreParquet=false
  • --conf spark.sql.hive.convertMetastoreOrc=false

Reference:

  1. https://community.cloudera.com/t5/Support-Questions/Insert-overwrite-with-in-the-same-table-in-spark...
  2. https://www.baifachuan.com/posts/da7bb348.html

View solution in original post

6 REPLIES 6

avatar
Contributor

I think this is a bug in Spark. I followed their changes in the documentation (https://spark.apache.org/docs/latest/sql-migration-guide.html), but I haven't seen any notes about this problem.

I find that there is another temporary solution to address this issue. We can directly write to the location of a desired partition on that table. I have implemented it as follows:

 

-- Create a test table:
CREATE EXTERNAL TABLE IF NOT EXISTS staging.current_sonnh (
`date`	date,	
deal_id STRING,
hr_code STRING,
custid STRING
)
PARTITIONED BY (partition_date STRING)
STORED AS ORC
LOCATION '/lake/staging_zone/sonnh/current_sonnh'
TBLPROPERTIES("orc.compress"="SNAPPY", "external.table.purge"="true");

 

-- Insert sample data
INSERT  INTO TABLE
	staging.current_sonnh
	(
	`date`, deal_id, hr_code, custid, partition_date
	)
SELECT
	TO_DATE('2024-01-01') , 1234, 'HR1234', 'CI1234', 20240101;

Initialize the Spark session and perform as below:

 

x = spark.read.format("orc").load('/lake/staging_zone/sonnh/current_sonnh/partition_date=20240101')
spark.sql("ALTER table staging.current_sonnh ADD PARTITION (partition_date=20240102)")
x.write.mode("overwrite").orc("/lake/staging_zone/sonnh/current_sonnh/partition_date=20240102")

 

 

avatar
Super Collaborator

Hi @sonnh 

Generally it is not advisable to read and write the same table at a time. It can result in anything between data corruption and complete data loss in case of failure.

As a temporary solution, First create a temporary view by reading the table data and later you can use that data and finally save the data to destination table.

Reference: 

  1. https://stackoverflow.com/questions/38746773/read-from-a-hive-table-and-write-back-to-it-using-spark...
  2. https://issues.apache.org/jira/browse/SPARK-27030

avatar
Contributor

Thank you, though your point about data integrity is valid, it's worth noting that PySpark has supported this feature since version 2.1, and there hasn't been any announcement about its removal. I believe this might be a bug.

avatar
Super Collaborator

Hi @sonnh 

The way Spark and Hive handle reading and writing data back to the same table differs. Spark typically clears the target path before writing new data, while Hive writes to a temporary directory first and then replaces the target path with the result data upon task completion.

When working with specific file formats like ORC or Parquet and interacting with Hive metastore, consider adjusting these Spark settings as needed:

  • --conf spark.sql.hive.convertMetastoreParquet=false
  • --conf spark.sql.hive.convertMetastoreOrc=false

Reference:

  1. https://community.cloudera.com/t5/Support-Questions/Insert-overwrite-with-in-the-same-table-in-spark...
  2. https://www.baifachuan.com/posts/da7bb348.html

avatar
Contributor

Thank you, my friend. A week ago, I read through your configurations in the official documentation and experimented with them. However, I encountered an error along the lines of 'class not found.' Currently, I've identified the root cause: I'm using HDP 3.1.0, which includes PySpark 2.3.2.3.1.0.0-78. Therefore, I upgraded it to PySpark 3, while still using the standalone-metastore-1.21.2.3.1.0.0-78-hive3.jar file by default. That's the reason why, when using the configuration, I encountered the 'class not found' error. Now, I've replaced that JAR file with hive-metastore-2.3.9.jar. Everything is working fine now. Once again, thank you, my friend.

avatar
Super Collaborator

If above answers are helped you, please accept as Solution. It will helpful for others.