pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

sonnh — Thu, 25 Jan 2024 04:00:52 GMT

Dear team, I have been using PySpark 3.4.2 with the following syntax:

sql_query = """
    INSERT OVERWRITE TABLE table_1 PARTITION(partition_date = {YYYYMMDD})
    SELECT table_1.a, table_1.b, table_2.c
    FROM table_2 change_capture_view
    FULL OUTER JOIN (
        SELECT * FROM table_1 WHERE partition_date = {YYYYMMDD_D_1}
    ) current_view
    ON change_capture_view.a <=> current_view.a
    WHERE change_capture_view.a IS NULL
"""

I run it with spark.sql(sql_query) and get the following error:

  File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
  File "/usr/hdp/3.1.0.0-78/spark3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from.

In essence, I am trying to retrieve data from the partitions of previous dates to process and write into the partition of the current date on the same table.

The same syntax worked normally when I used it on PySpark 2.3.2.3.1.0.0-78. Can someone help me with this issue? I've tried creating a temporary table from table_1, but I still get a similar error.
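For readers following along, here is a minimal sketch of how the {YYYYMMDD} and {YYYYMMDD_D_1} placeholders in the query could be filled for a given run date. The helper name partition_keys and the shortened template are hypothetical and only illustrate the idea; the original post does not show how the placeholders are substituted.

```python
from datetime import date, timedelta


def partition_keys(run_date):
    """Return the current and previous-day partition keys as YYYYMMDD strings.

    Hypothetical helper: the key names mirror the placeholders in the query.
    """
    fmt = "%Y%m%d"
    return {
        "YYYYMMDD": run_date.strftime(fmt),
        "YYYYMMDD_D_1": (run_date - timedelta(days=1)).strftime(fmt),
    }


# Shortened stand-in for the real query, using the same placeholder names.
template = "SELECT * FROM table_1 WHERE partition_date = {YYYYMMDD_D_1}"
query = template.format(**partition_keys(date(2024, 1, 2)))
```

Using timedelta keeps month and year boundaries correct, which manual arithmetic on the integer form of the date would not.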

Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

sonnh — Mon, 29 Jan 2024 02:52:52 GMT

I think this is a bug in Spark. I went through the changes listed in the migration guide (https://spark.apache.org/docs/latest/sql-migration-guide.html), but I haven't seen any notes about this problem.

I found another temporary workaround for this issue: write directly to the location of the desired partition of that table. I implemented it as follows:

-- Create a test table:
CREATE EXTERNAL TABLE IF NOT EXISTS staging.current_sonnh (
    `date` date,
    deal_id STRING,
    hr_code STRING,
    custid STRING
)
PARTITIONED BY (partition_date STRING)
STORED AS ORC
LOCATION '/lake/staging_zone/sonnh/current_sonnh'
TBLPROPERTIES ("orc.compress"="SNAPPY", "external.table.purge"="true");

-- Insert sample data
INSERT INTO TABLE staging.current_sonnh (`date`, deal_id, hr_code, custid, partition_date)
SELECT TO_DATE('2024-01-01'), 1234, 'HR1234', 'CI1234', 20240101;

Then initialize the Spark session and run the following:

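# Read the source partition directly from its storage location
x = spark.read.format("orc").load('/lake/staging_zone/sonnh/current_sonnh/partition_date=20240101')
# Register the target partition in the metastore
spark.sql("ALTER table staging.current_sonnh ADD PARTITION (partition_date=20240102)")
# Write the data straight into the target partition's directory
x.write.mode("overwrite").orc("/lake/staging_zone/sonnh/current_sonnh/partition_date=20240102")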

Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

RangaReddy — Sun, 04 Feb 2024 15:58:28 GMT

Hi @sonnh

Generally, it is not advisable to read from and write to the same table at the same time. If the job fails, the result can be anything from data corruption to complete data loss.

As a temporary solution, first create a temporary view by reading the table data, then use that data to save the result to the destination table.
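A minimal sketch of one way to do this, with some assumptions stated up front. A plain temporary view is evaluated lazily, so Spark still reads the target path when the write runs; this sketch therefore materializes the intermediate result in a separate staging table before overwriting the partition. The function names, the staging table name, and the choice of ORC are hypothetical, and the SELECT must return the target table's non-partition columns in order.

```python
def staging_statements(target, staging, partition_spec, select_sql):
    """Build the SQL statements for an overwrite that goes through a staging table.

    All arguments are plain strings; nothing here is specific to one cluster.
    """
    return [
        # Start from a clean staging table
        f"DROP TABLE IF EXISTS {staging}",
        # Materialize the result so it no longer depends on the target path
        f"CREATE TABLE {staging} STORED AS ORC AS {select_sql}",
        # Overwrite the target partition from the materialized copy
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_spec}) "
        f"SELECT * FROM {staging}",
        # Clean up
        f"DROP TABLE IF EXISTS {staging}",
    ]


def overwrite_via_staging(spark, target, staging, partition_spec, select_sql):
    """Run the statements in order on an existing SparkSession."""
    for statement in staging_statements(target, staging, partition_spec, select_sql):
        spark.sql(statement)
```

Called as overwrite_via_staging(spark, "table_1", "table_1_staging", "partition_date = 20240102", "SELECT ..."), this costs one extra write of the data, in exchange for never reading and overwriting the same path within a single query.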

Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

sonnh — Mon, 05 Feb 2024 13:50:18 GMT

Thank you. Your point about data integrity is valid, but it's worth noting that PySpark has supported this feature since version 2.1, and there has been no announcement of its removal. I believe this might be a bug.

Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

RangaReddy — Tue, 06 Feb 2024 04:58:43 GMT

Hi @sonnh

Spark and Hive handle reading from and writing back to the same table differently. Spark typically clears the target path before writing new data, while Hive writes to a temporary directory first and replaces the target path with the result once the task completes.

When working with file formats such as ORC or Parquet through the Hive metastore, consider adjusting these Spark settings as needed:

--conf spark.sql.hive.convertMetastoreParquet=false
--conf spark.sql.hive.convertMetastoreOrc=false
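The same settings can also be applied when the session is built in code rather than on the command line. The following is a minimal sketch, assuming a PySpark installation with Hive support; the function name and the application name are hypothetical. Setting both flags to false makes Spark use the Hive SerDe for metastore ORC and Parquet tables instead of its built-in readers and writers.

```python
# The two settings from the --conf flags above, as key/value pairs.
HIVE_SERDE_CONF = {
    "spark.sql.hive.convertMetastoreParquet": "false",
    "spark.sql.hive.convertMetastoreOrc": "false",
}


def build_session(app_name="overwrite_same_table"):
    """Create a Hive-enabled SparkSession with the settings applied."""
    # Imported here so the configuration above can be inspected without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in HIVE_SERDE_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```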

Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

RangaReddy — Tue, 06 Feb 2024 04:59:56 GMT

If the answers above helped you, please accept one as the Solution. It will be helpful for others.

Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

sonnh — Fri, 16 Feb 2024 10:54:49 GMT

Thank you, my friend. A week ago, I read about your configurations in the official documentation and experimented with them, but I hit an error along the lines of 'class not found.' I have now identified the root cause: I'm on HDP 3.1.0, which ships PySpark 2.3.2.3.1.0.0-78. I upgraded to PySpark 3 but was still using the default standalone-metastore-1.21.2.3.1.0.0-78-hive3.jar file, which is why the configuration triggered the 'class not found' error. After replacing that JAR file with hive-metastore-2.3.9.jar, everything works fine. Once again, thank you, my friend.