<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383141#M244838</link>
    <description>&lt;P&gt;&lt;SPAN&gt;Thank you. Your point about data integrity is valid, but it's worth noting that PySpark has supported this feature since version 2.1, and there hasn't been any announcement about its removal. I believe this might be a bug.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 05 Feb 2024 13:50:18 GMT</pubDate>
    <dc:creator>sonnh</dc:creator>
    <dc:date>2024-02-05T13:50:18Z</dc:date>
    <item>
      <title>pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/382695#M244659</link>
      <description>&lt;P&gt;Dear team, I have been using PySpark 3.4.2 with the following syntax:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# {YYYYMMDD} and {YYYYMMDD_D_1} are placeholders for the current and previous
# partition dates; substitute them (e.g. with str.format) before running.
sql_query = """
INSERT OVERWRITE TABLE table_1 PARTITION(partition_date = {YYYYMMDD})
SELECT
    current_view.a
    , current_view.b
    , change_capture_view.c
FROM table_2 change_capture_view
FULL OUTER JOIN (
    SELECT * FROM table_1 WHERE partition_date = {YYYYMMDD_D_1}
) current_view
    ON change_capture_view.a &amp;lt;=&amp;gt; current_view.a
WHERE change_capture_view.a IS NULL
"""&lt;/LI-CODE&gt;&lt;P&gt;I then ran spark.sql(sql_query) and encountered the error:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from.&lt;/LI-CODE&gt;&lt;P&gt;In essence, I am trying to read data from the partitions of previous dates, process it, and write the result into the partition of the current date on the same table.&lt;/P&gt;&lt;P&gt;The same syntax worked normally on PySpark 2.3.2.3.1.0.0-78. Can someone help me with this issue? I've tried creating a temporary table from table_1, but still encountered a similar error.&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jan 2024 04:00:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/382695#M244659</guid>
      <dc:creator>sonnh</dc:creator>
      <dc:date>2024-01-25T04:00:52Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/382809#M244696</link>
      <description>&lt;P&gt;I think this is a bug in Spark. I checked the SQL migration guide (&lt;A href="https://spark.apache.org/docs/latest/sql-migration-guide.html" target="_new"&gt;https://spark.apache.org/docs/latest/sql-migration-guide.html&lt;/A&gt;), but I haven't seen any notes about this problem.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I found a temporary workaround for this issue: write directly to the storage location of the target partition of the table. I implemented it as follows:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;-- Create a test table:
CREATE EXTERNAL TABLE IF NOT EXISTS staging.current_sonnh (
    `date` DATE,
    deal_id STRING,
    hr_code STRING,
    custid STRING
)
PARTITIONED BY (partition_date STRING)
STORED AS ORC
LOCATION '/lake/staging_zone/sonnh/current_sonnh'
TBLPROPERTIES("orc.compress"="SNAPPY", "external.table.purge"="true");&lt;/LI-CODE&gt;&lt;LI-CODE lang="python"&gt;-- Insert sample data (string literals match the STRING column types)
INSERT INTO TABLE staging.current_sonnh
    (`date`, deal_id, hr_code, custid, partition_date)
SELECT
    TO_DATE('2024-01-01'), '1234', 'HR1234', 'CI1234', '20240101';&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Then, with an initialized Spark session, run the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Read the source partition directly from its storage location.
x = spark.read.format("orc").load('/lake/staging_zone/sonnh/current_sonnh/partition_date=20240101')
# Register the target partition in the metastore, then write the files to its location.
spark.sql("ALTER TABLE staging.current_sonnh ADD PARTITION (partition_date='20240102')")
x.write.mode("overwrite").orc("/lake/staging_zone/sonnh/current_sonnh/partition_date=20240102")&lt;/LI-CODE&gt;</description>
      <pubDate>Mon, 29 Jan 2024 02:52:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/382809#M244696</guid>
      <dc:creator>sonnh</dc:creator>
      <dc:date>2024-01-29T02:52:52Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383108#M244822</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105062"&gt;@sonnh&lt;/a&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Generally, it is not advisable to read from and write to the same table at the same time. If the job fails, the result can be anything from data corruption to complete data loss.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;As a temporary workaround, first read the table data into a temporary view, then process that data, and finally save the result to the destination table.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;References:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;A href="https://stackoverflow.com/questions/38746773/read-from-a-hive-table-and-write-back-to-it-using-spark-sql" target="_blank"&gt;https://stackoverflow.com/questions/38746773/read-from-a-hive-table-and-write-back-to-it-using-spark-sql&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://issues.apache.org/jira/browse/SPARK-27030" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-27030&lt;/A&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Sun, 04 Feb 2024 15:58:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383108#M244822</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2024-02-04T15:58:28Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383141#M244838</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Thank you. Your point about data integrity is valid, but it's worth noting that PySpark has supported this feature since version 2.1, and there hasn't been any announcement about its removal. I believe this might be a bug.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 05 Feb 2024 13:50:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383141#M244838</guid>
      <dc:creator>sonnh</dc:creator>
      <dc:date>2024-02-05T13:50:18Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383180#M244848</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105062"&gt;@sonnh&lt;/a&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Spark&lt;/STRONG&gt; and &lt;STRONG&gt;Hive&lt;/STRONG&gt; differ in how they handle &lt;STRONG&gt;reading&lt;/STRONG&gt; data from a table and &lt;STRONG&gt;writing&lt;/STRONG&gt; it back to the same table. &lt;STRONG&gt;Spark&lt;/STRONG&gt; typically &lt;STRONG&gt;clears the target path&lt;/STRONG&gt; before &lt;STRONG&gt;writing new data&lt;/STRONG&gt;, while &lt;STRONG&gt;Hive&lt;/STRONG&gt; writes to a &lt;STRONG&gt;temporary directory first&lt;/STRONG&gt; and then &lt;STRONG&gt;replaces&lt;/STRONG&gt; the &lt;STRONG&gt;target path&lt;/STRONG&gt; with the result data upon task completion.&lt;/P&gt;&lt;P&gt;When working with file formats like ORC or Parquet through the Hive metastore, consider setting the following Spark options. With them set to false, Spark uses the Hive SerDe for these tables instead of its built-in ORC and Parquet support:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;--conf spark.sql.hive.convertMetastoreParquet=false&lt;/LI&gt;&lt;LI&gt;--conf spark.sql.hive.convertMetastoreOrc=false&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;References:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://community.cloudera.com/t5/Support-Questions/Insert-overwrite-with-in-the-same-table-in-spark/m-p/242780" target="_blank" rel="noopener"&gt;https://community.cloudera.com/t5/Support-Questions/Insert-overwrite-with-in-the-same-table-in-spark/m-p/242780&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://www.baifachuan.com/posts/da7bb348.html" target="_blank" rel="noopener"&gt;https://www.baifachuan.com/posts/da7bb348.html&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Tue, 06 Feb 2024 04:58:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383180#M244848</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2024-02-06T04:58:43Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383181#M244849</link>
      <description>&lt;P&gt;If the answers above helped you, please accept one as the solution. It will be helpful for others.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Feb 2024 04:59:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383181#M244849</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2024-02-06T04:59:56Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383557#M244951</link>
      <description>&lt;P&gt;Thank you, my friend. A week ago, I read about these configurations in the official documentation and experimented with them. However, I encountered an error along the lines of 'class not found.' I have now identified the root cause: I'm using HDP 3.1.0, which includes PySpark 2.3.2.3.1.0.0-78. I upgraded to PySpark 3 but was still using the default standalone-metastore-1.21.2.3.1.0.0-78-hive3.jar file, which is why the configuration triggered the 'class not found' error. I've now replaced that JAR file with hive-metastore-2.3.9.jar, and everything is working fine. Once again, thank you, my friend.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Feb 2024 10:54:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-errors-exceptions-captured-AnalysisException-Cannot/m-p/383557#M244951</guid>
      <dc:creator>sonnh</dc:creator>
      <dc:date>2024-02-16T10:54:49Z</dc:date>
    </item>
  </channel>
</rss>

