Created on 01-24-2024 07:28 PM - edited 01-24-2024 08:00 PM
Dear team, I have been using PySpark 3.4.2 with the following syntax:
sql_query = f"""
INSERT OVERWRITE TABLE table_1 PARTITION (partition_date = {YYYYMMDD})
SELECT
    current_view.a
    , current_view.b
    , change_capture_view.c
FROM table_2 change_capture_view
FULL OUTER JOIN (
    SELECT * FROM table_1 WHERE partition_date = {YYYYMMDD_D_1}
) current_view
ON change_capture_view.a <=> current_view.a
WHERE change_capture_view.a IS NULL
"""
spark.sql(sql_query)
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from.
In essence, I am trying to read data from previous dates' partitions, process it, and write the result into the current date's partition of the same table.
Created on 01-28-2024 06:51 PM - edited 01-28-2024 06:52 PM
I think this is a bug in Spark. I went through the changes listed in the documentation (https://spark.apache.org/docs/latest/sql-migration-guide.html), but I haven't seen any notes about this problem.
I found another temporary workaround for this issue: write directly to the location of the desired partition of the table. I have implemented it as follows:
-- Create a test table:
CREATE EXTERNAL TABLE IF NOT EXISTS staging.current_sonnh (
`date` date,
deal_id STRING,
hr_code STRING,
custid STRING
)
PARTITIONED BY (partition_date STRING)
STORED AS ORC
LOCATION '/lake/staging_zone/sonnh/current_sonnh'
TBLPROPERTIES("orc.compress"="SNAPPY", "external.table.purge"="true");
-- Insert sample data
INSERT INTO TABLE staging.current_sonnh (
    `date`, deal_id, hr_code, custid, partition_date
)
SELECT
    TO_DATE('2024-01-01'), 1234, 'HR1234', 'CI1234', 20240101;
Then initialize the Spark session and run the following:
# Read the existing partition directly from its storage location
x = spark.read.format("orc").load("/lake/staging_zone/sonnh/current_sonnh/partition_date=20240101")
# Register the new partition in the metastore
spark.sql("ALTER TABLE staging.current_sonnh ADD PARTITION (partition_date=20240102)")
# Write the result directly to the new partition's location
x.write.mode("overwrite").orc("/lake/staging_zone/sonnh/current_sonnh/partition_date=20240102")
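As a quick sanity check, the new partition should now be queryable through the table itself, since it was registered with ADD PARTITION above:

# Verify that the new partition is visible through the Hive table
spark.sql("SELECT * FROM staging.current_sonnh WHERE partition_date = 20240102").show()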
Created 02-04-2024 07:58 AM
Hi @sonnh
Generally it is not advisable to read from and write to the same table at the same time. It can result in anything from data corruption to complete data loss in case of failure.
As a workaround, first create a temporary view by reading the table data, then build your result from that view, and finally save the result to the destination table, as sketched below.
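A minimal sketch of that approach in PySpark. Note that because Spark evaluates lazily, a plain temporary view on its own may still leave the plan reading from the target path, so this version materializes the intermediate result to a staging location first; the staging path, partition values, and column names here are placeholders, not the original poster's actual ones:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder staging location, outside the destination table's path
staging_path = "/lake/staging_zone/tmp/table_1_staging"

# 1. Build the result from the previous day's partition (still lazy here)
result = spark.sql("""
    SELECT current_view.a, current_view.b, change_capture_view.c
    FROM table_2 change_capture_view
    FULL OUTER JOIN (
        SELECT * FROM table_1 WHERE partition_date = 20240101
    ) current_view
    ON change_capture_view.a <=> current_view.a
    WHERE change_capture_view.a IS NULL
""")

# 2. Materialize the result somewhere that is NOT the destination path
result.write.mode("overwrite").orc(staging_path)

# 3. Read the materialized copy back; its lineage no longer references table_1,
#    so the overwrite check is not triggered
spark.read.orc(staging_path).createOrReplaceTempView("staged_result")
spark.sql("""
    INSERT OVERWRITE TABLE table_1 PARTITION (partition_date = 20240102)
    SELECT a, b, c FROM staged_result
""")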
Created 02-05-2024 05:50 AM
Thank you. Although your point about data integrity is valid, it's worth noting that PySpark has supported this pattern since version 2.1, and there hasn't been any announcement of its removal. I believe this might be a bug.
Created 02-05-2024 08:58 PM
Hi @sonnh
The way Spark and Hive handle reading and writing data back to the same table differs. Spark typically clears the target path before writing new data, while Hive writes to a temporary directory first and then replaces the target path with the result data upon task completion.
When working with specific file formats like ORC or Parquet and interacting with Hive metastore, consider adjusting these Spark settings as needed:
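The exact list of settings didn't come through above, but based on the Spark documentation for Hive tables, the ones usually involved here are spark.sql.hive.convertMetastoreOrc and spark.sql.hive.convertMetastoreParquet: setting them to false makes Spark go through the Hive SerDe (and Hive's write-to-temporary-directory behaviour) for metastore ORC/Parquet tables instead of its native readers and writers. A minimal sketch, assuming these are indeed the settings meant:

from pyspark.sql import SparkSession

# Assumed settings: with these set to "false", Spark uses the Hive SerDe path
# for ORC/Parquet tables defined in the Hive metastore rather than its
# built-in data source path.
spark = (
    SparkSession.builder
    .appName("insert-overwrite-same-table")
    .config("spark.sql.hive.convertMetastoreOrc", "false")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .enableHiveSupport()
    .getOrCreate()
)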
Created on 02-16-2024 02:54 AM - edited 02-16-2024 02:54 AM
Thank you, my friend. A week ago I read through your configurations in the official documentation and experimented with them, but I ran into an error along the lines of 'class not found.' I've now identified the root cause: I'm using HDP 3.1.0, which ships with PySpark 2.3.2.3.1.0.0-78. I upgraded it to PySpark 3 but was still using the default standalone-metastore-1.21.2.3.1.0.0-78-hive3.jar, which is why those configurations produced the 'class not found' error. I've since replaced that JAR with hive-metastore-2.3.9.jar, and everything is working fine now. Once again, thank you, my friend.
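For anyone who hits the same 'class not found' mismatch, an alternative to swapping JAR files by hand is to point Spark at a compatible Hive metastore client through configuration (available since Spark 3.1; the directory below is a placeholder for wherever the Hive 2.3.9 client JARs live):

from pyspark.sql import SparkSession

# Placeholder path: a directory containing the Hive 2.3.9 client JARs
spark = (
    SparkSession.builder
    .config("spark.sql.hive.metastore.version", "2.3.9")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive-2.3.9-jars/*.jar")
    .enableHiveSupport()
    .getOrCreate()
)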
Created 02-05-2024 08:59 PM
If the above answers helped you, please accept one as the solution. It will be helpful for others.