Support Questions

sonnh · ‎01-24-2024

Dears team, I have been using PySpark 3.4.2 with the following syntax:

sql_query = "
INSERT OVERWRITE TABLE table_1 PARTITION(partition_date = {YYYYMMDD})
SELECT
    table_1.a
    , table_1.b
    , table_2.c
FROM table_2 change_capture_view
FULL OUTER JOIN (
    SELECT * FROM table_1 WHERE WHERE partition_date = {YYYYMMDD_D_1}
) current_view
    ON change_capture_view.a <=> current_view.a
WHERE change_capture_view.a IS NULL
"

and use spark.sql(sql_query ). And encountered the error:

File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from.

In essence, I am trying to retrieve data from the partitions of previous dates to process and write into the partition of the current date on the same table.

Although previously, with the same syntax, I used it on PySpark 2.3.2.3.1.0.0-78 and it worked normally. Can someone help me with this issue? I've tried creating a temporary table from table 1, but still encountered a similar error.

RangaReddy · ‎02-05-2024

Hi @sonnh

The way Spark and Hive handle reading and writing data back to the same table differs. Spark typically clears the target path before writing new data, while Hive writes to a temporary directory first and then replaces the target path with the result data upon task completion.

When working with specific file formats like ORC or Parquet and interacting with Hive metastore, consider adjusting these Spark settings as needed:

--conf spark.sql.hive.convertMetastoreParquet=false
--conf spark.sql.hive.convertMetastoreOrc=false

Reference:

View solution in original post

sonnh · ‎01-28-2024

I think this is a bug in Spark. I followed their changes in the documentation (https://spark.apache.org/docs/latest/sql-migration-guide.html), but I haven't seen any notes about this problem.

I find that there is another temporary solution to address this issue. We can directly write to the location of a desired partition on that table. I have implemented it as follows:

-- Create a test table:
CREATE EXTERNAL TABLE IF NOT EXISTS staging.current_sonnh (
`date`	date,	
deal_id STRING,
hr_code STRING,
custid STRING
)
PARTITIONED BY (partition_date STRING)
STORED AS ORC
LOCATION '/lake/staging_zone/sonnh/current_sonnh'
TBLPROPERTIES("orc.compress"="SNAPPY", "external.table.purge"="true");

-- Insert sample data
INSERT  INTO TABLE
	staging.current_sonnh
	(
	`date`, deal_id, hr_code, custid, partition_date
	)
SELECT
	TO_DATE('2024-01-01') , 1234, 'HR1234', 'CI1234', 20240101;

Initialize the Spark session and perform as below:

x = spark.read.format("orc").load('/lake/staging_zone/sonnh/current_sonnh/partition_date=20240101')
spark.sql("ALTER table staging.current_sonnh ADD PARTITION (partition_date=20240102)")
x.write.mode("overwrite").orc("/lake/staging_zone/sonnh/current_sonnh/partition_date=20240102")

RangaReddy · ‎02-04-2024

Hi @sonnh

Generally it is not advisable to read and write the same table at a time. It can result in anything between data corruption and complete data loss in case of failure.

As a temporary solution, First create a temporary view by reading the table data and later you can use that data and finally save the data to destination table.

Reference:

sonnh · ‎02-05-2024

Thank you, though your point about data integrity is valid, it's worth noting that PySpark has supported this feature since version 2.1, and there hasn't been any announcement about its removal. I believe this might be a bug.

RangaReddy · ‎02-05-2024

Hi @sonnh

The way Spark and Hive handle reading and writing data back to the same table differs. Spark typically clears the target path before writing new data, while Hive writes to a temporary directory first and then replaces the target path with the result data upon task completion.

When working with specific file formats like ORC or Parquet and interacting with Hive metastore, consider adjusting these Spark settings as needed:

--conf spark.sql.hive.convertMetastoreParquet=false
--conf spark.sql.hive.convertMetastoreOrc=false

Reference:

sonnh · ‎02-16-2024

Thank you, my friend. A week ago, I read through your configurations in the official documentation and experimented with them. However, I encountered an error along the lines of 'class not found.' Currently, I've identified the root cause: I'm using HDP 3.1.0, which includes PySpark 2.3.2.3.1.0.0-78. Therefore, I upgraded it to PySpark 3, while still using the standalone-metastore-1.21.2.3.1.0.0-78-hive3.jar file by default. That's the reason why, when using the configuration, I encountered the 'class not found' error. Now, I've replaced that JAR file with hive-metastore-2.3.9.jar. Everything is working fine now. Once again, thank you, my friend.

RangaReddy · ‎02-05-2024

If above answers are helped you, please accept as Solution. It will helpful for others.

Cloudera Community

Support Questions

pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from

putFile Overwrite File

Write / Read Parquet File in Spark

How to read fsimage

FlowFile not being read and/or written when tryin...

Spark to read the Hive table sub-directory data

Read from HDFS in R script

Apache SPARK - Overwrite data file

insert overwrite directory - beeline

Build and use Parquet-tools to read parquet files

Reading ORC files using Mapreduce