Member since: 05-15-2023
Posts: 12
Kudos Received: 2
Solutions: 0
02-16-2024
02:54 AM
1 Kudo
Thank you, my friend. A week ago I read through the configurations in the official documentation and experimented with them, but I hit an error along the lines of 'class not found'. I have now identified the root cause: I am on HDP 3.1.0, which ships PySpark 2.3.2.3.1.0.0-78. I upgraded to PySpark 3 while still using the default standalone-metastore-1.21.2.3.1.0.0-78-hive3.jar, which is why the configuration raised the 'class not found' error. After replacing that JAR with hive-metastore-2.3.9.jar, everything works fine. Once again, thank you, my friend.
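For anyone hitting the same mismatch, a minimal sketch of how Spark 3 can be pointed at an explicit Hive metastore client version instead of the HDP standalone-metastore JAR; the JAR directory path below is an assumption for illustration:

from pyspark.sql import SparkSession

# Sketch: ask Spark 3 to use a Hive 2.3.9 metastore client.
# /opt/hive-metastore-jars/* is a hypothetical location for
# hive-metastore-2.3.9.jar and its dependencies.
spark = (
    SparkSession.builder
    .appName("hive-metastore-2.3.9")
    .config("spark.sql.hive.metastore.version", "2.3.9")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path", "/opt/hive-metastore-jars/*")
    .enableHiveSupport()
    .getOrCreate()
)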
02-05-2024
05:50 AM
Thank you. Although your point about data integrity is valid, it's worth noting that PySpark has supported this feature since version 2.1, and there has been no announcement of its removal. I believe this might be a bug.
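For reference, a minimal sketch of the pattern in question: reading one partition of a table and overwriting another partition of the same table in a single statement. The table, columns, and dates are illustrative assumptions, not from the thread:

# Sketch: self-referencing INSERT OVERWRITE; staging.t and its columns are hypothetical.
spark.sql("""
    INSERT OVERWRITE TABLE staging.t PARTITION (partition_date = 20240102)
    SELECT a, b, c FROM staging.t WHERE partition_date = 20240101
""")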
01-28-2024
06:51 PM
I think this is a bug in Spark. I followed their changes in the documentation (https://spark.apache.org/docs/latest/sql-migration-guide.html), but I haven't seen any notes about this problem. There is another temporary workaround for this issue: write directly to the location of the desired partition of the table. I implemented it as follows.

-- Create a test table:
CREATE EXTERNAL TABLE IF NOT EXISTS staging.current_sonnh (
`date` date,
deal_id STRING,
hr_code STRING,
custid STRING
)
PARTITIONED BY (partition_date STRING)
STORED AS ORC
LOCATION '/lake/staging_zone/sonnh/current_sonnh'
TBLPROPERTIES("orc.compress"="SNAPPY", "external.table.purge"="true");

-- Insert sample data:
INSERT INTO TABLE
staging.current_sonnh
(
`date`, deal_id, hr_code, custid, partition_date
)
SELECT
TO_DATE('2024-01-01'), 1234, 'HR1234', 'CI1234', 20240101;

Initialize the Spark session and proceed as below:

x = spark.read.format("orc").load('/lake/staging_zone/sonnh/current_sonnh/partition_date=20240101')
spark.sql("ALTER table staging.current_sonnh ADD PARTITION (partition_date=20240102)")
x.write.mode("overwrite").orc("/lake/staging_zone/sonnh/current_sonnh/partition_date=20240102")
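A quick way to verify the directly written partition is visible through the table; this check is a sketch, not part of the original workaround:

# Sketch: confirm the new partition is readable via the table after ADD PARTITION.
spark.sql("SELECT COUNT(*) FROM staging.current_sonnh WHERE partition_date = 20240102").show()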
01-24-2024
07:28 PM
1 Kudo
Dear team, I have been using PySpark 3.4.2 with the following syntax:

sql_query = """
INSERT OVERWRITE TABLE table_1 PARTITION (partition_date = {YYYYMMDD})
SELECT
    current_view.a
    , current_view.b
    , change_capture_view.c
FROM table_2 change_capture_view
FULL OUTER JOIN (
    SELECT * FROM table_1 WHERE partition_date = {YYYYMMDD_D_1}
) current_view
ON change_capture_view.a <=> current_view.a
WHERE change_capture_view.a IS NULL
"""

and then call spark.sql(sql_query). I encountered this error:

File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from.

In essence, I am trying to read data from the partitions of previous dates, process it, and write it into the current date's partition of the same table. The same syntax worked fine on PySpark 2.3.2.3.1.0.0-78. Can someone help me with this issue? I have already tried creating a temporary table from table_1, but I still encountered a similar error.
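For completeness, a minimal sketch of how the {YYYYMMDD} placeholders above might be filled in before submitting the statement; the concrete date values are assumptions for illustration:

# Sketch: substitute the partition-date placeholders, then run the statement.
# The dates below are hypothetical.
spark.sql(sql_query.format(YYYYMMDD="20240102", YYYYMMDD_D_1="20240101"))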
Labels:
- Apache Spark
09-15-2023
03:16 AM
I got the same error.
08-18-2023
12:23 AM
Thanks @RangaReddy. My purpose is to collect a series of paged reads from an RDBMS and compare their total size with JVM_HEAP_MEMORY. Do you find this approach acceptable? I believe it could help alleviate the small-files issue on HDFS. I'm having difficulty calculating the size of the DataFrame; there seems to be no straightforward way to do it.
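In case it helps others, a rough sketch of one way to approximate a DataFrame's size from a sample; the sample fraction and the string-serialization heuristic are assumptions, and this is a crude estimate rather than an exact measure of JVM memory usage:

# Sketch: extrapolate total size from the serialized length of a sampled subset.
# df is the DataFrame in question; actual JVM memory usage will differ.
fraction = 0.01  # hypothetical sample fraction
sample_bytes = (
    df.sample(fraction=fraction, seed=42)
      .rdd
      .map(lambda row: len(str(row)))
      .sum()
)
estimated_total_bytes = sample_bytes / fraction
print(estimated_total_bytes)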
08-14-2023
01:34 AM
I am using Spark 2.3.2.3.1.0.0-78. I tried spark_session.sparkContext._conf.get('spark.executor.memory'), but I only received None. Can someone help me, please?
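A sketch of an alternative that avoids the private _conf attribute; note that the key is only present if it was set explicitly, so the fallback below is an assumption based on Spark's documented default of 1g when unset:

# Sketch: read the executor memory setting through the public API,
# falling back to the documented default when it was never set.
executor_memory = spark_session.sparkContext.getConf().get("spark.executor.memory", "1g")
print(executor_memory)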
Labels:
- Apache Spark
07-17-2023
01:55 AM
Hello Team, we are using HDP-3.1.0. We executed the import-hive.sh script to import already existing Hive tables into Atlas, and it completed successfully. We can now see all Hive databases and tables in Atlas, but we cannot see data lineage for the imported tables. If we create a new external table on an HDFS path, or a new managed table, its lineage does appear in Atlas; only the old, imported tables are missing lineage. Why are we not getting lineage for the older tables? Please suggest, we are stuck. Thanks.
Labels:
- Apache Atlas
07-05-2023
09:03 PM
It works for me.
05-16-2023
02:11 AM
Thanks @RangaReddy. Not only my team but many other companies also encounter this issue with Spark Thrift Server. Do I need to provide any additional information to open a Cloudera case based on the description above?