Member since: 05-15-2023
Posts: 12
Kudos Received: 2
Solutions: 0
02-16-2024
02:54 AM
1 Kudo
Thank you, my friend. A week ago I read through the configurations in the official documentation and experimented with them, but I hit an error along the lines of 'class not found'. I have now identified the root cause: I am on HDP 3.1.0, which ships PySpark 2.3.2.3.1.0.0-78. I upgraded to PySpark 3 while still using the default standalone-metastore-1.21.2.3.1.0.0-78-hive3.jar, which is why the configuration raised the 'class not found' error. After replacing that JAR with hive-metastore-2.3.9.jar, everything works fine. Once again, thank you, my friend.
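For anyone hitting the same mismatch, a minimal sketch of how Spark 3 can be pointed at an explicit Hive metastore client version instead of the HDP standalone-metastore JAR; the JAR directory path below is an assumption for illustration:

from pyspark.sql import SparkSession

# Sketch: ask Spark 3 to use a Hive 2.3.9 metastore client.
# /opt/hive-metastore-jars/* is a hypothetical location for
# hive-metastore-2.3.9.jar and its dependencies.
spark = (
    SparkSession.builder
    .appName("hive-metastore-2.3.9")
    .config("spark.sql.hive.metastore.version", "2.3.9")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path", "/opt/hive-metastore-jars/*")
    .enableHiveSupport()
    .getOrCreate()
)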
02-05-2024
05:50 AM
Thank you. Although your point about data integrity is valid, it's worth noting that PySpark has supported this feature since version 2.1, and there has been no announcement of its removal. I believe this might be a bug.
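For reference, a minimal sketch of the pattern in question: reading one partition of a table and overwriting another partition of the same table in a single statement. The table, columns, and dates are illustrative assumptions, not from the thread:

# Sketch: self-referencing INSERT OVERWRITE; staging.t and its columns are hypothetical.
spark.sql("""
    INSERT OVERWRITE TABLE staging.t PARTITION (partition_date = 20240102)
    SELECT a, b, c FROM staging.t WHERE partition_date = 20240101
""")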
01-28-2024
06:51 PM
I think this is a bug in Spark. I followed their changes in the documentation (https://spark.apache.org/docs/latest/sql-migration-guide.html), but I haven't seen any notes about this problem. There is another temporary workaround for this issue: write directly to the location of the desired partition of the table. I implemented it as follows.

-- Create a test table:
CREATE EXTERNAL TABLE IF NOT EXISTS staging.current_sonnh (
`date` date,
deal_id STRING,
hr_code STRING,
custid STRING
)
PARTITIONED BY (partition_date STRING)
STORED AS ORC
LOCATION '/lake/staging_zone/sonnh/current_sonnh'
TBLPROPERTIES("orc.compress"="SNAPPY", "external.table.purge"="true");

-- Insert sample data:
INSERT INTO TABLE
staging.current_sonnh
(
`date`, deal_id, hr_code, custid, partition_date
)
SELECT
TO_DATE('2024-01-01'), 1234, 'HR1234', 'CI1234', 20240101;

Initialize the Spark session and proceed as below:

x = spark.read.format("orc").load('/lake/staging_zone/sonnh/current_sonnh/partition_date=20240101')
spark.sql("ALTER table staging.current_sonnh ADD PARTITION (partition_date=20240102)")
x.write.mode("overwrite").orc("/lake/staging_zone/sonnh/current_sonnh/partition_date=20240102")
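A quick way to verify the directly written partition is visible through the table; this check is a sketch, not part of the original workaround:

# Sketch: confirm the new partition is readable via the table after ADD PARTITION.
spark.sql("SELECT COUNT(*) FROM staging.current_sonnh WHERE partition_date = 20240102").show()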
01-24-2024
07:28 PM
1 Kudo
Dear team, I have been using PySpark 3.4.2 with the following syntax:

sql_query = """
INSERT OVERWRITE TABLE table_1 PARTITION (partition_date = {YYYYMMDD})
SELECT
    current_view.a
    , current_view.b
    , change_capture_view.c
FROM table_2 change_capture_view
FULL OUTER JOIN (
    SELECT * FROM table_1 WHERE partition_date = {YYYYMMDD_D_1}
) current_view
ON change_capture_view.a <=> current_view.a
WHERE change_capture_view.a IS NULL
"""

and then call spark.sql(sql_query). I encountered this error:

File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/hdp/3.1.0.0-78/spark3/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
pyspark.errors.exceptions.captured.AnalysisException: Cannot overwrite a path that is also being read from.

In essence, I am trying to read data from the partitions of previous dates, process it, and write it into the current date's partition of the same table. The same syntax worked fine on PySpark 2.3.2.3.1.0.0-78. Can someone help me with this issue? I have already tried creating a temporary table from table_1, but I still encountered a similar error.
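For completeness, a minimal sketch of how the {YYYYMMDD} placeholders above might be filled in before submitting the statement; the concrete date values are assumptions for illustration:

# Sketch: substitute the partition-date placeholders, then run the statement.
# The dates below are hypothetical.
spark.sql(sql_query.format(YYYYMMDD="20240102", YYYYMMDD_D_1="20240101"))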
Labels:
- Apache Spark
09-15-2023
03:16 AM
I got the same error.
08-18-2023
12:23 AM
Thanks @RangaReddy. My purpose is to collect a series of paged reads from an RDBMS and compare their total size with JVM_HEAP_MEMORY. Do you find this approach acceptable? I believe it could help alleviate the small-files issue on HDFS. I'm having difficulty calculating the size of the DataFrame; there seems to be no straightforward way to do it.
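In case it helps others, a rough sketch of one way to approximate a DataFrame's size from a sample; the sample fraction and the string-serialization heuristic are assumptions, and this is a crude estimate rather than an exact measure of JVM memory usage:

# Sketch: extrapolate total size from the serialized length of a sampled subset.
# df is the DataFrame in question; actual JVM memory usage will differ.
fraction = 0.01  # hypothetical sample fraction
sample_bytes = (
    df.sample(fraction=fraction, seed=42)
      .rdd
      .map(lambda row: len(str(row)))
      .sum()
)
estimated_total_bytes = sample_bytes / fraction
print(estimated_total_bytes)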
08-14-2023
01:34 AM
I am using Spark 2.3.2.3.1.0.0-78. I tried spark_session.sparkContext._conf.get('spark.executor.memory'), but I only received None. Can someone help me, please?
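A sketch of an alternative that avoids the private _conf attribute; note that the key is only present if it was set explicitly, so the fallback below is an assumption based on Spark's documented default of 1g when unset:

# Sketch: read the executor memory setting through the public API,
# falling back to the documented default when it was never set.
executor_memory = spark_session.sparkContext.getConf().get("spark.executor.memory", "1g")
print(executor_memory)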
Labels:
- Apache Spark
07-17-2023
01:55 AM
Hello Team, we are using HDP-3.1.0. We executed the import-hive.sh script to import already existing Hive tables into Atlas, and it completed successfully. We can now see all Hive databases and tables in Atlas, but we cannot see data lineage for the imported tables. If we create a new external table on an HDFS path, or a new managed table, its lineage does appear in Atlas; only the old, imported tables are missing lineage. Why are we not getting lineage for the older tables? Please suggest, we are stuck. Thanks.
Labels:
- Apache Atlas
07-05-2023
09:03 PM
It works for me.
05-16-2023
02:11 AM
Thanks @RangaReddy. Not only my team but many other companies also encounter this issue with Spark Thrift Server. Do I need to provide any additional information to open a Cloudera case based on the description above?