Created 02-06-2023 01:25 PM
Hello Experts,
My current platform is built on the components below:
CDP 7.1.7
Hive 3
Spark 2.4.7
DeltaLibrary
Hadoop 3.1
All our data tables are Hive external tables in Parquet format.
At present we have a PySpark streaming solution in place that reads JSON data from a GoldenGate feed and processes it through PySpark and the Delta library in a cyclic fashion.
In each cycle, once the streaming data has been processed on the Delta path, we merge the data from the Delta path into the Hive external table path. While this update on the Hive path is in progress, anyone who runs a query against the table gets an HDFS I/O "file not found" error.
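For reference, a simplified sketch of the kind of cycle we run is below; the paths, schema, and the key column "id" are illustrative placeholders, not our actual job (this assumes Delta Lake 0.6.x on Spark 2.4):

# Simplified, illustrative sketch of the streaming cycle (placeholder names).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("gg-stream-merge").getOrCreate()

delta_path = "/data/delta/orders"          # intermediate Delta path (placeholder)
hive_path = "/warehouse/external/orders"   # Hive external table location (placeholder)

# Placeholder schema for the GoldenGate JSON feed.
gg_schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
    StructField("op_type", StringType()),  # GoldenGate operation type
])

def merge_batch(batch_df, batch_id):
    # Upsert the micro-batch into the Delta path on the key column.
    target = DeltaTable.forPath(spark, delta_path)
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
    # Refresh the Hive external Parquet path from the Delta path.
    # This overwrite is where concurrent readers hit the "file not found" error.
    (spark.read.format("delta").load(delta_path)
          .write.mode("overwrite").parquet(hive_path))

(spark.readStream
      .schema(gg_schema)
      .json("/landing/gg/orders")           # GoldenGate JSON feed (placeholder)
      .writeStream
      .foreachBatch(merge_batch)
      .option("checkpointLocation", "/checkpoints/orders")
      .start())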
To solve this, we want to perform an atomic update on the Hive external table so that read queries do not fail while updates are happening. We have tried the following approaches:
1) spark.sql
2) hive.sql
3) df.write (using a JDBC connection)
4) Regular df.write with overwrite mode and overwrite=True
None of these worked. However, an "INSERT OVERWRITE ..." query runs fine via DBeaver against Hive or through the Impala editor, and read queries do not fail while it runs. Please share your suggestions on how to solve this programmatically, the PySpark way.
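Continuing the sketch above, the programmatic variants of attempts 1) and 4) looked roughly like this (the table name db.orders_ext and the view name are placeholders):

# Programmatic overwrite attempts, continuing the sketch above (placeholder names).

# Expose the Delta data to SQL via a temporary view.
spark.read.format("delta").load(delta_path).createOrReplaceTempView("delta_staging")

# Attempt 1): spark.sql with the same statement that works from DBeaver.
spark.sql("""
    INSERT OVERWRITE TABLE db.orders_ext
    SELECT * FROM delta_staging
""")

# Attempt 4): plain DataFrame write with overwrite.
(spark.read.format("delta").load(delta_path)
      .write
      .mode("overwrite")
      .insertInto("db.orders_ext", overwrite=True))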
Do external Hive tables support atomic operations?
Thanks
Sagar
Created 02-10-2023 10:25 AM
Any help on this is much appreciated!
Thanks
Sagar
Created 06-16-2023 12:04 AM
Check the possibility of using a Hive managed table.
With Hive managed tables you won't need a separate merge job, since Hive compaction takes care of this by default, provided compaction is enabled.
You can access managed tables through HWC (Hive Warehouse Connector) from Spark.
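A rough sketch of how that could look from PySpark is below; it assumes the HWC jar and the related spark.datasource.hive.warehouse.* configs are set up as per the CDP documentation, and all database, table, and column names are placeholders (exact connector and option names may differ slightly between HWC versions):

# Rough, illustrative sketch of using a Hive managed (ACID) table through HWC.
# Assumes the HWC jar and related Spark configs are in place; names are placeholders.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-example").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

# Read from the managed table through HWC.
orders = hive.executeQuery("SELECT * FROM db.orders_managed")

# Placeholder updates DataFrame; in practice this comes from the streaming job.
updates_df = spark.createDataFrame([("1001", 25.0)], ["id", "amount"])

# Append rows through HWC; Hive ACID keeps concurrent readers consistent
# and background compaction merges the delta files.
(updates_df.write
           .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
           .option("table", "db.orders_managed")
           .mode("append")
           .save())

# Row-level upserts can be expressed as a Hive MERGE and executed through HWC.
hive.executeUpdate("""
    MERGE INTO db.orders_managed t
    USING db.orders_staging s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET amount = s.amount
    WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.amount)
""")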