Hello Experts,
My current platform is built on the following components:
CDP 7.1.7
Hive 3
Spark 2.4.7
Delta library
Hadoop 3.1
All of our data tables are Hive external tables stored in Parquet format.
At present we have a PySpark streaming solution in place that reads JSON data from a GoldenGate feed and processes it with PySpark and the Delta library in a cyclic fashion.
In each cycle, after the streaming data has been processed on the Delta path, we merge the data from the Delta path into the Hive external table path. However, while this update on the Hive path is in progress, anyone who queries the table gets an HDFS I/O "file not found" error.
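To make the flow concrete, here is a simplified sketch of what one cycle does; the paths below are placeholders, not our real ones:

```python
# Simplified sketch of one cycle (paths are placeholders, not our real ones)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gg-delta-to-hive").getOrCreate()

delta_path = "/data/staging/orders_delta"   # Delta path holding the processed cycle output
hive_path = "/warehouse/external/orders"    # location of the Hive external table

# Read the cycle's processed output from the Delta path
cycle_df = spark.read.format("delta").load(delta_path)

# Overwrite the Hive external table location with the merged data.
# While the old Parquet files are being replaced, concurrent queries on the
# table fail with an HDFS "file not found" error.
cycle_df.write.mode("overwrite").parquet(hive_path)
```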
To solve this, we want the update on the Hive external table to be atomic, so that read queries do not fail while an update is in progress. We have tried the approaches below, but without success (rough sketches of 1 and 4 follow the list):
1) spark.sql
2) hive.sql
3) df.write (using a JDBC connection)
4) A regular df.write in overwrite mode with overwrite=True
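As rough illustrations only (simplified from the actual job; the table and view names are placeholders, and spark / cycle_df are as in the sketch above), attempts 1 and 4 looked roughly like this:

```python
# 1) spark.sql: INSERT OVERWRITE through Spark's own SQL engine
cycle_df.createOrReplaceTempView("cycle_data")
spark.sql("INSERT OVERWRITE TABLE db.orders SELECT * FROM cycle_data")

# 4) a regular DataFrame write with overwrite=True
cycle_df.write.insertInto("db.orders", overwrite=True)

# Both of these still caused concurrent reads on the table to fail while
# the overwrite was in progress.
```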
None of these prevented the read failures. However, an "INSERT OVERWRITE ..." query run from DBeaver against Hive, or from the Impala editor, works fine: read queries issued while it runs do not fail. Please share your suggestions on how to achieve the same behavior programmatically, ideally the PySpark way.
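Since the statement behaves correctly when it goes through HiveServer2 (which is what DBeaver connects to), would submitting the same SQL from Python over a HiveServer2 connection be a reasonable direction? A minimal sketch of that idea, assuming the pyhive package; the host, port, database, and table names are placeholders and auth settings are omitted:

```python
# Sketch of the idea only: run the INSERT OVERWRITE through HiveServer2 from
# Python instead of through Spark's writers. pyhive, the host/port, and the
# table names are assumptions; a secured cluster would also need auth settings.
from pyhive import hive

conn = hive.Connection(host="hs2-host.example.com", port=10000, database="db")
cur = conn.cursor()
cur.execute("INSERT OVERWRITE TABLE orders SELECT * FROM orders_staging")
cur.close()
conn.close()
```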
Do Hive external tables support atomic operations?
Thanks
Sagar