Problem statement:
I get the error below when I query an Iceberg table for the current date. The table receives data from a streaming pipeline at 5-minute intervals.
Example: select * from <table> where result_date="<current_date>" limit 1;
Error:
ImpalaRuntimeException: Cannot find file in cache:: Cannot find file in cache: hdfs://xx/ya/Zzz/data/resulted/00004-22575-da5239e5-71d0-4b2f-af6b-73cbf4b7d9c5-46884-00001.parquet with snapshot id: 2154647205402518684
Workarounds tried:
- INVALIDATE METADATA or REFRESH - works for a few minutes until the next commit occurs, then the same error is thrown with a new file and a new snapshot id
- Tried setting the properties below as TBLPROPERTIES, but it did not help
ALTER TABLE db.table_name SET TBLPROPERTIES (
'metadata_refresh_interval_ms' = '60000',
'refresh-before-read' = 'true'
);
- Also tried to understand whether the properties below have any impact, but it seems they do not
write.metadata.delete-after-commit.enabled
write.metadata.previous-versions-max
- I am unable to understand why this issue is popping up, given that Iceberg maintains isolation. The same table can be queried via spark3-shell
- Also, with the same table properties, some tables that get data from the same pipeline at the same interval can be queried successfully, but a few tables cannot
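For reference, the refresh attempts in the first workaround correspond to Impala statements of roughly the following form. This is a sketch, not the exact statements that were run: db.table_name is the same placeholder used above, and the DESCRIBE HISTORY line is an added suggestion (assuming an Impala version that supports it for Iceberg tables) for checking whether the snapshot id from the error appears in the table's snapshot history.

```sql
-- Placeholder table name; substitute the affected table.
-- Discard the cached metadata for the table so it is reloaded on next access.
INVALIDATE METADATA db.table_name;

-- Lighter-weight reload of the table's file and snapshot metadata.
REFRESH db.table_name;

-- Assumption: this Impala version supports DESCRIBE HISTORY on Iceberg tables.
-- Lists the snapshots Impala knows about, to compare against the snapshot id in the error.
DESCRIBE HISTORY db.table_name;
```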
Any solution would be of great help.