I have an Oozie workflow in which a job loads some data into a table. I then refresh the table in Impala and run an Impala query to export the most recent data in the table to a CSV file.
My problem is that even after the Impala refresh, I do not get the most recent data, only the data from the previous load.
For example, a workflow starts at 1pm, the Spark job finishes at 1:15pm, the Impala refresh is executed at 1:20pm, and my export query runs at 1:25pm. The export only contains the data from the previous workflow, which ran at 12pm, and not the data from the workflow that ran at 1pm.
I am using Oozie on CDH 5.15.1.
Sample warning message:
Read 972.32 MB of data across network that was expected to be local. Block locality metadata for table '..' may be stale. Consider running "INVALIDATE
My table is partitioned. I was expecting that after refreshing the table I would see the most recent data in it.
However, there is sometimes a lag between when the refresh completes and when I see the most recent data.
I think INVALIDATE METADATA would fix this issue, but it would be costly to run on a large table.
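
As a sketch of a cheaper alternative to a full INVALIDATE METADATA, the refresh and the export could be issued in a single impala-shell session, refreshing only the newly loaded partition and setting the SYNC_DDL query option first. This is an assumption about what might help, not a confirmed fix: the table name, partition column, query, output path and host below are all hypothetical, and whether SYNC_DDL removes the lag in this setup would need to be verified on the cluster.

```python
def build_impala_export_command(table, partition_spec, export_query,
                                csv_path, impalad="localhost:21000"):
    """Build one impala-shell invocation that refreshes, then exports.

    All arguments are placeholders for illustration; none of them come
    from the original workflow.
    """
    statements = [
        # Ask Impala to return only after metadata changes have been
        # propagated to the other nodes (assumed to apply to REFRESH here).
        "SET SYNC_DDL=1",
        # Refresh only the partition that was just loaded.
        f"REFRESH {table} PARTITION ({partition_spec})",
        # The export query, with any trailing semicolon/whitespace removed.
        export_query.rstrip("; \n"),
    ]
    query = "; ".join(statements) + ";"
    return [
        "impala-shell",
        "-i", impalad,            # impalad to connect to
        "-B",                     # plain delimited output instead of a table
        "--output_delimiter=,",   # comma-separated fields
        "-o", csv_path,           # write query results to this file
        "-q", query,              # run all statements in the same session
    ]


# Example usage (hypothetical names); pass the list to subprocess.run(...)
# from the workflow step that currently runs the refresh and the export.
cmd = build_impala_export_command(
    table="my_db.my_table",
    partition_spec="load_date='2019-01-01'",
    export_query="SELECT * FROM my_db.my_table WHERE load_date='2019-01-01';",
    csv_path="/tmp/export.csv",
)
```

Running both statements through the same impalad in one session means the export query is planned by the same coordinator that processed the refresh, rather than by another node that may not have received the updated metadata yet.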