After Impala Refresh Metadata is still stale

Contributor

I have an Oozie workflow in which a Spark job loads data into a table. I then refresh the table in Impala and run an Impala query to export the most recent data in the table to a CSV file.

My problem is that even after the Impala refresh I do not get the most recent data, only the data from the previous load.

For example, the workflow starts at 1:00 pm, the Spark job finishes at 1:15 pm, the Impala refresh is executed at 1:20 pm, and my export query runs at 1:25 pm. The export only shows the data from the previous workflow, which ran at 12:00 pm, and not the data from the workflow that ran at 1:00 pm.

 

I am using Oozie and CDH 5.15.1.

Sample warning message: Read 972.32 MB of data across network that was expected to be local. Block locality metadata for table '..' may be stale. Consider running "INVALIDATE METADATA ...

Thanks

3 REPLIES

Super Guru
@gimp077 ,

When you say you refreshed the table, did you run "REFRESH <tablename>" or "INVALIDATE METADATA"? The two do not work the same way.

Is your table partitioned? If so, can you see the new partition from Impala by running "SHOW PARTITIONS <tablename>"?

Cheers
Eric
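
For reference, the three statements mentioned in this reply can be issued from a script through impala-shell. The sketch below only builds the command lines and prints them. The host `impalad-host:21000` and the table `mydb.mytable` are placeholders, not values from this thread.

```python
import shlex

def impala_shell_cmd(statement, impalad="impalad-host:21000"):
    """Build an impala-shell argument list that runs a single statement.

    -i selects the impalad to connect to, -q passes the query text.
    The default host is a placeholder (assumption).
    """
    return ["impala-shell", "-i", impalad, "-q", statement]

table = "mydb.mytable"  # hypothetical table name

# Reload file and block metadata for a table Impala already knows about.
refresh_cmd = impala_shell_cmd(f"REFRESH {table}")

# Discard the cached metadata for the table so it is reloaded from scratch.
invalidate_cmd = impala_shell_cmd(f"INVALIDATE METADATA {table}")

# List the partitions Impala currently sees.
show_cmd = impala_shell_cmd(f"SHOW PARTITIONS {table}")

for cmd in (refresh_cmd, invalidate_cmd, show_cmd):
    print(shlex.join(cmd))
```

Running the SHOW PARTITIONS command right after the refresh step is one way to check whether the newly loaded partition is visible before the export query starts.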

Contributor

Hi Eric,

 

My table is partitioned. I was expecting that after I refresh the table I would see the most recent data in it.

However, sometimes there is a lag between when the refresh completes and when I see the most recent data.

I think INVALIDATE METADATA would fix this issue, but it would be costly to run on a large table.

 

Thanks
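
On the cost concern: Impala's REFRESH statement also accepts a PARTITION clause, which limits the reload to a single partition instead of the whole table. Whether that removes the lag described here is not established in this thread. The sketch below only builds the statement text, and the table name and partition columns (`load_date`, `hr`) are hypothetical.

```python
def refresh_partition_stmt(table, partition):
    """Build a REFRESH ... PARTITION (...) statement.

    `partition` maps partition column names to values. Integers are
    written bare, everything else as a single-quoted string literal.
    """
    def literal(value):
        if isinstance(value, int):
            return str(value)
        text = str(value)
        if "'" in text:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError(f"unsupported partition value: {text!r}")
        return f"'{text}'"

    spec = ", ".join(f"{col}={literal(val)}" for col, val in partition.items())
    return f"REFRESH {table} PARTITION ({spec})"

# Hypothetical table and partition columns, for illustration only.
stmt = refresh_partition_stmt("mydb.mytable", {"load_date": "2019-01-01", "hr": 13})
print(stmt)
```

The load job already knows which partition it wrote, so the workflow could pass those values to the refresh step rather than refreshing the whole table.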

Super Guru
@gimp077 ,

Do you mean that "REFRESH" takes time and you can eventually see the updated data, just with some delay?

How big is the table, in terms of the number of partitions and the number of files in HDFS?

Eric
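
One way to answer the size question is `SHOW PARTITIONS <tablename>` in Impala for the partition count and `hdfs dfs -count <path>` for the file count. The latter prints the columns DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME. The sketch below parses one such line. The sample line, including the path and the numbers, is made up for illustration and is not output from this thread.

```python
from typing import NamedTuple

class HdfsCount(NamedTuple):
    dir_count: int
    file_count: int
    content_size: int  # bytes
    path: str

def parse_hdfs_count(line: str) -> HdfsCount:
    """Parse one line of `hdfs dfs -count` output.

    The line has four whitespace-separated columns:
    DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    """
    dirs, files, size, path = line.split(None, 3)
    return HdfsCount(int(dirs), int(files), int(size), path.strip())

# Illustrative sample line (assumption, not real output).
sample = "          25          340         1019551744 /user/hive/warehouse/mydb.db/mytable"
counts = parse_hdfs_count(sample)
print(counts.file_count, "files,", counts.content_size, "bytes under", counts.path)
```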