I'm having a hard time understanding the difference between Impala's REFRESH command and its COMPUTE STATS command.
I understand that 'refresh' command refreshes the metadata of a database / table and 'compute stats' calculates the volume of data and its distribution, but my confusion is, isn't this re-calculation already done within the 'refresh' command?
My understanding might be completely wrong, hence reaching out to the SMEs.
Can anyone please explain when to use 'refresh' and when to use 'compute stats'?
REFRESH is meant for the common case where you add new data files to an existing table. It reloads the table's metadata immediately, but only loads the block location data for the newly added data files, which makes it a less expensive operation overall. It does not recalculate table or column statistics; that is the job of COMPUTE STATS.
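COMPUTE STATS gathers the data volume and distribution statistics that the query planner uses. It is recommended to run it when roughly 30% or more of the data in a table has been altered, where altered means the addition or deletion of files/data.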
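As a minimal sketch of how the two statements fit together, assuming a hypothetical table named sales_db.events whose data files were added outside Impala (for example through Hive or by copying files into HDFS):

```sql
-- Hypothetical table name; substitute your own.

-- 1. Make Impala aware of the newly added data files (metadata only).
REFRESH sales_db.events;

-- 2. Recalculate table and column statistics for the query planner.
--    This scans the data, so run it only after a substantial change.
COMPUTE STATS sales_db.events;

-- 3. Inspect what the planner now knows about the table and its columns.
SHOW TABLE STATS sales_db.events;
SHOW COLUMN STATS sales_db.events;
```

Running REFRESH alone after step 1 is enough for queries to see the new data; step 2 only affects how well the planner can optimize those queries.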
INVALIDATE METADATA marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. This is a relatively expensive operation compared to the incremental metadata update done by the REFRESH statement, so in the common scenario of adding new data files to an existing table, prefer REFRESH over INVALIDATE METADATA.
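To illustrate the choice, here is a short sketch; the table names are hypothetical and only stand in for your own:

```sql
-- New data files were added to a table Impala already knows about:
-- the cheaper, incremental option.
REFRESH sales_db.events;

-- A table was created outside Impala (e.g. in Hive), so Impala
-- does not know about it yet: mark its metadata as stale.
INVALIDATE METADATA sales_db.new_table;

-- With no table name, metadata for all tables is marked stale.
-- This is the most expensive form.
INVALIDATE METADATA;
```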