Refresh the Impala metadata from Hive Metastore?

New Contributor

 

  • How should we refresh the Impala metadata from the Hive Metastore? (INVALIDATE METADATA / REFRESH)
    • From what we analyzed, INVALIDATE METADATA is a costly operation; in the scenario of adding new data files to an existing table, we can do a REFRESH rather than an INVALIDATE METADATA.
    • What is considered best practice?
1 ACCEPTED SOLUTION

Contributor

Scott, 

 

I'll refer you to the documentation on this topic here:

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_refre...

and

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_inval...

 

In terms of "best practice":

 

Use the REFRESH statement to load the latest metastore metadata and block location data for a particular table in these scenarios:

  • After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why metadata needs to be refreshed.)
  • After issuing ALTER TABLE, INSERT, LOAD DATA, or another table-modifying SQL statement in Hive.
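
For example, here is a minimal sketch of that ETL scenario (the sales_db.transactions name is hypothetical):

    -- Run in impala-shell after new data files land in the table's
    -- HDFS data directory; REFRESH reloads the metadata and block
    -- locations for just this one table.
    REFRESH sales_db.transactions;

    -- The newly added files are now visible to queries:
    SELECT COUNT(*) FROM sales_db.transactions;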

INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when it is needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall.

If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads.

In Impala 1.1 and higher, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table; thus the table name argument is now required.
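
And a sketch of the other case (again, the table name is hypothetical):

    -- Run after a more extensive change, such as the HDFS balancer
    -- reorganizing blocks or a table being created or heavily
    -- altered in Hive:
    INVALIDATE METADATA sales_db.transactions;

    -- Nothing is reloaded yet; the metadata is fetched lazily by the
    -- next query that touches the table:
    SELECT COUNT(*) FROM sales_db.transactions;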

 

Let me know if this doesn't answer your question.

 

Thanks

 

Jeff

 


3 REPLIES


Community Manager

Thanks for the great question! We have even created a Community Knowledge Article based on this thread. 


Cy Jervis, Manager, Community Program

Rising Star

Jeff,

 

I understand what you explained.

However, what if HDFS rebalances data automatically?

In this scenario, it seems the only option left is INVALIDATE METADATA.