Impala doesn't detect file changes automatically using HDFS Cache

avatar
Explorer

CDH 5.4.5

Impala 2.2.0-cdh5.4.5

 

According to "Using HDFS Caching with Impala (CDH 5.1 or higher only)": "When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached." In practice, however, enabling HDFS caching does not enable automatic detection of file changes. REFRESH has to be called explicitly before the new data becomes visible.

 

Reproduction, following the instructions in "Using HDFS Caching with Impala (CDH 5.1 or higher only)":

  1. Create a cache pool:
    hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576
  2. Add a table to the cache pool, e.g. test_table with location '/tmp/cache_test':
    alter table test_table set cached in 'test_pool';
  3. Check the stats in cacheadmin and the table stats (everything OK):
    hdfs cacheadmin -listPools -stats
    # impala-shell:
    show table stats test_table;
  4. Put a new file into the cached partition, e.g. hdfs dfs -put test.tsv /tmp/cache_test
  5. cacheadmin detects the change and updates FILES_NEEDED and BYTES_NEEDED; after a while BYTES_CACHED and FILES_CACHED are updated as well.
  6. Impala's 'show table stats' still shows the same numbers.
  7. The stats are only updated after calling 'refresh test_table;'.

 

Question: is the documentation correct in stating that Impala can automatically detect file changes when HDFS caching is used?

 

 

So far I have tried:

  • ... using external and internal Impala tables
  • ... caching the entire table or specific partitions
  • ... and waiting for a long time.

 

4 REPLIES

avatar
Contributor

Can someone chime in and say the documentation is wrong please?

 

The documentation clearly says "When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached. "

 

Are we doing it wrong, or are the Cloudera documents false and misleading?

 

Thanks.

avatar

Thanks for the report. The documentation is wrong.

 

A REFRESH is always needed to pick up changes to table data that were made from tools other than Impala.
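
As a rough sketch of what that means for the reproduction above, the upload and the REFRESH can be wrapped in one step so the table metadata is reloaded right after the file lands in HDFS. The function names `run` and `load_and_refresh` and the `DRY_RUN` variable are hypothetical helpers made up for this illustration; only `hdfs dfs -put`, `impala-shell -q` and the `REFRESH` statement are real. It assumes `impala-shell` can reach an impalad with its default connection settings.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: copy a local file into the table location,
# then ask Impala to reload the table's file metadata.
#   $1 = local file, $2 = HDFS table directory, $3 = table name
load_and_refresh() {
  src=$1
  dest=$2
  table=$3
  # REFRESH is only issued if the upload succeeded.
  run hdfs dfs -put "$src" "$dest" &&
    run impala-shell -q "REFRESH $table"
}
```

With the names from the original post, the call would be `load_and_refresh test.tsv /tmp/cache_test test_table`. Setting `DRY_RUN=1` first only prints the two commands, which is a cheap way to check them before touching the cluster.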

 

I'll follow up with our docs team to get this fixed.

avatar
Contributor
Thanks for the update.

avatar
Contributor

Thanks again, and please be aware the incorrect text is also found here:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html

 

"When data is requested to be pinned in memory, that process happens in the background without blocking access to the data while the caching is in progress. Loading the data from disk could take some time. Impala reads each HDFS data block from memory if it has been pinned already, or from disk if it has not been pinned yet. When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached."