Impala doesn't detect file changes automatically using HDFS Cache

CDH 5.4.5

Impala 2.2.0-cdh5.4.5

 

According to Using HDFS Caching with Impala (CDH 5.1 or higher only): "When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached." In practice, however, enabling HDFS caching does not make Impala detect file changes automatically: REFRESH still has to be issued explicitly before the new data becomes visible.

 

Reproduction following the instructions on Using HDFS Caching with Impala (CDH 5.1 or higher only):

  1. create a cache pool:
    hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576
  2. add the table to the cache pool, e.g. test_table with location '/tmp/cache_test':
    alter table test_table set cached in 'test_pool';
  3. check stats in cacheadmin and table stats: (everything ok)
    hdfs cacheadmin -listPools -stats
    # impala-shell:
    show table stats test_table;
  4. put a new file into the cached partition: e.g. hdfs dfs -put test.tsv /tmp/cache_test
  5. cacheadmin now detects the change and updates FILES_NEEDED and BYTES_NEEDED; after a while BYTES_CACHED and FILES_CACHED are updated as well
  6. but Impala's 'show table stats' still reports the same numbers
  7. the stats are only updated after calling 'refresh test_table;'
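
For anyone who wants to re-run the steps above, here is a minimal sketch of them as a single script. The pool, table, path and file names are the ones from this post; the run helper, the repro function and the DRY_RUN variable are hypothetical conveniences, not part of any Cloudera tooling. The script assumes test_table already exists with location '/tmp/cache_test' and that impala-shell can reach an impalad without extra connection options. By default it only prints the commands.

```shell
#!/bin/sh
# Reproduction sketch. DRY_RUN=1 (the default) only prints each command;
# set DRY_RUN=0 on a cluster node to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

repro() {
    # 1. create a cache pool
    run hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576
    # 2. add the table to the cache pool
    run impala-shell -q "alter table test_table set cached in 'test_pool'"
    # 3. baseline: pool stats and table stats
    run hdfs cacheadmin -listPools -stats
    run impala-shell -q "show table stats test_table"
    # 4. put a new file into the cached location
    run hdfs dfs -put test.tsv /tmp/cache_test
    # 5. cacheadmin picks up the new file (may take a while)
    run hdfs cacheadmin -listPools -stats
    # 6. Impala table stats: observed to be unchanged at this point
    run impala-shell -q "show table stats test_table"
    # 7. only after an explicit REFRESH do the stats change
    run impala-shell -q "refresh test_table"
    run impala-shell -q "show table stats test_table"
}

repro
```

Comparing the output of the second and third 'show table stats' calls shows whether the new file became visible before or only after the explicit REFRESH.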

 

Question: Is the statement in the documentation correct that Impala automatically detects file changes when HDFS caching is used?

 

 

I have tried so far:

  • ... with external and internal Impala tables
  • ... caching the entire table or specific partitions
  • ... and waiting for a long time.

 
