Impala doesn't detect file changes automatically using HDFS Cache

CDH 5.4.5

Impala 2.2.0-cdh5.4.5

 

According to Using HDFS Caching with Impala (CDH 5.1 or higher only): "When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached." In practice, however, enabling HDFS caching does not enable automatic detection of file changes. REFRESH has to be called explicitly before the new data becomes visible.

 

Steps to reproduce, following the instructions in Using HDFS Caching with Impala (CDH 5.1 or higher only):

  1. Create a cache pool:
    hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576
  2. Add the table to the cache pool, e.g. test_table with location '/tmp/cache_test':
    alter table test_table set cached in 'test_pool';
  3. Check the stats in cacheadmin and the table stats (everything is OK):
    hdfs cacheadmin -listPools -stats
    # impala-shell:
    show table stats test_table;
  4. Put a new file into the cached partition, e.g. hdfs dfs -put test.tsv /tmp/cache_test
  5. cacheadmin now detects the change and updates FILES_NEEDED and BYTES_NEEDED; after a while BYTES_CACHED and FILES_CACHED are updated as well.
  6. However, Impala's 'show table stats' still reports the same numbers.
  7. The stats are only updated after calling 'refresh test_table;'.
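
For convenience, the steps above can be collected into one script. This is only a rough sketch under a few assumptions that are not part of the original report: the helper names run and repro_steps and the DRY_RUN switch are made up for illustration, each statement is passed to Impala non-interactively with 'impala-shell -q', test_table already exists with location '/tmp/cache_test', and test.tsv is in the current directory. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps above. Assumptions (not from the original
# post): the helper names run/repro_steps, the DRY_RUN switch, and the use of
# 'impala-shell -q' to submit each statement non-interactively.

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

repro_steps() {
  # Steps 1-3: create the pool, cache the table, record the baseline stats.
  run hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576
  run impala-shell -q "alter table test_table set cached in 'test_pool'"
  run hdfs cacheadmin -listPools -stats
  run impala-shell -q "show table stats test_table"

  # Steps 4-6: add a file, then compare the HDFS view with the Impala view.
  run hdfs dfs -put test.tsv /tmp/cache_test
  run hdfs cacheadmin -listPools -stats
  run impala-shell -q "show table stats test_table"

  # Step 7: only after an explicit REFRESH do the Impala stats change.
  run impala-shell -q "refresh test_table"
  run impala-shell -q "show table stats test_table"
}

repro_steps
```

To run it for real, set DRY_RUN=0 on a node where both hdfs and impala-shell are available. Note that the dry-run output joins the arguments with spaces, so the quotes around the SQL statements are not shown.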

 

Question: Is the statement in the documentation correct that Impala can automatically detect file changes when HDFS caching is used?

 

 

I have tried so far:

  • ... with external and internal Impala tables
  • ... caching the entire table or specific partitions
  • ... and waiting for a long time.

 
