New Contributor
Posts: 5
Registered: 11-11-2015
Accepted Solution

Impala doesn't detect file changes automatically using HDFS Cache

CDH 5.4.5

Impala 2.2.0-cdh5.4.5

 

According to "Using HDFS Caching with Impala (CDH 5.1 or higher only)": "When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached." In practice, however, enabling HDFS caching does not enable automatic detection of file changes. REFRESH still has to be issued explicitly before the new data becomes visible.

Reproduction, following the instructions in "Using HDFS Caching with Impala (CDH 5.1 or higher only)":

  1. Create a cache pool:
    hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576
  2. Add the table to the cache pool (e.g. test_table with location '/tmp/cache_test'):
    alter table test_table set cached in 'test_pool';
  3. Check the stats in cacheadmin and the table stats (everything is OK at this point):
    hdfs cacheadmin -listPools -stats
    # impala-shell:
    show table stats test_table;
  4. Put a new file into the cached location:
    hdfs dfs -put test.tsv /tmp/cache_test
  5. cacheadmin detects the change and updates FILES_NEEDED and BYTES_NEEDED; after a while BYTES_CACHED and FILES_CACHED are updated as well.
  6. Impala's 'show table stats' still reports the same numbers as before.
  7. The stats are only updated after calling 'refresh test_table;'.
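
For anyone who wants to rerun this, the steps above can be collected into one script. This is only a sketch: the pool, table, file, and path names are the ones from the steps, while the `run` helper, the `reproduce` function, and the `DRY_RUN` switch are made up for illustration and are not part of any Cloudera tooling. It assumes `hdfs` and `impala-shell` are on the PATH and that impala-shell can reach an impalad without extra connection flags. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical wrapper around the reproduction steps.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Printed form loses the shell quoting around the SQL text.
        echo "$*"
    else
        "$@"
    fi
}

reproduce() {
    # Steps 1-3: create the pool, cache the table, record the baseline.
    run hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576
    run impala-shell -q "alter table test_table set cached in 'test_pool'"
    run hdfs cacheadmin -listPools -stats
    run impala-shell -q "show table stats test_table"

    # Steps 4-6: add a file outside Impala and look at both sides again.
    run hdfs dfs -put test.tsv /tmp/cache_test
    run hdfs cacheadmin -listPools -stats
    run impala-shell -q "show table stats test_table"

    # Step 7: only after an explicit REFRESH do the table stats change.
    run impala-shell -q "refresh test_table"
    run impala-shell -q "show table stats test_table"
}
```

Calling `reproduce` with `DRY_RUN=0` executes the commands for real. The script does not wait for the cache to fill between the upload and the second stats check, so a pause may need to be added there by hand.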

 

Question: Is the statement in the documentation correct that Impala can automatically detect file changes when HDFS caching is used?


I have tried so far:

  • ... using both external and internal Impala tables
  • ... caching the entire table or only specific partitions
  • ... waiting for a long time after adding the file


Explorer
Posts: 16
Registered: 11-13-2014

Re: Impala doesn't detect file changes automatically using HDFS Cache

Can someone chime in and say the documentation is wrong please?

 

The documentation clearly says: "When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached."


Are we doing it wrong, or are the Cloudera documents false and misleading?

 

Thanks.

Cloudera Employee
Posts: 307
Registered: 10-16-2013

Re: Impala doesn't detect file changes automatically using HDFS Cache

Thanks for the report. The documentation is wrong.

 

A REFRESH is always needed to pick up changes to table data that were made from tools other than Impala.
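
In a load script, that usually means pairing the HDFS upload with an explicit REFRESH. The following is a minimal sketch, not an official recipe: `put_and_refresh`, `run`, and `DRY_RUN` are hypothetical names, and it assumes `hdfs` and `impala-shell` are on the PATH with impala-shell able to connect using its default settings.

```shell
#!/bin/sh
# Hypothetical helper: upload a file, then tell Impala to reload the table's file list.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Printed form loses the shell quoting around the SQL text.
        echo "$*"
    else
        "$@"
    fi
}

# Usage: put_and_refresh LOCAL_FILE HDFS_DIR TABLE
put_and_refresh() {
    if [ "$#" -ne 3 ]; then
        echo "usage: put_and_refresh LOCAL_FILE HDFS_DIR TABLE" >&2
        return 2
    fi
    # Stop if the upload fails so REFRESH is not issued for nothing.
    run hdfs dfs -put "$1" "$2" || return 1
    # HDFS caching does not trigger this step; it has to be requested explicitly.
    run impala-shell -q "refresh $3"
}
```

With the names from the original post, the call would be `put_and_refresh test.tsv /tmp/cache_test test_table`, run with `DRY_RUN=0` to actually execute it.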

 

I'll follow up with our docs team to get this fixed.

Explorer
Posts: 16
Registered: 11-13-2014

Re: Impala doesn't detect file changes automatically using HDFS Cache

Thanks for the update.
Explorer
Posts: 16
Registered: 11-13-2014

Re: Impala doesn't detect file changes automatically using HDFS Cache

Thanks again, and please be aware the incorrect text is also found here:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html

 

"When data is requested to be pinned in memory, that process happens in the background without blocking access to the data while the caching is in progress. Loading the data from disk could take some time. Impala reads each HDFS data block from memory if it has been pinned already, or from disk if it has not been pinned yet. When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached."