Created on 11-11-2015 11:22 AM - edited 09-16-2022 02:49 AM
CDH 5.4.5
Impala 2.2.0-cdh5.4.5
According to Using HDFS Caching with Impala (CDH 5.1 or higher only): "When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached." In practice, however, enabling HDFS caching does not enable automatic detection of file changes. REFRESH has to be run explicitly before the new data becomes visible.
Reproduction following the instructions on Using HDFS Caching with Impala (CDH 5.1 or higher only):
hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576
# impala-shell: alter table test_table set cached in 'test_pool';
hdfs cacheadmin -listPools -stats
# impala-shell: show table stats test_table;
Question: is the documentation correct that Impala can automatically detect file changes when HDFS caching is used?
I have tried so far:
Created 10-10-2016 09:40 AM
Created 10-07-2016 03:43 PM
Can someone chime in and say the documentation is wrong please?
The documentation clearly says "When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached."
Are we doing it wrong, or are the Cloudera documents false and misleading?
Thanks.
Created 10-07-2016 09:28 PM
Thanks for the report. The documentation is wrong.
A REFRESH is always needed to pick up changes to table data that were made from tools other than Impala.
I'll follow up with our docs team to get this fixed.
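For anyone loading files from outside Impala, the workflow implied by the answer above might look roughly like the sketch below. It is only an illustration: the helper names `run` and `load_and_refresh`, the `DRY_RUN` switch, the file name, and the warehouse path are all hypothetical, and only `test_table` comes from this thread. The sketch assumes `hdfs` and `impala-shell` are on the PATH and that `impala-shell` can reach an impalad with its default connection settings.

```shell
#!/usr/bin/env bash
# Sketch: add a file to a table's HDFS directory, then make Impala see it.
# The helper names, DRY_RUN switch, and paths are hypothetical.

# Print the command instead of running it when DRY_RUN=1, so the sequence
# can be inspected on a machine without a cluster.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

load_and_refresh() {
  local file="$1" table_dir="$2" table="$3"
  # Step 1: add the file outside Impala. Impala does not notice this by itself,
  # whether or not the table is cached.
  run hdfs dfs -put "$file" "$table_dir/"
  # Step 2: explicitly ask Impala to reload the table's file metadata.
  run impala-shell -q "REFRESH $table"
}
```

A call such as `load_and_refresh new_data.csv /user/hive/warehouse/test_table test_table` would then upload the file and issue the REFRESH in one step. Setting `DRY_RUN=1` first prints the two commands without running them (the dry-run output drops the quotes around the query).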
Created 01-26-2017 02:13 PM
Thanks again, and please be aware the incorrect text is also found here:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html
"When data is requested to be pinned in memory, that process happens in the background without blocking access to the data while the caching is in progress. Loading the data from disk could take some time. Impala reads each HDFS data block from memory if it has been pinned already, or from disk if it has not been pinned yet. When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached."