11-11-2015
11:22 AM
CDH 5.4.5
Impala 2.2.0-cdh5.4.5

According to "Using HDFS Caching with Impala (CDH 5.1 or higher only)":

"When files are added to a table or partition whose contents are cached, Impala automatically detects those changes and performs a REFRESH automatically once the relevant data is cached."

In reality, enabling HDFS caching does not enable automatic detection of file changes. REFRESH has to be called explicitly to see the new data.

Reproduction, following the instructions in "Using HDFS Caching with Impala (CDH 5.1 or higher only)":

1. Create a cache pool:
   hdfs cacheadmin -addPool test_pool -owner impala -limit 1048576

2. Add the table to the cache pool, e.g. test_table with location '/tmp/cache_test':
   alter table test_table set cached in 'test_pool';

3. Check the stats in cacheadmin and the table stats (everything OK):
   hdfs cacheadmin -listPools -stats
   # impala-shell:
   show table stats test_table;

4. Put a new file into the cached partition, e.g.:
   hdfs dfs -put test.tsv /tmp/cache_test

Result: cacheadmin detects the change and updates FILES_NEEDED and BYTES_NEEDED, and after a while BYTES_CACHED and FILES_CACHED are updated as well. Impala's 'show table stats', however, still shows the same numbers. The stats are only updated after calling 'refresh test_table;'.

Question: is the statement in the documentation correct that Impala can automatically detect file changes if HDFS caching is used?

So far I have tried:
- external and internal Impala tables
- caching the entire table or only specific partitions
and waited for a long time in each case.
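For reference, the explicit-REFRESH behaviour described above can be sketched as a small shell wrapper. This is only a rough sketch of the workaround, not a fix: the function names put_and_refresh and run and the DRY_RUN switch are made up for illustration, and it assumes hdfs and impala-shell are on the PATH and that impala-shell connects to the right impalad by default.

```shell
#!/bin/sh
# Sketch: upload a file into a cached table location, then refresh explicitly.
# Set DRY_RUN=1 to print the commands instead of executing them.

# run: execute a command, or just print it when DRY_RUN=1 (hypothetical helper)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# put_and_refresh <local_file> <hdfs_dir> <table>   (hypothetical helper)
put_and_refresh() {
    local_file=$1
    hdfs_dir=$2
    table=$3

    # Copy the new file into the table/partition directory in HDFS.
    run hdfs dfs -put "$local_file" "$hdfs_dir" || return 1

    # In my tests Impala did not notice the new file on its own,
    # so the metadata is reloaded explicitly.
    run impala-shell -q "refresh $table" || return 1
}
```

With the values from the reproduction it would be called as: put_and_refresh test.tsv /tmp/cache_test test_table. Setting DRY_RUN=1 beforehand only prints the two commands, which is handy for checking the arguments first.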