I have a question about these values in the impala-server metrics.
I set `max_cached_file_handles` to 10,000:
|max_cached_file_handles (uint64)||Maximum number of HDFS file handles that will be cached. Disabled if set to 0.||0||10000|
However, `impala-server.io.mgr.cached-file-handles-miss-count` still shows cache misses:
|impala-server.io.mgr.cached-file-handles-hit-count||668||Number of cache hits for cached HDFS file handles|
|impala-server.io.mgr.num-cached-file-handles||253||Number of currently cached HDFS file handles in the IO manager.|
|impala-server.io.mgr.cached-file-handles-miss-count||1467||Number of cache misses for cached HDFS file handles|
Can you tell me how to reduce the miss-count?
Cache misses are common when the cache is warming up. When Impala requests a file handle for the first time, it will not be in the cache and Impala needs to open the file handle. It looks like this is what is happening in your case, as the cache is not full. Over time, the ratio of hits to misses will go up as the cache contains more of the file handles that Impala is accessing.
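To put numbers on that, here is a small sketch that computes the hit ratio and how full the cache is, using only the values you posted. The variable names are mine, not Impala's.

```python
# Values copied from the metrics posted in the question.
hits = 668        # impala-server.io.mgr.cached-file-handles-hit-count
misses = 1467     # impala-server.io.mgr.cached-file-handles-miss-count
cached = 253      # impala-server.io.mgr.num-cached-file-handles
capacity = 10000  # max_cached_file_handles

# Fraction of file handle requests that were served from the cache.
hit_ratio = hits / (hits + misses)

# Fraction of the configured capacity that is currently occupied.
fill = cached / capacity

print(f"hit ratio:  {hit_ratio:.1%}")  # hit ratio:  31.3%
print(f"cache fill: {fill:.1%}")       # cache fill: 2.5%
```

A low fill together with a low hit ratio is what a warming cache looks like. If you sample these metrics again later, the hit ratio should be higher.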
A few things to know:
1. If the cache gets full, then the cache will start evicting the least recently used file handles. If a workload then needs a file handle that was evicted, it will cause a miss again. This is not happening in your case, as the cache is not full.
2. Impala will often have multiple file handles open for the same file, because it is accessing the file from multiple places in multiple threads. This means that the cache will need multiple file handles for the same file. So, the initial number of misses as the cache warms up can exceed the number of files that you are accessing.
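Both points can be illustrated with a toy model. This is only a sketch of the behavior described above, not Impala's actual implementation: the class `ToyFileHandleCache`, its methods, and the file names are all made up for illustration, and it only tracks idle handles.

```python
from collections import OrderedDict
from itertools import count


class ToyFileHandleCache:
    """Toy LRU cache of idle file handles (hypothetical, not Impala code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.idle = OrderedDict()  # handle_id -> filename, least recent first
        self.hits = 0
        self.misses = 0
        self._ids = count()

    def acquire(self, filename):
        # Look for an idle handle that belongs to this file.
        found = None
        for handle_id, name in self.idle.items():
            if name == filename:
                found = handle_id
                break
        if found is not None:
            del self.idle[found]
            self.hits += 1
            return found
        # No idle handle for this file: "open" a new one.
        self.misses += 1
        return next(self._ids)

    def release(self, handle_id, filename):
        # Returned handles become the most recently used entries.
        self.idle[handle_id] = filename
        while len(self.idle) > self.capacity:
            self.idle.popitem(last=False)  # evict the least recently used


files = ["a.parq", "b.parq", "c.parq"]

# Point 2: two concurrent readers per file, so each file needs two handles.
cache = ToyFileHandleCache(capacity=10)
for _ in range(2):
    held = [(cache.acquire(f), f) for f in files for _reader in range(2)]
    for handle_id, f in held:
        cache.release(handle_id, f)
print(cache.hits, cache.misses)  # 6 6

# Point 1: a cache that is too small evicts handles and misses again.
small = ToyFileHandleCache(capacity=2)
for f in files + ["a.parq"]:
    small.release(small.acquire(f), f)
print(small.hits, small.misses)  # 0 4
```

In the first run, three files produce six misses during warm-up because each file is held by two readers at once, and the second pass is all hits. In the second run, the handle for `a.parq` is evicted before it is needed again, so the fourth access is a miss too.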
I hope this helps.
Thank you for your comment.
Your comment really helped me confirm and understand how the file handle cache works.
Like you said, over time the hit-count is going up and the miss-count is becoming stable :)
By the way, do you think one cached file handle is used by multiple threads?
A query gets a file handle when it starts processing a file and holds the handle until it is done with the file. Only one thread issues IO on a file handle at a time. When a file handle returns to the cache, it can be picked up by any thread that needs to access that file.
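As a rough sketch of that borrow-and-return pattern (again a toy model with made-up names such as `HandlePool`, not Impala's code): a handle leaves the pool while a thread is using it, so no two threads can hold the same handle at once.

```python
import threading
from collections import defaultdict


class HandlePool:
    """Hypothetical pool: a handle is either idle or held by one thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = defaultdict(list)  # filename -> list of idle handle ids
        self._next_id = 0
        self.opened = 0  # how many handles had to be opened

    def borrow(self, filename):
        with self._lock:
            if self._idle[filename]:
                return self._idle[filename].pop()  # reuse an idle handle
            self._next_id += 1
            self.opened += 1
            return self._next_id  # "open" a new handle

    def give_back(self, filename, handle):
        with self._lock:
            self._idle[filename].append(handle)


pool = HandlePool()
in_use = set()
in_use_lock = threading.Lock()
overlaps = 0  # times a handle was seen in use by two threads at once


def scan(filename):
    global overlaps
    handle = pool.borrow(filename)
    with in_use_lock:
        if handle in in_use:
            overlaps += 1
        in_use.add(handle)
    # ... the thread would issue its reads on the handle here ...
    with in_use_lock:
        in_use.discard(handle)
    pool.give_back(filename, handle)


threads = [threading.Thread(target=scan, args=("data.parq",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(overlaps)  # 0
```

How many handles get opened (`pool.opened`) depends on how much the threads overlap in time, which is why the same file can end up with several cached handles.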
Ah, I see.
So, at any given time, a file handle is used by only one thread.
That means a file handle is not used by multiple threads at the same time.
I'm really wondering how you know this so well :)
Thank you very much