Dear Impala Community,
I am benchmarking TPC-H (http://www.tpc.org/tpch/) on an Impala cluster with 6 nodes, each having 250GB main memory, 20 physical CPU cores. The nodes are connected with Infiniband (IPoIB).
We use the TPC-H dataset at a scale factor of 100, such that the total size of all data is roughly 100GB. All data is loaded into parquet files. We compute statistics for each table.
Interestingly, when enabling the HDFS cache and loading the data fully into the HDFS cache, most queries get slightly slower. Due to the high main memory available in the cluster the files would fit into the filesystem cache anyway, but enabling the HDFS cache should not worsen the performance? Especially as I understood that zero copy reads are supposed to speed things up.
What could be the problem of this slight performance drop?
What is your cache replication and default HDFS replication? The default value for HDFS cache I believe is 1 copy.
So wich caching enabled, the data for your queries is retrieved from 1 location - where it resides in cache memory.
With caching disabled, the data for your queries is retrieved from (typically) multiplate locations - where in this case due to your high memory, it is residing in OS cache.
At least per my understanding, the latter of those two will give you more parallelism and result in faster queries despite the small performance loss of OS cache vs. HDFS cache. This is what I saw in my own testing as well.
The real advantage is if you do not have an abundance of memory, or you have other large processes that result in OS cache churn. If your data is constantly being kicked out of OS cache, and the queries are constantly going to disk to re-fetch the data - then having a "reserved" memory chunk via HDFS cache will give you a significant performance increase (despite the reduced # of copies stored in memory).
Again, this is at least per my understanding.
Perhaps re-test after rebooting your nodes, or flushing your OS cache. Then you can at least compare "Nothing in OS cache" vs. "Reserved HDFS cache pool". Alternatively set your HDFS cache replication to match your HDFS replication; although if everything fits in memory and there is no churn the # delta between tests is probably goign to be minimal.
Hope that helps.