<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Some nodes are way slower on HDFS scan than the other ones during impala SQL query in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Some-nodes-are-way-slower-on-HDFS-scan-then-the-other-ones/m-p/307885#M223383</link>
    <description>&lt;P&gt;One difference is how fast it's reading from disk, i.e. &lt;SPAN class="md-plain"&gt;TotalRawHdfsReadTime&lt;/SPAN&gt;. In CDH 5.12 that includes both the time spent fetching metadata from the HDFS namenode and the time spent actually reading the data off disk. If it's only slow on one node, that probably rules out HDFS namenode slowness, which is a common cause, so it's probably actually slower doing the I/O. Note: in CDH 5.15 we split the namenode RPC time out into &lt;SPAN class="md-plain"&gt;TotalRawHdfsOpenTime&lt;/SPAN&gt; to make it easier to debug things like this.&lt;BR /&gt;&lt;BR /&gt;I don't know exactly why I/O would be slower on that one node; it might require inspecting the host to see what's happening and whether there's more CPU or I/O load on that host.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We've seen that happen when one node is more heavily loaded than the others because of some kind of uneven data distribution, e.g. one file is very frequently accessed, perhaps a dimension table that is referenced in many queries.
That can sometimes be addressed by setting&amp;nbsp;SCHEDULE_RANDOM_REPLICA as a query hint or query option (see &lt;A href="https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_hints.html" target="_blank" rel="noopener"&gt;https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_hints.html&lt;/A&gt; and &lt;A href="https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_schedule_random_replica.html" target="_blank" rel="noopener"&gt;https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_schedule_random_replica.html&lt;/A&gt;), or even by enabling HDFS caching for the problematic table (HDFS caching spreads load across all cached replicas).&lt;BR /&gt;&lt;BR /&gt;Another possible cause, based on that profile, is that the query is competing for scanner threads with other queries running on the same node: &lt;SPAN class="md-plain"&gt;AverageScannerThreadConcurrency&lt;/SPAN&gt; is lower in the slow case. That can either be because other concurrent queries grabbed scanner threads first (there's a global soft limit of 3x the number of CPUs per node) or because&lt;/P&gt;</description>
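    <!-- Editor's note: a minimal Impala SQL sketch of the remedies described above. The table
         name dim_table and cache pool name hot_pool are hypothetical placeholders. -->

```sql
-- As a query option, applied to every query in the session:
SET SCHEDULE_RANDOM_REPLICA=true;

-- Or as a per-table plan hint, so scan ranges for this table are assigned
-- to a random replica instead of piling onto the same host each time
-- (dim_table is a hypothetical hot dimension table):
SELECT COUNT(*)
FROM dim_table /* +SCHEDULE_RANDOM_REPLICA */;

-- HDFS caching for a hot table spreads reads across all cached replicas
-- (hot_pool is a hypothetical cache pool created with hdfs cacheadmin):
ALTER TABLE dim_table SET CACHED IN 'hot_pool' WITH REPLICATION = 3;
```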
    <pubDate>Wed, 16 Dec 2020 21:47:25 GMT</pubDate>
    <dc:creator>Tim Armstrong</dc:creator>
    <dc:date>2020-12-16T21:47:25Z</dc:date>
  </channel>
</rss>

