Created on 12-16-2020 08:37 AM - edited 12-16-2020 08:42 AM
Hello guys, we are experiencing a slow HDFS scan issue, i.e. a query runs on 17 nodes and in the query profile some nodes are much slower than the others.
Switching off the slowest node just causes another node to become the slowest, whereas it used to be average.
We are using CDH 5.10.
Here are parts of the query profile with problematic nodes. I've taken the slowest and the fastest one:
| Counter | THE SLOWEST NODE | THE FASTEST NODE |
| --- | --- | --- |
| Hdfs Read Thread Concurrency Bucket | 0:56.2% 1:36.5% 2:7.299% 3:0% 4:0% 5:0% 6:0% 7:0% | 0:86.84% 1:13.16% 2:0% 3:0% 4:0% 5:0% 6:0% 7:0% |
| File Formats | PARQUET/SNAPPY:4025 | PARQUET/SNAPPY:4209 |
| AverageHdfsReadThreadConcurrency | 0.51 | 0.13 |
| AverageScannerThreadConcurrency | 2.72 | 5.92 |
| BytesRead | 2.7 GiB | 2.7 GiB |
| BytesReadDataNodeCache | 0 B | 0 B |
| BytesReadLocal | 2.7 GiB | 2.7 GiB |
| BytesReadRemoteUnexpected | 0 B | 0 B |
| BytesReadShortCircuit | 2.7 GiB | 2.7 GiB |
| DecompressionTime | 4.25s | 3.50s |
| InactiveTotalTime | 0ns | 0ns |
| MaxCompressedTextFileLength | 0 B | 0 B |
| NumColumns | 23 | 23 |
| NumDisksAccessed | 4 | 4 |
| NumRowGroups | 175 | 183 |
| NumScannerThreadsStarted | 3 | 6 |
| PeakMemoryUsage | 133.9 MiB | 329.9 MiB |
| PerReadThreadRawHdfsThroughput | 65.6 MiB/s | 529.1 MiB/s |
| RemoteScanRanges | 0 | 0 |
| RowBatchQueueGetWaitTime | 1.1m | 14.64s |
| RowBatchQueuePutWaitTime | 0ns | 0ns |
| RowsRead | 192,701,761 | 192,490,029 |
| RowsReturned | 18,748 | 21,148 |
| RowsReturnedRate | 294 per second | 1437 per second |
| ScanRangesComplete | 175 | 183 |
| ScannerThreadsInvoluntaryContextSwitches | 24,401 | 7,158 |
| ScannerThreadsTotalWallClockTime | 3.1m | 1.9m |
| MaterializeTupleTime | 1.9m | 1.5m |
| ScannerThreadsSysTime | 17.81s | 13.94s |
| ScannerThreadsUserTime | 1.6m | 1.2m |
| ScannerThreadsVoluntaryContextSwitches | 19,141 | 47,621 |
| TotalRawHdfsReadTime | 41.81s | 5.15s |
| TotalReadThroughput | 39.7 MiB/s | 143.4 MiB/s |
| TotalTime | 1.1m | 14.71s |
The other nodes averaged about 20 seconds, while the slowest one was far behind every other node.
As you can see, BytesReadLocal is the same on both nodes, and it is the same on the rest of the nodes as well. But PerReadThreadRawHdfsThroughput, RowBatchQueueGetWaitTime, RowsReturnedRate and ScannerThreadsInvoluntaryContextSwitches differ a lot.
The strange part is that the same node can be fast in terms of query throughput until we stop impalad on the node that is currently the slowest.
Then, with almost the same amount of data scanned (removing a single node only changes BytesReadLocal from e.g. 2.5 GiB to 2.7 GiB), the speed degrades dramatically, to 2 or 3 times worse.
Does anyone have an idea what could be wrong?
Thanks in advance!
Created 12-16-2020 01:47 PM
One difference is how fast it's reading from disk, i.e. TotalRawHdfsReadTime. In CDH5.12 that includes both the time spent fetching metadata from the HDFS namenode and the time spent actually reading the data off disk. If it's only slow on one node, that probably rules out HDFS namenode slowness, which is a common cause. So it's probably actually slower doing the I/O. Note: in CDH5.15 we split out the namenode RPC time into TotalRawHdfsOpenTime to make it easier to debug things like this.
I don't know exactly why I/O would be slower on that one node; it might require inspecting the host to see what's happening and whether there's more CPU or I/O load on it.
We've seen that happen when a node is more heavily loaded than the other nodes because of some kind of uneven data distribution, e.g. one file being very frequently accessed, perhaps because a dimension table is referenced in many queries. That can sometimes be addressed by setting SCHEDULE_RANDOM_REPLICA as a query hint or query option, as sketched below: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_hints.html or https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_schedule_random_replica.html...
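For example, a minimal sketch of the query-option form in impala-shell (the table name is just illustrative; the hint form is described in the linked docs):

```sql
-- Enable random replica selection for this session, so scan ranges for hot
-- files are spread across all replicas instead of always landing on the same host.
SET SCHEDULE_RANDOM_REPLICA=true;

-- Hypothetical frequently-accessed dimension table
SELECT COUNT(*) FROM dim_customer;
```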
Or even by enabling HDFS caching for the problematic table (HDFS caching spreads load across all cached replicas).
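A rough sketch of the caching route, assuming an HDFS cache pool has already been created by an admin (pool and table names are illustrative):

```sql
-- Cache the hot table in an existing HDFS cache pool; a replication factor > 1
-- means several nodes hold a cached copy, so reads are spread across them.
ALTER TABLE dim_customer SET CACHED IN 'hot_tables_pool' WITH REPLICATION = 3;

-- To back it out later:
ALTER TABLE dim_customer SET UNCACHED;
```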
Another possible cause, based on that profile, is that it's competing for scanner threads with other queries running on the same node - AverageScannerThreadConcurrency is lower in the slow case. This can either be because other concurrent queries grabbed scanner threads first (there's a global soft limit of 3x # cpus per node) or because
Created 12-17-2020 06:36 AM
hi Tim,
thanks for your response.
Was it truncated by chance?
Created 12-17-2020 01:09 PM
"... or because memory is running low or because the extra scanner threads are not needed to keep up with the consumer from the scan node." is how I should've finished that.
Not sure what happened there, I did hit the main points but truncated the last sentence.
FWIW, that part of Impala has had a lot of improvements for performance and observability (i.e. more info in the profile) since CDH5.10, so I'd guess that on a later version this wouldn't be a problem, or at least would be easier to debug.
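If scanner-thread competition does turn out to be the cause, one knob that exists even on that version (not something discussed above, just a suggestion) is the NUM_SCANNER_THREADS query option, which caps how many scanner threads a single query uses per node so heavy queries interfere less with each other. A hedged sketch, with an illustrative value:

```sql
-- Limit queries in this session to at most 4 scanner threads per node.
-- The default of 0 lets Impala choose the number automatically.
SET NUM_SCANNER_THREADS=4;
```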
Created 12-18-2020 06:26 AM
Thanks for coming back, Tim!
Created 12-18-2020 06:33 AM
We have restarted nearly every component of the affected HDFS cluster and Impala performance has improved. Sadly, that doesn't explain the underlying issue.