Created on 12-16-2020 08:37 AM - edited 12-16-2020 08:42 AM
Hello guys, we are experiencing a slow HDFS scan issue, i.e. a query runs on 17 nodes and in the query profile some nodes are much slower than the others.
Switching off the slowest node just causes another node to become the slowest, whereas it used to be average.
We are using CDH 5.10.
Here are parts of the query profile with problematic nodes. I've taken the slowest and the fastest one:
| Counter | THE SLOWEST NODE | THE FASTEST NODE |
| --- | --- | --- |
| Hdfs Read Thread Concurrency Bucket | 0:56.2% 1:36.5% 2:7.299% 3:0% 4:0% 5:0% 6:0% 7:0% | 0:86.84% 1:13.16% 2:0% 3:0% 4:0% 5:0% 6:0% 7:0% |
| File Formats | PARQUET/SNAPPY:4025 | PARQUET/SNAPPY:4209 |
| AverageHdfsReadThreadConcurrency | 0.51 | 0.13 |
| AverageScannerThreadConcurrency | 2.72 | 5.92 |
| BytesRead | 2.7 GiB | 2.7 GiB |
| BytesReadDataNodeCache | 0 B | 0 B |
| BytesReadLocal | 2.7 GiB | 2.7 GiB |
| BytesReadRemoteUnexpected | 0 B | 0 B |
| BytesReadShortCircuit | 2.7 GiB | 2.7 GiB |
| DecompressionTime | 4.25s | 3.50s |
| InactiveTotalTime | 0ns | 0ns |
| MaxCompressedTextFileLength | 0 B | 0 B |
| NumColumns | 23 | 23 |
| NumDisksAccessed | 4 | 4 |
| NumRowGroups | 175 | 183 |
| NumScannerThreadsStarted | 3 | 6 |
| PeakMemoryUsage | 133.9 MiB | 329.9 MiB |
| PerReadThreadRawHdfsThroughput | 65.6 MiB/s | 529.1 MiB/s |
| RemoteScanRanges | 0 | 0 |
| RowBatchQueueGetWaitTime | 1.1m | 14.64s |
| RowBatchQueuePutWaitTime | 0ns | 0ns |
| RowsRead | 192,701,761 | 192,490,029 |
| RowsReturned | 18,748 | 21,148 |
| RowsReturnedRate | 294 per second | 1437 per second |
| ScanRangesComplete | 175 | 183 |
| ScannerThreadsInvoluntaryContextSwitches | 24,401 | 7,158 |
| ScannerThreadsTotalWallClockTime | 3.1m | 1.9m |
| MaterializeTupleTime | 1.9m | 1.5m |
| ScannerThreadsSysTime | 17.81s | 13.94s |
| ScannerThreadsUserTime | 1.6m | 1.2m |
| ScannerThreadsVoluntaryContextSwitches | 19,141 | 47,621 |
| TotalRawHdfsReadTime | 41.81s | 5.15s |
| TotalReadThroughput | 39.7 MiB/s | 143.4 MiB/s |
| TotalTime | 1.1m | 14.71s |
The other nodes averaged about 20 seconds, while the slowest one was far behind every other node.
As you can see, BytesReadLocal is the same on both nodes, and it is the same on the rest of the nodes as well. But PerReadThreadRawHdfsThroughput, RowBatchQueueGetWaitTime, RowsReturnedRate and ScannerThreadsInvoluntaryContextSwitches differ a lot.
The strange part is that the same node can be fast in terms of query throughput until we stop impalad on the node that is currently the slowest.
Then, with almost the same amount of data scanned (removing a single node only changes BytesReadLocal from e.g. 2.5 GiB to 2.7 GiB), the speed degrades dramatically, to 2 or 3 times worse.
Does anyone have an idea what could be wrong?
Thanks in advance!
Created 12-16-2020 01:47 PM
One difference is how fast it's reading from disk, i.e. TotalRawHdfsReadTime. In CDH5.12 that includes both the time spent fetching metadata from the HDFS namenode and the time spent actually reading the data off disk. If it's only slow on one node, that probably rules out HDFS namenode slowness, which is a common cause. So it's probably actually slower doing the I/O. Note: in CDH5.15 we split out the namenode RPC time into TotalRawHdfsOpenTime to make it easier to debug things like this.
I don't know exactly why I/O would be slower on that one node; it might require inspecting the host to see what's happening and whether there's more CPU or I/O load on it.
We've seen that happen when a node is more heavily loaded than the other nodes because of some kind of uneven data distribution, e.g. one file being very frequently accessed, perhaps because a dimension table is referenced in many queries. That can sometimes be addressed by setting SCHEDULE_RANDOM_REPLICA as a query hint or query option, as sketched below: https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_hints.html or https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_schedule_random_replica.html...
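For example, a minimal sketch of the query-option form in impala-shell (the table name is just illustrative; the hint form is described in the linked docs):

```sql
-- Enable random replica selection for this session, so scan ranges for hot
-- files are spread across all replicas instead of always landing on the same host.
SET SCHEDULE_RANDOM_REPLICA=true;

-- Hypothetical frequently-accessed dimension table
SELECT COUNT(*) FROM dim_customer;
```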
Or even by enabling HDFS caching for the problematic table (HDFS caching spreads load across all cached replicas).
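A rough sketch of the caching route, assuming an HDFS cache pool has already been created by an admin (pool and table names are illustrative):

```sql
-- Cache the hot table in an existing HDFS cache pool; a replication factor > 1
-- means several nodes hold a cached copy, so reads are spread across them.
ALTER TABLE dim_customer SET CACHED IN 'hot_tables_pool' WITH REPLICATION = 3;

-- To back it out later:
ALTER TABLE dim_customer SET UNCACHED;
```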
Another possible cause, based on that profile, is that it's competing for scanner threads with other queries running on the same node - AverageScannerThreadConcurrency is lower in the slow case. This can either be because other concurrent queries grabbed scanner threads first (there's a global soft limit of 3x # cpus per node) or because
Created 12-17-2020 06:36 AM
hi Tim,
thanks for your response.
Was it truncated by chance?
Created 12-17-2020 01:09 PM
"... or because memory is running low or because the extra scanner threads are not needed to keep up with the consumer from the scan node." is how I should've finished that.
Not sure what happened there, I did hit the main points but truncated the last sentence.
FWIW, that part of Impala has had a lot of improvements for performance and observability (i.e. more info in the profile) since CDH5.10, so I'd guess that on a later version this wouldn't be a problem, or at least would be easier to debug.
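If scanner-thread competition does turn out to be the cause, one knob that exists even on that version (not something discussed above, just a suggestion) is the NUM_SCANNER_THREADS query option, which caps how many scanner threads a single query uses per node so heavy queries interfere less with each other. A hedged sketch, with an illustrative value:

```sql
-- Limit queries in this session to at most 4 scanner threads per node.
-- The default of 0 lets Impala choose the number automatically.
SET NUM_SCANNER_THREADS=4;
```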
Created 12-18-2020 06:26 AM
Thanks for coming back, Tim!
Created 12-18-2020 06:33 AM
We have restarted nearly every component of the affected HDFS cluster and Impala performance has improved. Sadly, that doesn't explain the underlying issue.