Created 09-29-2022 07:36 AM
Most of the same Impala queries respond within 20 seconds. But sometimes it takes more than 60 seconds.
I see a hotspot in 'SCAN HDFS' of 'ExecSummary' in slow query profile. (avgTime 6s462ms, maxTime 54s766ms )
low response time Query Info:
- Duration : 13.1s
- HDFS Scanner Average Read Throughput: 382.2MiB/s
- Threads: Network Receive Wait Time: 10.0m
high response time Query Info:
- Duration : 1m, 4s
- HDFS Scanner Average Read Throughput : 988.1 MiB/s
- Threads: Network Receive Wait Time: 59.5m
If HDFS SCAN is the cause of the slow response time, is the 'HDFS Scanner Average Read Throughput' value higher in the slow query?
And one more question.
I compared the fast and slow operators of the HDFS_SCAN_NODE stage in the slow query profile.
RowBatchQueueGetWaitTime(54.76s) and ScannerThreadsTotalWallClockTime(54.77s) I wonder why the two metric values are high.
HDFS_SCAN_NODE (id=0)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 1:1/4.72 MB
ExecOption: PARQUET Codegen Enabled, Codegen enabled: 1 out of 1
Hdfs Read Thread Concurrency Bucket: 0:100% 1:0% 2:0% 3:0% 4:0% 5:0% 6:0% 7:0% 8:0% 9:0% 10:0%
File Formats: PARQUET/SNAPPY:1
- AverageHdfsReadThreadConcurrency: 0.00 (0.0)
- AverageScannerThreadConcurrency: 1.00 (1.0)
- BytesRead: 959.6 KiB (982621)
- BytesReadDataNodeCache: 0 B (0)
- BytesReadLocal: 959.6 KiB (982621)
- BytesReadRemoteUnexpected: 0 B (0)
- BytesReadShortCircuit: 959.6 KiB (982621)
- CachedFileHandlesHitCount: 0 (0)
- CachedFileHandlesMissCount: 3 (3)
- CollectionItemsRead: 0 (0)
- DecompressionTime: 2ms (2131729)
- InactiveTotalTime: 0ns (0)
- MaxCompressedTextFileLength: 0 B (0)
- NumColumns: 2 (2)
- NumDictFilteredRowGroups: 0 (0)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 1 (1)
- NumScannerThreadsStarted: 1 (1)
- NumScannersWithNoReads: 0 (0)
- NumStatsFilteredRowGroups: 0 (0)
- PeakMemoryUsage: 3.3 MiB (3430040)
- PerReadThreadRawHdfsThroughput: 1,011.8 MiB/s (1060904048)
- RemoteScanRanges: 0 (0)
- RowBatchQueueGetWaitTime: 54.76s (54763817619)
- RowBatchQueuePutWaitTime: 0ns (0)
- RowsRead: 52,319 (52319)
- RowsReturned: 52,319 (52319)
- RowsReturnedRate: 955 per second (955)
- ScanRangesComplete: 1 (1)
- ScannerThreadsInvoluntaryContextSwitches: 1 (1)
- ScannerThreadsTotalWallClockTime: 54.77s (54768078433)
- MaterializeTupleTime(*): 1ms (1634590)
- ScannerThreadsSysTime: 1ms (1897000)
- ScannerThreadsUserTime: 9ms (9870000)
- ScannerThreadsVoluntaryContextSwitches: 14 (14)
- TotalRawHdfsOpenFileTime(*): 3ms (3129744)
- TotalRawHdfsReadTime(*): 926.21us (926211)
- TotalReadThroughput: 10.8 KiB/s (11075)
- TotalTime: 54.77s (54766111629)
HDFS_SCAN_NODE (id=0)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 3:1/4.70 MB
ExecOption: PARQUET Codegen Enabled, Codegen enabled: 1 out of 1
Hdfs Read Thread Concurrency Bucket: 0:0% 1:0% 2:0% 3:0% 4:0% 5:0% 6:0% 7:0% 8:0% 9:0% 10:0%
File Formats: PARQUET/SNAPPY:1
- AverageHdfsReadThreadConcurrency: 0.00 (0.0)
- AverageScannerThreadConcurrency: 0.00 (0.0)
- BytesRead: 956.9 KiB (979887)
- BytesReadDataNodeCache: 0 B (0)
- BytesReadLocal: 956.9 KiB (979887)
- BytesReadRemoteUnexpected: 0 B (0)
- BytesReadShortCircuit: 956.9 KiB (979887)
- CachedFileHandlesHitCount: 0 (0)
- CachedFileHandlesMissCount: 3 (3)
- CollectionItemsRead: 0 (0)
- DecompressionTime: 1ms (1175394)
- InactiveTotalTime: 0ns (0)
- MaxCompressedTextFileLength: 0 B (0)
- NumColumns: 2 (2)
- NumDictFilteredRowGroups: 0 (0)
- NumDisksAccessed: 1 (1)
- NumRowGroups: 1 (1)
- NumScannerThreadsStarted: 1 (1)
- NumScannersWithNoReads: 0 (0)
- NumStatsFilteredRowGroups: 0 (0)
- PeakMemoryUsage: 3.2 MiB (3392266)
- PerReadThreadRawHdfsThroughput: 1.1 GiB/s (1200419705)
- RemoteScanRanges: 0 (0)
- RowBatchQueueGetWaitTime: 8ms (8679355)
- RowBatchQueuePutWaitTime: 0ns (0)
- RowsRead: 52,115 (52115)
- RowsReturned: 52,115 (52115)
- RowsReturnedRate: 4868064 per second (4868064)
- ScanRangesComplete: 1 (1)
- ScannerThreadsInvoluntaryContextSwitches: 0 (0)
- ScannerThreadsTotalWallClockTime: 11ms (11603409)
- MaterializeTupleTime(*): 1ms (1465282)
- ScannerThreadsSysTime: 2ms (2006000)
- ScannerThreadsUserTime: 6ms (6044000)
- ScannerThreadsVoluntaryContextSwitches: 14 (14)
- TotalRawHdfsOpenFileTime(*): 2ms (2488801)
- TotalRawHdfsReadTime(*): 816.29us (816287)
- TotalReadThroughput: 0 B/s (0)
- TotalTime: 10ms (10705486)
Created 10-04-2022 01:13 PM
I think Network time is a symptom of slow I/O...Note the RowBatchQueueGetWaitTime, between the two scans...8ms in your fast scan, the slow is 56seconds.
Notes onRowBatchQueueGetWaitTime:
Impala scanners internally have a RowBatch queue that allows Impala to decouple I/O from CPU processing. The I/O threads read data into RowBatches and put them into a queue, CPU threads asynchronously fetch data from the queue and process them. RowBatchQueueGetWaitTime is the amount of time CPU threads wait on data to arrive into the queue. Essentially, it means the CPU threads were waiting a long time for the I/O threads to read the data.
Created on 10-03-2022 02:52 PM - edited 10-03-2022 02:53 PM
All other evidence being equal this scenario suggests very possible disk level problem on the slow host. You could try brute force and restart the Impala Daemon on that host or do some light diagnostics, like hdparm or fio, depending on how much time you want to invest.
host one
- RowsReturned: 52,115 (52115)
- RowsReturnedRate: 4868064 per second (4868064)
host 2
- RowsReturned: 52,319 (52319)
- RowsReturnedRate: 955 per second (955)
If these obvious disk checks don't show up anything useful , send full query profile of a fast query and the slow.
Created 10-03-2022 06:48 PM
Is 'RowsReturnedRate' data read from disk?
Compared to the fast 'HDFS Scanner Average Read Throughput' value of 988.1 MiB/s, the 'Threads: Network Receive Wait Time' value was slow at 59.5m, so I did not think it was disk I/O.
Don't you think it's the process of quickly reading from disk and sending it over the network?
Created 10-04-2022 01:13 PM
I think Network time is a symptom of slow I/O...Note the RowBatchQueueGetWaitTime, between the two scans...8ms in your fast scan, the slow is 56seconds.
Notes onRowBatchQueueGetWaitTime:
Impala scanners internally have a RowBatch queue that allows Impala to decouple I/O from CPU processing. The I/O threads read data into RowBatches and put them into a queue, CPU threads asynchronously fetch data from the queue and process them. RowBatchQueueGetWaitTime is the amount of time CPU threads wait on data to arrive into the queue. Essentially, it means the CPU threads were waiting a long time for the I/O threads to read the data.
Created 10-07-2022 01:58 PM
@coco Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks
Regards,
Diana Torres,