Support Questions

Find answers, ask questions, and share your expertise

Impala Query Slow Problem

New Contributor

Hi all,


Although I have partition the data table, the query duration still take a long time,


I have search into the query profile, most of the time spending in "SCAN HDFS",


How can I improve the query performance?


There are 6 Nodes in my cluster,


Please give me some advice, Thanks!


Paste the query info below:


Query Info
Query ID: fd47ab1669e06b97:c243ead53ffd6baa
User: TL120
Database: default
Coordinator: host4.localdomain
Query Type: QUERY
Query State: FINISHED
Start Time: Mar 12, 2018 1:53:05 PM
End Time: Mar 12, 2018 1:54:05 PM
Duration: 1m
Rows Produced: 588
Admission Result: Unknown
Aggregate Peak Memory Usage: 7.9 GiB
Bytes Streamed: 1.4 MiB
Client Fetch Wait Time: 969ms
Client Fetch Wait Time Percentage: 2
Connected User: TL120
Estimated per Node Peak Memory: 9.0 GiB
HDFS Average Scan Range: 105.1 MiB
HDFS Bytes Read: 183.2 GiB
HDFS Bytes Read From Cache: 0 B
HDFS Bytes Read From Cache Percentage: 0
HDFS Local Bytes Read: 183.2 GiB
HDFS Local Bytes Read Percentage: 100
HDFS Remote Bytes Read: 0 B
HDFS Remote Bytes Read Percentage: 0
HDFS Scanner Average Read Throughput: 624.1 MiB/s
HDFS Short Circuit Bytes Read: 183.2 GiB
HDFS Short Circuit Bytes Read Percentage: 100
Impala Version: impalad version 2.5.0-cdh5.7.x RELEASE (build 7ec1f598396347d722b8c56b8f4f458122ed3f4e)
Memory Accrual: 6,392,067,652 byte seconds
Memory Spilled: 0 B
Network Address: ::ffff:
Node with Peak Memory Usage: host5.localdomain:22000
Out of Memory: false
Per Node Peak Memory Usage: 7.9 GiB
Planning Wait Time: 37ms
Planning Wait Time Percentage: 0
Pool: default-pool
Query Status: OK
Session ID: b3485b36f5f668b7:ade833c16d1086
Session Type: BEESWAX
Statistics Missing: true
Threads: CPU Time: 2.5h
Threads: CPU Time Percentage: 94
Threads: Network Receive Wait Time: 3.9m
Threads: Network Receive Wait Time Percentage: 2
Threads: Network Send Wait Time: 19ms
Threads: Network Send Wait Time Percentage: 0
Threads: Storage Wait Time: 5.7m
Threads: Storage Wait Time Percentage: 4
Threads: Total Time: 2.6h
Work CPU Time: 2.5h
Download Profile...
Query Timeline
Start execution: 33.78us (33.78us)
Planning finished: 37ms (37ms)
Ready to start 11 remote fragments: 40ms (2ms)
All 11 remote fragments started: 47ms (7ms)
Rows available: 59.54s (59.49s)
First row fetched: 59.69s (156ms)
Unregister query: 1.0m (814ms)
Planner Timeline
Analysis finished: 13ms (13ms)
Equivalence classes computed: 14ms (1ms)
Single node plan created: 22ms (8ms)
Runtime filters computed: 22ms (20.58us)
Distributed plan created: 26ms (3ms)
Lineage info computed: 27ms (926.52us)
Planning finished: 30ms (3ms)



Cloudera Employee

Two things that immediately jump out at me:

- Under "File Formats" both parquet and text are listed. For best performance, we recommend using parquet wherever possible.

- The version of Impala you are using is old - 2.5 vs. the latest release of 2.11. There are a number of perf improvements that have gone in recently, though its hard to say if those will help you without knowing more about your use case.


If neither of those suggestions help, you might consider posting the complete query profile, available at the "Download profile" link, which contains a lot more info about what was going on during your query.