Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to improve Impala Query Performance?

How to improve Impala Query Performance?

Hi Team,
I am trying to improve the Impala query performance after installation means without tweaking the configuration values which we do after instsallation.
My objective is that when someone has already followed the performance optimization steps(configured Impala) and after that wants to improve the query performance then how he can do that.

My Approach is that I am passing query options in JDBC URL and analysing the query attributes values after query execution. 

I've found following query options which can be helpfull in improving the query performance - 

BATCH_SIZE 
MT_DOP Query
MAX_SCAN_RANGE_LENGTH Query Option
MAX_NUM_RUNTIME_FILTERS Query Option (CDH 5.7 or higher only)
RUNTIME_FILTER_MAX_SIZE Query Option (CDH 5.8 or higher only)
RUNTIME_FILTER_MIN_SIZE Query Option (CDH 5.8 or higher only)
RUNTIME_BLOOM_FILTER_SIZE Query Option (CDH 5.7 or higher only)
OPTIMIZE_PARTITION_KEY_SCANS.


I want to confirm whethere these query option can help me in optimizing the peformance or do I need to use any other approach?

This is how my task will flow - 

(1) First user will run the query, if user will find any problem during executing the query then he will provide the query to me
(2) Then I'll run the query with above mentioned impala query options and analysed the query result. 

Though I've run the various queries many times with different combinations but I've not found any attribute which says that my query is performing good or bad. So far I have found "cm_cpu_milliseconds" query attribute which says that my query is taking this much amount of time.

Which else query attribute I can consider for performance ->

"thread_storage_wait_time" : "4841",
"session_id" : "494e003b0d002762:cea3448d541c5cab",
"hdfs_bytes_read_remote_percentage" : "0",
"stats_missing" : "false",
"thread_total_time" : "72037",
"thread_network_send_wait_time_percentage" : "1",
"thread_network_receive_wait_time_percentage" : "16",
"network_address" : "::ffff:172.29.48.95:49409",
"pool" : "root.imp_1345",
"hdfs_bytes_read_from_cache" : "0",
"hdfs_bytes_read_local_percentage" : "100",
"hdfs_scanner_average_bytes_read_per_second" : "2.0037248013310297E9",
"hdfs_bytes_read_from_cache_percentage" : "0",
"bytes_streamed" : "113333602",
"session_type" : "HIVESERVER2",
"cm_cpu_milliseconds" : "54915.848796",
"thread_network_receive_wait_time" : "11387",
"estimated_per_node_peak_memory" : "1116106843",
"query_status" : "OK",
"thread_storage_wait_time_percentage" : "7",
"memory_aggregate_peak" : "4.3511709696E8",
"admission_result" : "Admitted immediately",
"hdfs_average_scan_range" : "305033.8186144897",
"planning_wait_time" : "161",
"oom" : "false",
"memory_accrual" : "1.5937056E7",
"memory_spilled" : "0",
"thread_cpu_time_percentage" : "76",
"admission_wait" : "0",
"file_formats" : "TEXT/NONE",
"hdfs_bytes_read_local" : "1730456853",
"hdfs_bytes_read_short_circuit" : "1730456853",
"hdfs_bytes_read_short_circuit_percentage" : "100",
"planning_wait_time_percentage" : "4",
"client_fetch_wait_time" : "96",
"client_fetch_wait_time_percentage" : "2",
"memory_per_node_peak_node" : "abc.com",
"memory_per_node_peak" : "4.3511709696E8",
"connected_user" : "abc",
"hdfs_bytes_read_remote" : "0",
"thread_network_send_wait_time" : "891",
"impala_version" : "impalad version 2.7.0-cdh5.9.3 RELEASE (build 3f510986fc1a0e02c5a127303017db5253d93da6)",
"original_user" : "abc.com",
"thread_cpu_time" : "54915",
"hdfs_bytes_read" : "1730456853"


2 REPLIES 2

Re: How to improve Impala Query Performance?

New Contributor

EXPLAIN PLAN is a great tool to help you understand how a query performs. For instance, it will let you know if a full table scan is occuring where you might have expected otherwise. Go here for more info and other tools: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_explain_plan.html 

Highlighted

Re: How to improve Impala Query Performance?

Actually I want to use the statistics which is present the query Id after execution like query attributes "estimated_per_node_peak_memory", "thread_storage_wait_time_percentage", "memory_spilled". I want to compare it after query is execution is completed. 

Which attributes will tell me that my query's performance is good or bad. 

By using which query option(like BATCH_SIZE, MT_DOP, MAX_SCAN_RANGE_LENGTH) query performance will go up. 

I've finalized following query options which can help me in improving query performance - 

BATCH_SIZE 
MT_DOP
MAX_SCAN_RANGE_LENGTH
MAX_NUM_RUNTIME_FILTERS 
RUNTIME_FILTER_MAX_SIZE 
RUNTIME_FILTER_MIN_SIZE 
RUNTIME_BLOOM_FILTER_SIZE 
OPTIMIZE_PARTITION_KEY_SCANS


Can I use any other query options to improve performance?

Which query attributes will tell me that my query is doing good or bad?