Impala concurrent query delay

Explorer

My cluster configuration is as follows:

  1. 3 Node cluster
  2. 128GB RAM per cluster node.
  3. Processor: 16 core HyperThreaded per cluster node.

All 3 nodes run a Kudu master, a Kudu tablet server, and an Impala daemon; one of the nodes also runs the Impala catalog server and the Impala StateStore.

 

My issues are as follows:

 

1) I'm having a hard time configuring dynamic resource pools in Impala for concurrent queries. I've tried setting mem_limit, with no luck. I've also tried static service pools and admission control, but neither gave me the concurrency I need.

     

     I) A single query takes 500-800 ms.

     II) With 10 concurrent queries, the time grows to 3-6 s per query.

     III) With more than 20 concurrent queries, the time exceeds 10 s per query.

 

2) One of my cluster nodes does not take any load when I submit a query; I confirmed this from the query summary. I've tried setting NUM_NODES to 0 and to 1 on the node that is not taking load, but the summary still shows that node doing no work.

 

Master Collaborator
Three nodes is a very small cluster.
Are you load balancing the queries across the Impala daemons?
If just two Impala daemons are working on the query, it means you are running queries on a small amount of data (i.e. the blocks are on just two nodes). What kind of queries are you running? Are there just scans, or broadcasts too?
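
Load balancing in this sense means spreading client connections across all three Impala daemons, so that no single daemon acts as coordinator for every query. A minimal client-side sketch in Python, assuming simple round-robin selection; the hostnames are hypothetical placeholders, and the commented impyla call only indicates where the connection would happen (a proxy such as HAProxy in front of the daemons is another common approach):

```python
import itertools

# Hypothetical hostnames for the three Impala daemons. 21050 is the default
# HiveServer2 port used by Impala clients such as impyla.
COORDINATORS = [
    ("node1.example.com", 21050),
    ("node2.example.com", 21050),
    ("node3.example.com", 21050),
]

# Endless iterator that walks the daemon list in order and wraps around.
_rotation = itertools.cycle(COORDINATORS)


def next_coordinator():
    """Return the next (host, port) pair in round-robin order."""
    return next(_rotation)


def assign_queries(queries):
    """Pair each query with the daemon it should be submitted to."""
    return [(next_coordinator(), sql) for sql in queries]


for (host, port), sql in assign_queries(["SELECT 1"] * 6):
    # With impyla installed, the submission would look roughly like:
    #   from impala.dbapi import connect
    #   cur = connect(host=host, port=port).cursor()
    #   cur.execute(sql)
    print(f"{host}:{port} <- {sql}")
```

With six queries and three daemons, each daemon coordinates two queries. This only balances the coordinator role; which nodes execute the scan fragments still depends on where the data lives.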

Explorer

Q) Are you load balancing the queries across the Impala daemons?

Ans) How do I load balance the queries across Impala daemons?

 

Q) If just two Impala daemons are working on the query, it means you are running queries on a small amount of data (i.e. the blocks are on just two nodes). What kind of queries are you running? Are there just scans, or broadcasts too?

Ans) The queries contain multiple joins across various tables. I've tried a bigger query that takes around 10-15 s, but it still does not go to that specific node. Is there any way to check why the load is not being distributed to that node?

 

According to the Cloudera documentation, NUM_NODES only accepts the values 0 (meaning all nodes) or 1 (meaning all work is done on the coordinator node). Check the documentation for NUM_NODES.

 

Even after setting NUM_NODES to 1 on that specific node, the query still goes to one of the other nodes.

If a single query is able to max out some resource (CPU, I/O, etc.), then increasing query concurrency will not increase query throughput, because the queries were already resource-constrained. That seems to be the effect you're seeing.
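
One rough way to check this against the numbers in the original post is to convert each latency into throughput (queries in flight divided by latency, i.e. Little's law). The sketch below assumes the stated number of queries is always in flight, and treats "more than 20 queries, exceeding 10s" as exactly 20 queries at 10 s, which gives an upper bound on throughput for that case:

```python
def throughput_qps(concurrency, latency_s):
    """Queries completed per second with `concurrency` queries always in flight."""
    return concurrency / latency_s


# (concurrency, best-case latency in s, worst-case latency in s) from the post.
observations = [
    (1, 0.5, 0.8),
    (10, 3.0, 6.0),
    (20, 10.0, 10.0),  # latency "exceeding 10s": throughput is at most this
]

for n, fast, slow in observations:
    low = throughput_qps(n, slow)
    high = throughput_qps(n, fast)
    print(f"{n:>2} concurrent: {low:.2f} to {high:.2f} queries/s")
```

Under these assumptions the cluster completes roughly 1.25 to 3.3 queries per second at every concurrency level, so the per-query latency grows mostly because queries are sharing an already saturated resource.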

Explorer
Then how can I share resources so that each query gets an equal share of CPU and memory?

Cloudera Employee
Impala currently does not support enforcing limits on CPU time. For memory, you can use memory based admission control. Can you please share how you are setting up admission control? Did you set both the "Max Memory" and the "Default Query Memory Limit" for the resource pool?
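
To illustrate why both settings matter, here is a rough back-of-the-envelope sketch. It assumes, as a simplification of how memory-based admission control is usually described, that each admitted query reserves its memory limit on every host it runs on, and that the pool stops admitting once total reservations would exceed "Max Memory". The pool sizes below are hypothetical, not recommendations:

```python
def max_admitted_queries(pool_max_mem_gb, query_mem_limit_gb, num_hosts):
    """Estimate how many queries a resource pool admits at the same time.

    Assumption: every query reserves `query_mem_limit_gb` on each of
    `num_hosts` hosts, and admission stops once the summed reservations
    would exceed `pool_max_mem_gb`.
    """
    per_query_cluster_mem_gb = query_mem_limit_gb * num_hosts
    return int(pool_max_mem_gb // per_query_cluster_mem_gb)


# Hypothetical pool settings for a 3-node cluster.
print(max_admitted_queries(pool_max_mem_gb=240, query_mem_limit_gb=4, num_hosts=3))  # 20
print(max_admitted_queries(pool_max_mem_gb=240, query_mem_limit_gb=8, num_hosts=3))  # 10
```

Under this model, halving the per-query limit doubles the number of queries admitted together; additional queries wait in the pool's queue instead of running. Admission control decides how many queries run at once, it does not make an individual query faster.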

Explorer

Hi @BikramjeetVig

So far I have only tried setting mem_limit in the Impala configuration file, and it did not improve performance under concurrency.