Impala queries are not distributing to all the executors

Contributor

Hello Team,

 

We have a 5-node Impala cluster (1 coordinator and 4 executors) running Impala 3.2.0. Each Impala node has 32 GB of memory and 4 cores. We are now facing an issue where sometimes 2-3 of the Impala executors are over-utilised (80-90% memory usage) while the others are not. For example, if executors 1 and 3 have memory usage of more than 80% and a new query is issued, it fails saying it could not allocate space (512 MB) on executor 1, even though there is more than enough memory on the other executors (2, 4 and 5), whose memory utilisation is under 20%. The following is the error I receive:

 

Memory limit exceeded: Failed to allocate row batch EXCHANGE_NODE (id=148) could not allocate 16.00 KB without exceeding limit. Error occurred on backend impala-node-executor-1:22000 Memory left in process limit: 19.72 GB Query(2c4bc52a309929f9:2fa5f79d00000000): Reservation=998.62 MB ReservationLimit=21.60 GB OtherMemory=104.12 MB Total=1.08 GB Peak=1.08 GB Unclaimed reservations: Reservation=183.81 MB OtherMemory=0 Total=183.81 MB Peak=398.94 MB Fragment 2c4bc52a309929f9:2fa5f79d00000021: Reservation=0

 

How are query fragments distributed among the Impala executors?

Is there a way to load-balance the query load among executors when we have dedicated executors and a coordinator?

What are good practices for proper utilisation of the Impala cluster?

Regards

Parth

 

5 REPLIES


You want to enable memory-based admission control - https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_con... . Without it enabled, memory reservation for queries is best-effort - queries just run and get whatever memory they ask for until memory is exhausted. With it enabled, queries are allocated specific amounts of memory and are queued when memory is low.
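
For example, if the pools are not managed through Cloudera Manager, a minimal single-pool setup can be sketched with impalad startup flags along these lines (the flag values here are placeholder illustrations, not recommendations from this thread):

  # appended to the impalad startup options on every node - illustrative values only
  --default_pool_max_requests=5      # at most 5 queries run concurrently in the default pool
  --default_pool_max_queued=20       # further queries wait in the queue instead of failing immediately
  --default_pool_mem_limit=20g       # aggregate memory across all queries admitted to the pool
  --queue_wait_timeout_ms=60000      # a queued query fails if it is not admitted within 60 seconds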

 

https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_rm_example.html is a good starting point. I'd recommend setting a minimum and maximum memory limit, probably a minimum of ~1GB and a maximum of whatever you're comfortable with a single query being given.
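
As a small illustration of the per-query cap (the 2g value is only an example), a memory limit can also be set per session in impala-shell with the MEM_LIMIT query option, which bounds how much memory a query may use on each node:

  -- in impala-shell: cap queries in this session at 2 GB per impalad
  SET MEM_LIMIT=2g;
  -- subsequent queries in the session now run under that per-node limit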

I also gave a talk a while ago that gives an overview of some things - https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/73000.html

 

That all said, scheduling is based on data locality/affinity - the read of each input file is scheduled on a node with a local replica of that file. There's also affinity to bias scheduling towards a single replica, so that the same data is read on the same node as much as possible. This minimizes network traffic and maximizes use of the OS buffer cache (i.e. maximizes the likelihood of reading the data from memory instead of disk).
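
If you want to check how a given query's fragments were actually spread across the executors, one way (just an illustration, not something prescribed above) is to run the query in impala-shell and then look at the per-host breakdown:

  -- in impala-shell, immediately after the query finishes
  SUMMARY;   -- per-operator view: #Hosts, average/max time, peak memory
  PROFILE;   -- full runtime profile, including per-host fragment instances and their memory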

Contributor

@Tim Armstrong  If we implement admission control then we can reduce the memory exceptions, but we can still encounter situations where queries are not admitted. With admission control and resource pools we can prioritise queries from a certain query pool to get the resources first - please correct me if I am wrong.

 

And w.r.t. scheduling:

In our setup Impala reads data from Kudu, and the Impala and Kudu services are located on different nodes. So how does scheduling work in this case?

 

Parth


You can limit the aggregate memory that any one pool will consume. There isn't exactly a priority option (there's no ability to pre-empt queries once they are running).
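
To make that concrete with made-up numbers: if a pool's aggregate memory limit is 40 GB and each query in the pool runs with a 4 GB memory limit, roughly 10 such queries can be admitted at once; an 11th query is queued (or eventually times out) rather than pre-empting the queries already running. The 40 GB and 4 GB figures are purely illustrative.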

Contributor

@Tim Armstrong  got your point. 

 

Can you shed some more light on how the Impala query fragments are distributed? When the Impala and Kudu services are located on different nodes, the data locality principle won't hold - correct me if I am wrong.

Parth 


In that case - scheduling of remote reads - for Kudu it's based on distributing the work for each scan across nodes as evenly as possible. We randomize the assignment somewhat to even things out, but the distribution is not based on resource availability.

 

I.e. we generate the schedule and then wait for the resources to become available on the nodes we picked. I understand that reversing that (i.e. find available nodes, then distribute work on them) would be desirable in some cases but there are pros and cons of doing that.

For remote reads from filesystems/object stores, on more recent versions, we do something a bit different - each file has affinity to a set of executors and we try to schedule it on those so that we're more likely to get hits in the remote data cache.