
Completed containers are not freed - YARN resources not released until the entire job completes

Rising Star

Hi, I have a large cluster with multiple queues, but in this scenario only two are used:

bt.a

bt.b

Each queue can scale to 100% of the cluster (200 nodes total).



The problem: one user is running a large job of 1000 executors. Once fewer than 200 of them remain, for example 150 running (850 completed), the remaining 50 containers should be allocated to the second user.

But the second user does not receive any new allocations; his job starts only once the first user's job has fully completed all of its executors (job completed).


Sometimes it gets to the point where only 1 executor is still running (999 completed), yet 199 containers are "sort of pending" and still hold all the resources; the cluster appears fully occupied, and the resources are freed only when that last executor finishes.

8 REPLIES

Rising Star

Anyone got an idea? This is a major issue for us.

Super Collaborator

Rising Star

This is not the issue. Here is the queue structure; each user can use the entire cluster, but resources are freed only when the entire application (and all of its executors) is done. My problem is that when an executor has finished and there are no pending tasks, the executor is not freed.

For instance, the application submitted a job with 200 tasks and got 200 executors (each with 1 core). When the tasks end, the executors are not freed; in the worst case the application has 1 of 200 executors still running, yet all 200 containers remain allocated. I would expect the other 199 to be returned to the cluster.
[attachment: 107483-queue.png - queue structure]

New Contributor

Hi Ilia, have you solved this issue? Please update.

Rising Star
nope, still looking for a solution

Super Collaborator

For this you need to use Spark dynamic allocation.

Dynamic Allocation (of Executors) (aka Elastic Scaling) is a Spark feature that allows for adding or removing Spark executors dynamically to match the workload.

Unlike the "traditional" static allocation, where a Spark application reserves CPU and memory resources upfront (irrespective of how much it may eventually use), with dynamic allocation you get as much as needed and no more. It scales the number of executors up and down based on workload: idle executors are removed, and when there are pending tasks waiting for executors to be launched on, dynamic allocation requests more.
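
The scale-down behavior described above is governed by a few properties in spark-defaults.conf. A minimal sketch of the relevant knobs, assuming Spark 2.x on YARN; the values shown are the documented defaults and are illustrative only, not taken from this cluster:

    # dynamic allocation plus the external shuffle service (both are needed on YARN)
    spark.dynamicAllocation.enabled                  true
    spark.shuffle.service.enabled                    true
    # an executor with no running tasks is released after this interval
    spark.dynamicAllocation.executorIdleTimeout      60s
    # pending tasks older than this trigger requests for additional executors
    spark.dynamicAllocation.schedulerBacklogTimeout  1s

If executors hold cached data, spark.dynamicAllocation.cachedExecutorIdleTimeout controls when they are released (infinite by default per the Spark documentation).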

Rising Star

Just checked my configs and dynamic allocation is already set:

Advanced spark2-thrift-sparkconf

spark.dynamicAllocation.enabled
spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.minExecutors
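
For reference, a fully populated dynamic-allocation block in that section would typically look like the following. The values are illustrative only (not taken from this cluster), and note that it also includes spark.shuffle.service.enabled, which the reply below calls out:

    spark.dynamicAllocation.enabled           true
    # illustrative bounds; pick values that fit the queue capacity
    spark.dynamicAllocation.initialExecutors  10
    spark.dynamicAllocation.minExecutors      1
    spark.dynamicAllocation.maxExecutors      200
    # required alongside dynamic allocation on YARN
    spark.shuffle.service.enabled             true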


Here are the logs from starting a job:

19/03/31 06:23:17 INFO RMProxy: Connecting to ResourceManager at grid-master.MyDomain.com/XXX.XXX.XXX:8030

19/03/31 06:23:17 INFO YarnRMClient: Registering the ApplicationMaster

19/03/31 06:23:17 INFO Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml

19/03/31 06:23:17 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM@grid-worker-102.MyDomain.com:44575)

19/03/31 06:23:17 INFO YarnAllocator: Will request 1220 executor container(s), each with 1 core(s) and 2432 MB memory (including 384 MB of overhead)

19/03/31 06:23:17 INFO YarnAllocator: Submitted 1220 unlocalized container requests.

19/03/31 06:23:17 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals

19/03/31 06:23:18 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000002 on host grid-02.MyDomain.com for executor with ID 1

19/03/31 06:23:18 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000003 on host grid-worker-101.MyDomain.com for executor with ID 2

19/03/31 06:23:18 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000004 on host grid-04.MyDomain.com for executor with ID 3

19/03/31 06:23:18 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000005 on host grid-05.MyDomain.com for executor with ID 4

19/03/31 06:23:18 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000006 on host grid-03.MyDomain.com for executor with ID 5

19/03/31 06:23:18 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000007 on host grid-worker-102.MyDomain.com for executor with ID 6

19/03/31 06:23:18 INFO YarnAllocator: Received 6 containers from YARN, launching executors on 6 of them.

19/03/31 06:23:18 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000008 on host grid-01.MyDomain.com for executor with ID 7

19/03/31 06:23:18 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.

19/03/31 06:23:19 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000009 on host grid-02.MyDomain.com for executor with ID 8

19/03/31 06:23:19 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000010 on host grid-worker-101.MyDomain.com for executor with ID 9

19/03/31 06:23:19 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000011 on host grid-04.MyDomain.com for executor with ID 10

19/03/31 06:23:19 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000012 on host grid-05.MyDomain.com for executor with ID 11

19/03/31 06:23:19 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000013 on host grid-03.MyDomain.com for executor with ID 12

19/03/31 06:23:19 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000014 on host grid-worker-102.MyDomain.com for executor with ID 13

19/03/31 06:23:19 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000015 on host grid-01.MyDomain.com for executor with ID 14

19/03/31 06:23:19 INFO YarnAllocator: Received 7 containers from YARN, launching executors on 7 of them.

19/03/31 06:23:21 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000016 on host grid-worker-101.MyDomain.com for executor with ID 15

19/03/31 06:23:21 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000017 on host grid-02.MyDomain.com for executor with ID 16

19/03/31 06:23:21 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000018 on host grid-04.MyDomain.com for executor with ID 17

19/03/31 06:23:21 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000019 on host grid-03.MyDomain.com for executor with ID 18

19/03/31 06:23:21 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000020 on host grid-05.MyDomain.com for executor with ID 19

19/03/31 06:23:21 INFO YarnAllocator: Launching container container_e33_1553767920480_0028_01_000021 on host grid-worker-102.MyDomain.com for executor with ID 20


Super Collaborator

You need to follow the steps below; the properties you showed are for the Spark Thrift Server.


Configuring Cluster Dynamic Resource Allocation Manually

To configure a cluster to run Spark jobs with dynamic resource allocation, complete the following steps:

  1. Add the following properties to the spark-defaults.conf file associated with your Spark installation (typically in the $SPARK_HOME/conf directory):

    • Set spark.dynamicAllocation.enabled to true.

    • Set spark.shuffle.service.enabled to true.

  2. (Optional) To specify a starting point and range for the number of executors, use the following properties:

    • spark.dynamicAllocation.initialExecutors

    • spark.dynamicAllocation.minExecutors

    • spark.dynamicAllocation.maxExecutors

  3. Note that initialExecutors must be greater than or equal to minExecutors, and less than or equal to maxExecutors.

    For a description of each property, see Dynamic Resource Allocation Properties.

  4. Start the shuffle service on each worker node in the cluster:

    1. In the yarn-site.xml file on each node, add spark_shuffle to yarn.nodemanager.aux-services, and then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService (a sketch follows this list).

    2. Review and, if necessary, edit spark.shuffle.service.* configuration settings.

      For more information, see the Apache Spark Shuffle Behavior documentation.

    3. Restart all NodeManagers in your cluster.
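
As a sketch of step 4, the NodeManager side of the configuration would look roughly like this in yarn-site.xml. The aux-services value assumes mapreduce_shuffle is already listed, which is typical but should be verified on your nodes:

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <!-- keep any services already listed, e.g. mapreduce_shuffle -->
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>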