
Spark jobs stuck in ACCEPTED state in YARN despite available resources

New Contributor

Hello,

 

I'm trying to set up a Hadoop/Spark cluster using YARN through Ambari and I'm running into an issue. My setup is the following:

4 nodes with 2 node labels:

hadoop-master (label hadoop, exclusive)

hadoop-slave (label hadoop, exclusive)

spark-master (label spark)

spark-slave (label spark)

 

I have set up a YARN queue that has access to the spark label, and I'm trying to run a SparkPi job on that queue in cluster mode. However, even though resources should be available, the job never goes past the ACCEPTED state.
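
For reference, the labels are configured so that they are equivalent to something like the following (I set them up through Ambari, so these yarn rmadmin commands are just the CLI equivalent, not what I literally ran):

yarn rmadmin -addToClusterNodeLabels "hadoop(exclusive=true),spark(exclusive=false)"
yarn rmadmin -replaceLabelsOnNode "hadoop-master=hadoop hadoop-slave=hadoop spark-master=spark spark-slave=spark"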

 

Here are the client logs from submitting the job:

20/05/06 08:38:11 INFO SecurityManager: Changing view acls to: tab
20/05/06 08:38:11 INFO SecurityManager: Changing modify acls to: tab
20/05/06 08:38:11 INFO SecurityManager: Changing view acls groups to:
20/05/06 08:38:11 INFO SecurityManager: Changing modify acls groups to:
20/05/06 08:38:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls enabled; users  with view permissions: Set(tab); groups with view permissions: Set(); users  with modify permissions: Set(tab); groups with modify permissions: Set()
20/05/06 08:38:11 INFO Client: Submitting application application_1588667960453_0011 to ResourceManager
20/05/06 08:38:11 INFO YarnClientImpl: Submitted application application_1588667960453_0011
20/05/06 08:38:12 INFO Client: Application report for application_1588667960453_0011 (state: ACCEPTED)
20/05/06 08:38:12 INFO Client:
         client token: N/A
         diagnostics: [Wed May 06 08:38:12 +0000 2020] Application is Activated, waiting for resources to be assigned for AM. Skipping assigning to Node in Ignore Exclusivity mode.  Last Node which was processed for the application : spark-master:45454 ( Partition : [spark], Total resource : <memory:4096, vCores:3>, Available resource : <memory:4096, vCores:3> ). Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:0, vCores:0> ; Queue's Absolute capacity = 20.0 % ; Queue's Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; Queue's capacity (absolute resource) = <memory:0, vCores:0> ; Queue's used capacity (absolute resource) = <memory:0, vCores:0> ; Queue's max capacity (absolute resource) = <memory:0, vCores:0> ;
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: tab
         start time: 1588754291472
         final status: UNDEFINED
         tracking URL: http://hadoop-master:8088/proxy/application_1588667960453_0011/
         user: tab

And here is the command I used:

./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 3 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    --queue tab \
    examples/jars/spark-examples_2.11-2.3.2.3.1.4.0-315.jar 10

Maybe I'm missing something obvious, but I couldn't find it. From a logical point of view everything seemed fine: I only want jobs to run on nodes with the spark label, and I also want a way to allocate specific quotas to users through queues.

 

Edit: I found something that I thought might help, but it wasn't the case. I'm adding the YARN Capacity Scheduler configuration just in case. As you can see, the root and root.tab queues properly have access to the spark node label.

 

yarn.scheduler.capacity.maximum-am-resource-percent=0.8
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.root.accessible-node-labels=spark
yarn.scheduler.capacity.root.accessible-node-labels.spark.capacity=100
yarn.scheduler.capacity.root.accessible-node-labels.spark.maximum-capacity=100
yarn.scheduler.capacity.root.acl_administer_queue=yarn,spark,hive
yarn.scheduler.capacity.root.acl_submit_applications=yarn,ambari-qa
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.dat.accessible-node-labels=spark
yarn.scheduler.capacity.root.dat.accessible-node-labels.spark.capacity=80
yarn.scheduler.capacity.root.dat.accessible-node-labels.spark.maximum-capacity=100
yarn.scheduler.capacity.root.dat.acl_administer_queue=ubuntu
yarn.scheduler.capacity.root.dat.acl_submit_applications=dat
yarn.scheduler.capacity.root.dat.capacity=80
yarn.scheduler.capacity.root.dat.maximum-capacity=100
yarn.scheduler.capacity.root.dat.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.dat.ordering-policy=fifo
yarn.scheduler.capacity.root.dat.priority=0
yarn.scheduler.capacity.root.dat.state=RUNNING
yarn.scheduler.capacity.root.dat.user-limit-factor=1
yarn.scheduler.capacity.root.priority=0
yarn.scheduler.capacity.root.queues=dat,tab
yarn.scheduler.capacity.root.tab.accessible-node-labels=spark
yarn.scheduler.capacity.root.tab.accessible-node-labels.spark.capacity=20
yarn.scheduler.capacity.root.tab.accessible-node-labels.spark.maximum-capacity=100
yarn.scheduler.capacity.root.tab.acl_administer_queue=ubuntu
yarn.scheduler.capacity.root.tab.acl_submit_applications=tab
yarn.scheduler.capacity.root.tab.capacity=20
yarn.scheduler.capacity.root.tab.maximum-capacity=100
yarn.scheduler.capacity.root.tab.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.tab.ordering-policy=fifo
yarn.scheduler.capacity.root.tab.priority=0
yarn.scheduler.capacity.root.tab.state=RUNNING
yarn.scheduler.capacity.root.tab.user-limit-factor=1
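
In case it helps anyone reading, the label and partition state can also be checked from the CLI (these are the standard yarn commands from the documentation, I'm not pasting my actual session output):

yarn cluster --list-node-labels
yarn node -list -all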

 

4 Replies

New Contributor

I have found what I would consider a workaround for the current issue: the job runs when I launch it with the following parameter:

--conf spark.yarn.am.nodeLabelExpression="spark"

I only consider this a workaround because I fail to see why it unblocks the situation.

To me, it looks like the AM container fails to get resources because it requests them from the DEFAULT partition. However, the "spark" node label is not exclusive, which means resources on the spark-master and spark-slave nodes should still be available to be allocated to the AM container.

I would like to find a way to avoid having to pass this extra conf every time I submit a Spark job.
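
One option I still want to try (untested on my side; the property name comes from the Capacity Scheduler documentation) is setting a default node label expression on the queues themselves, so that containers requested without an explicit label expression, including the AM, are mapped to the spark partition automatically:

yarn.scheduler.capacity.root.tab.default-node-label-expression=spark
yarn.scheduler.capacity.root.dat.default-node-label-expression=spark

If that behaves as documented, the per-job --conf should no longer be needed.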

Cloudera Employee

Hi,

 

Did you check whether the queue has enough resources available? Did you try submitting the job to other queues? Please let us know the results.

 

Thanks

AKR

New Contributor

I am not quite sure how to check if the queue has enough resources available. I would assume it does, since I allocated capacity to the queue, then associated the queue with a node label that I know has resources available.

I tried submitting jobs to the peer-level queue named "dat" and the result was the same: the job got stuck in the ACCEPTED state.

Cloudera Employee

Hi,

 

Regarding resource availability, please check the ResourceManager web UI for available/used memory and vcores. Also, on the Scheduler page, check whether the queue has enough resources available. Try submitting the job to another queue and let us know the result.
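
For example, something along these lines (the hostname is taken from the tracking URL in your output; adjust it if your ResourceManager runs elsewhere):

yarn queue -status tab
curl http://hadoop-master:8088/ws/v1/cluster/scheduler
curl http://hadoop-master:8088/ws/v1/cluster/metrics

The scheduler endpoint shows per-queue capacity and used resources, which is usually enough to see where the AM request is getting stuck.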

 

Please also share the ResourceManager logs and the application logs so we can check for any errors.
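
If any containers were launched for the application, the aggregated logs can be pulled with the YARN CLI, for example (using the application ID from your earlier output):

yarn logs -applicationId application_1588667960453_0011

For an application that never left ACCEPTED, the ResourceManager log on the hadoop-master host is usually the more useful one (the exact path depends on your installation).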

 

Thanks

AKR