yarn job stuck in accepted state randomly with no other jobs running on cluster


I am using 3.0.0.0-1634.

The annoying part is that the issue is random. The same job runs fine, then suddenly fails with this issue. A couple of days later some other job fails, one that had run successfully before and that runs properly again afterwards. No other jobs are running on the cluster at those times.

	18/12/29 20:25:21 INFO mapreduce.Job: Running job: job_1546013184089_0046
18/12/29 20:49:47 INFO mapreduce.Job: Job job_1546013184089_0046 running in uber mode : false
18/12/29 20:49:47 INFO mapreduce.Job:  map 0% reduce 0%
18/12/29 20:49:47 INFO mapreduce.Job: Job job_1546013184089_0046 failed with state FAILED due to: Application application_1546013184089_0046 failed 2 times due to ApplicationMaster for attempt appattempt_1546013184089_0046_000002 timed out. Failing the application.
18/12/29 20:49:47 INFO mapreduce.Job: Counters: 0
18/12/29 20:49:47 ERROR crawl.DeduplicationJob: DeduplicationJob: java.io.IOException: Job failed!
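
The time the job spent waiting can be read off the timestamps in the log above. A minimal Python sketch, assuming the `yy/MM/dd HH:mm:ss` prefix seen in these lines; the helper name `seconds_between` is made up for illustration, and the second log line is abbreviated:

```python
from datetime import datetime

LOG_TS_FORMAT = "%y/%m/%d %H:%M:%S"  # matches a prefix such as "18/12/29 20:25:21"

def seconds_between(first_line: str, second_line: str) -> float:
    """Return elapsed seconds between the timestamps that prefix two log lines."""
    def parse(line: str) -> datetime:
        # The timestamp occupies the first 17 characters of each line.
        return datetime.strptime(line.strip()[:17], LOG_TS_FORMAT)
    return (parse(second_line) - parse(first_line)).total_seconds()

# Lines taken from the log in the post (the second one is shortened).
submitted = "18/12/29 20:25:21 INFO mapreduce.Job: Running job: job_1546013184089_0046"
failed = "18/12/29 20:49:47 INFO mapreduce.Job: Job job_1546013184089_0046 failed with state FAILED"

waited = seconds_between(submitted, failed)
print(f"{waited:.0f} s ({waited / 60:.1f} min)")  # prints "1466 s (24.4 min)"
```

Roughly 24 minutes passed before the failure. That would be consistent with two ApplicationMaster attempts each hitting an expiry of about 10 minutes (the stock value of `yarn.am.liveness-monitor.expiry-interval-ms` is 600000 ms), though the value configured on this cluster is not shown in the post.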

In the ResourceManager UI, under Diagnostics, I found the following error:

	Application application_1546114179060_0069 failed 2 times due to ApplicationMaster for attempt appattempt_1546114179060_0069_000002 timed out. Failing the application.

I also found the following error elsewhere in the log:

	java.lang.Exception: Container is not yet running. Current state is LOCALIZING

It's a 4-node cluster. When nothing is running on the cluster, 100 vcores and 416 GB of memory are available.

	Jobs are submitted through default queues.

	Minimum container size is 4GB.

	Capacity Scheduler:

	capacity-scheduler=null
yarn.scheduler.capacity.maximum-am-resource-percent=0.5
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.acl_submit_applications=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.default.acl_administer_jobs=*
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.default.capacity=100
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.queues=default
yarn.scheduler.capacity.schedule-asynchronously.enable=true
yarn.scheduler.capacity.schedule-asynchronously.maximum-threads=1
yarn.scheduler.capacity.schedule-asynchronously.scheduling-interval-ms=10
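
A common reason for applications waiting in ACCEPTED is the queue's ApplicationMaster resource limit, so it is worth ruling that out with the numbers given here. A rough Python sketch, assuming the limit is approximately cluster memory × queue capacity × `maximum-am-resource-percent`; the real CapacityScheduler computation also considers user limits and node partitions, and the function name is illustrative:

```python
def approx_am_limit_gb(cluster_memory_gb: float,
                       queue_capacity_percent: float,
                       max_am_resource_percent: float) -> float:
    """Approximate memory (GB) that a queue may devote to ApplicationMasters."""
    queue_memory_gb = cluster_memory_gb * queue_capacity_percent / 100.0
    return queue_memory_gb * max_am_resource_percent

# Values from the post: 416 GB total memory, root.default.capacity=100,
# maximum-am-resource-percent=0.5, minimum container size 4 GB.
limit_gb = approx_am_limit_gb(416, 100, 0.5)

# Assumption: each ApplicationMaster fits in one minimum-size (4 GB) container.
am_slots = int(limit_gb // 4)

print(limit_gb, am_slots)  # prints "208.0 52"
```

With roughly 208 GB available to ApplicationMasters and no other jobs running, the AM limit alone is unlikely to explain a single job sitting in ACCEPTED.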

I tried restarting the whole cluster once. I also tried restarting just YARN once.

The funny part is that none of this is needed: the job runs fine if I simply rerun it without changing anything.

I had never faced this issue before; it started appearing suddenly in the last few days.

The only recent change is that I reduced the minimum container size from 13 GB to 4 GB. But everything worked fine for 2 weeks after that, so I don't think that is the cause. Apart from that, I haven't changed anything on the cluster.

1 ACCEPTED SOLUTION


@Suraj Singh

This seems similar to what we discussed in the thread below:

https://community.hortonworks.com/questions/232093/yarn-jobs-are-getting-stuck-in-accepted-state.htm...

Do the jobs succeed if you resubmit them? As discussed earlier, this is an open bug that is fixed in later releases. If you need to apply a patch, please involve Hortonworks support: if you are a customer, HWX can release a patch for you if that is technically possible given the specifics of the JIRAs. If you don't have support, you can certainly apply it yourself, but first test the patch in dev/test and check that it resolves your problem.
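
Until a patched release is in place, it may help to spot applications that linger in ACCEPTED so they can be killed and resubmitted without waiting for the ApplicationMaster timeout. A minimal Python sketch, assuming the ResourceManager REST endpoint `/ws/v1/cluster/apps?states=ACCEPTED` is reachable from where the script runs. The function names, the 5-minute threshold, and the second application ID in the sample are placeholders, not values from this thread:

```python
import json
import urllib.request

def stuck_accepted_apps(apps_json: dict, now_ms: int, max_wait_ms: int) -> list:
    """Return IDs of ACCEPTED apps whose startedTime is older than max_wait_ms."""
    # The RM may return {"apps": null} when nothing matches, hence the fallbacks.
    apps = (apps_json.get("apps") or {}).get("app") or []
    return [a["id"] for a in apps
            if a.get("state") == "ACCEPTED"
            and now_ms - a.get("startedTime", now_ms) > max_wait_ms]

def fetch_accepted_apps(rm_url: str) -> dict:
    """Query the ResourceManager REST API for applications in ACCEPTED state."""
    req = urllib.request.Request(f"{rm_url}/ws/v1/cluster/apps?states=ACCEPTED",
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Made-up sample response for illustration; timestamps are in milliseconds.
sample = {"apps": {"app": [
    {"id": "application_1546013184089_0046", "state": "ACCEPTED", "startedTime": 0},
    {"id": "application_0000000000000_0001", "state": "ACCEPTED", "startedTime": 590_000},
]}}

# Only the first app has waited longer than the 5-minute (300000 ms) threshold.
print(stuck_accepted_apps(sample, now_ms=600_000, max_wait_ms=300_000))
```

Against a live cluster, the result of `fetch_accepted_apps("http://<rm-host>:8088")` would be passed in place of `sample`, with `now_ms` set to the current epoch time in milliseconds. Any application it reports could then be stopped with `yarn application -kill <appId>` and resubmitted. This works around the symptom only; it does not address the underlying bug.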


2 REPLIES


@Jagadeesan A S Could you confirm whether this is the same issue as "revert YARN-6078"?

Is there a way to avoid or minimize the occurrence of the issue by some configuration changes?
