Created 12-31-2018 11:33 PM
I am using HDP 3.0.0.0-1634.
The annoying part is that the issue is random. The same job runs fine and then suddenly fails with this error, and a couple of days later some other job that had run successfully before (and will run fine again afterwards) might fail the same way. No other jobs are running on the cluster at that time.
18/12/29 20:25:21 INFO mapreduce.Job: Running job: job_1546013184089_0046
18/12/29 20:49:47 INFO mapreduce.Job: Job job_1546013184089_0046 running in uber mode : false
18/12/29 20:49:47 INFO mapreduce.Job: map 0% reduce 0%
18/12/29 20:49:47 INFO mapreduce.Job: Job job_1546013184089_0046 failed with state FAILED due to: Application application_1546013184089_0046 failed 2 times due to ApplicationMaster for attempt appattempt_1546013184089_0046_000002 timed out. Failing the application.
18/12/29 20:49:47 INFO mapreduce.Job: Counters: 0
18/12/29 20:49:47 ERROR crawl.DeduplicationJob: DeduplicationJob: java.io.IOException: Job failed!
In the ResourceManager UI, under Diagnostics, I found the error below:
Application application_1546114179060_0069 failed 2 times due to ApplicationMaster for attempt appattempt_1546114179060_0069_000002 timed out. Failing the application.
I also found the error below elsewhere in the logs:
java.lang.Exception: Container is not yet running. Current state is LOCALIZING
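In case it helps others hitting the same pattern, this is roughly how I have been digging into it (the application id is the one from above; the NodeManager log path is an assumption on my part and varies by install):

# Pull the aggregated logs for the failed application
yarn logs -applicationId application_1546114179060_0069

# List the attempts and their final state
yarn applicationattempt -list application_1546114179060_0069

# On the node that hosted the AM container, check the NodeManager log around localization
grep -i localiz /var/log/hadoop-yarn/yarn/*nodemanager*.log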
It's a 4-node cluster. 100 vcores and 416 GB of memory are available when nothing is running on the cluster.
Jobs are submitted to the default queue. Minimum container size is 4 GB. Capacity Scheduler settings:

capacity-scheduler=null
yarn.scheduler.capacity.maximum-am-resource-percent=0.5
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.acl_submit_applications=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.default.acl_administer_jobs=*
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.default.capacity=100
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.queues=default
yarn.scheduler.capacity.schedule-asynchronously.enable=true
yarn.scheduler.capacity.schedule-asynchronously.maximum-threads=1
yarn.scheduler.capacity.schedule-asynchronously.scheduling-interval-ms=10
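For what it's worth, my understanding (an assumption, please correct me) is that the "timed out" diagnostic comes from the ResourceManager's AM liveness monitor rather than from the Capacity Scheduler settings above. The relevant yarn-site.xml properties, at their stock defaults on my cluster:

yarn.am.liveness-monitor.expiry-interval-ms=600000 (RM fails the attempt if the AM does not heartbeat within this window)
yarn.resourcemanager.am.max-attempts=2 (which matches the "failed 2 times" in the diagnostics)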
I tried restarting the whole cluster once. I also tried restarting just YARN once.
The funny part is that the job runs fine, without any of the above, if I simply rerun it without changing anything.
I never faced this issue before; it suddenly started popping up a few days ago.
I did change the minimum container size from 13 GB to 4 GB, but everything worked fine for 2 weeks after that, so I don't think that is the issue. Apart from that I haven't changed anything on the cluster.
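For reference, the minimum container size setting I changed is, I believe, the standard minimum-allocation knob in yarn-site.xml; 4 GB corresponds to:

yarn.scheduler.minimum-allocation-mb=4096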
Created 01-02-2019 09:29 AM
@Jagadeesan A S Could you confirm whether this is the same issue as the revert of YARN-6078?
Is there a way to avoid or minimize the occurrence of this issue through configuration changes?
Created 01-02-2019 11:18 AM
This seems similar to what we discussed in the thread below.
Do the jobs succeed when you resubmit them? As discussed earlier, this is an open bug that is fixed in later releases. If you need the patch applied, please involve Hortonworks support. If you are a customer, HWX can release a patch for you if it's technically possible based on the specifics of the JIRAs. If you don't have support, you can certainly apply the patch yourself, but test it first in dev/test and see if it resolves your problem.
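On the configuration question above: since this is a timing bug, the only mitigation I can suggest (a workaround sketch, not a fix for the underlying bug) is to give the AM more headroom in yarn-site.xml, for example:

yarn.am.liveness-monitor.expiry-interval-ms=1200000 (double the default 10-minute AM heartbeat window)
yarn.resourcemanager.am.max-attempts=4 (allow more attempts before failing the application; the default is 2)

Resubmitting the job, as you have seen, also works around it.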