Created 05-02-2018 09:19 AM
We're running an HDP 2.5 cluster, and today we noticed a series of dr.who "MYYARN" applications running, failing, and then resubmitting to YARN again and again, in what seems to be an infinite loop. We can't figure out what the applications are doing or why they are failing. Any thoughts? Many thanks in advance!
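For anyone triaging the same thing, a minimal sketch of how to see who is submitting these applications and which queue they land in (the application id below is a placeholder):

```
# List everything queued or running, with the submitting user, queue and application type
yarn application -list -appStates ACCEPTED,RUNNING

# Inspect one of the suspicious ids in detail (placeholder id)
yarn application -status application_1525248000000_0001
```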
Created 05-02-2018 03:04 PM
No. I'm the only user connected. And while my cluster is not kerberized, my Ambari connection is made through HTTPS.
Created 05-02-2018 12:16 PM
Our jobs are indeed stuck in "ACCEPTED" status, and then eventually fail due to a time-out. I can't get any further useful log information. Having checked the RM UI logs for "FAILED" jobs, I noticed it started on April 30, ran for 3 hours straight, then stopped. It started again on May 1 and has continued up until today.
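A sketch of how to pull the same information from the command line instead of the RM UI (assumes log aggregation is enabled; the application id is a placeholder):

```
# List the failed applications with their start/finish times
yarn application -list -appStates FAILED

# Fetch the aggregated container logs for one of them
yarn logs -applicationId application_1525248000000_0002
```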
Created 05-02-2018 01:21 PM
This is typically what happens when resources are exhausted. That could be the memory on the nodes, but also the capacity of the queue itself. Can you check whether the jobs getting stuck are all submitted to the same queue?
https://community.hortonworks.com/questions/96750/yarn-application-stuck-in-accepted-state.html
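A quick way to check that (a sketch; the queue name "default" and the ResourceManager host are assumptions):

```
# Show capacity and current usage for a specific queue
yarn queue -status default

# Or look at the scheduler and cluster-wide metrics through the RM REST API
curl -s http://rm-host:8088/ws/v1/cluster/scheduler
curl -s http://rm-host:8088/ws/v1/cluster/metrics
```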
Created 05-02-2018 01:34 PM
I'm having the exact same issue. All of a sudden yesterday - on a cluster that has been up and running for weeks - it started spawning six of these at a time for no apparent reason. I kill them and they come back. I've pored over every single log, checked every nook and cranny, and cannot figure it out. I have no idea where they are coming from. It is most definitely not a resource issue - these jobs shouldn't even be running - and it's not cron. They suck up major CPU when they run.
If anyone has any thoughts I'd be grateful to hear them!
The other odd thing is that in the past I would see one of these jobs - but only one - never like this.
To add some detail: the jobs sit in ACCEPTED status as user dr.who and are called MYYARN. I've bounced my cluster several times, there are no cron jobs, and it is most definitely not a resource issue. Looking at old logs, it looks like this happened periodically before - but only once or twice, and then it stopped. Yesterday it started running wild, and as quickly as I kill them off it starts another six of the exact same job. If anyone has any insight at all I'd be grateful.
And I'm not even using HDP - this is standard Apache Hadoop/Yarn/Spark 2.7.5
Created 05-02-2018 01:36 PM
I am wondering if this is a security loophole, since my cluster is not yet kerberized!
Created 05-02-2018 01:51 PM
I have the same question (for the same reason, i.e. not being kerberized yet).
Created 05-02-2018 01:48 PM
A temporary workaround could be to set hadoop.http.staticuser.user=testuser and assign testuser to a queue (e.g. testqueue) with 1% of the resources?
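A rough sketch of what that could look like (the user name "testuser" and queue name "testqueue" come from the suggestion above; the rest is an assumption about a Capacity Scheduler setup):

```
# core-site.xml - replace the default static web user (dr.who)
#   hadoop.http.staticuser.user = testuser
#
# capacity-scheduler.xml - give testuser's queue almost nothing
#   yarn.scheduler.capacity.root.queues                     = default,testqueue
#   yarn.scheduler.capacity.root.default.capacity           = 99
#   yarn.scheduler.capacity.root.testqueue.capacity         = 1
#   yarn.scheduler.capacity.root.testqueue.maximum-capacity = 1
#   yarn.scheduler.capacity.queue-mappings                  = u:testuser:testqueue
#
# Pick up the queue changes without restarting the ResourceManager
yarn rmadmin -refreshQueues
```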
Created 05-02-2018 10:21 PM
(this might be the real answer)
It looks like some kind of attack. I have seen it on 2 clusters, one running HDP and one running Hadoop 2.7.4.
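One way to check whether your ResourceManager is exposed (a sketch; the host is a placeholder): if an unauthenticated POST to the RM REST API hands back a new application id, anything that can reach port 8088 can submit jobs as dr.who.

```
# If this returns an application-id, anonymous submission is possible
curl -s -X POST http://rm-host:8088/ws/v1/cluster/apps/new-application
```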
Created 05-02-2018 11:01 PM
Using the iptables firewall, I blocked port 8088 and the situation improved. Too soon to tell if this is a real fix.
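For reference, a sketch of that kind of rule, restricting the RM web/REST port to a trusted subnet rather than blocking it outright (the subnet is a placeholder; 8088 is the default yarn.resourcemanager.webapp.address port):

```
# Allow the trusted subnet, drop everything else on the RM web/REST port
iptables -A INPUT -p tcp --dport 8088 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8088 -j DROP
```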