It is possible for a malicious ApplicationMaster to request far more resources than the cluster can offer. Based on my observation, this starves the whole cluster, since the malicious application reserves all available resources while still waiting for more. It seems YARN eventually times out that AM. The property yarn.am.liveness-monitor.expiry-interval-ms seems relevant, but I don't want legitimate long-running AMs to time out prematurely.
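For reference, this is what I'm looking at in yarn-site.xml; 600000 ms (10 minutes) is, as far as I can tell, the default:

```xml
<!-- yarn-site.xml: timeout after which the RM considers an AM dead
     if it stops heartbeating. 600000 ms (10 min) is my understanding
     of the default; raising it would only delay the expiry. -->
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>
</property>
```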
Similarly, I could submit thousands of applications at the same time to a YARN cluster, and each would launch an AM, which could produce a deadlock situation.
What’s the best way to handle this type of malicious application?
> Similarly, I could submit thousands of applications at the same time to a YARN cluster, and each would launch an AM, which could produce a deadlock situation.
That's true. For now, you can set maxRunningApps per queue in the Fair Scheduler to prevent launching too many AMs, as sketched below. We're planning to address this more directly in a future release, for example by limiting the total resources allocated to AMs.
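A minimal allocation-file sketch; the queue name "users" and the limits here are just placeholders to adjust for your cluster:

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml: the allocation file pointed to by
     yarn.scheduler.fair.allocation.file -->
<allocations>
  <queue name="users">
    <!-- At most 50 apps in this queue may run (i.e. have an AM
         launched) at once; further submissions wait in the
         ACCEPTED state instead of each grabbing an AM container. -->
    <maxRunningApps>50</maxRunningApps>
  </queue>
  <!-- Optional: a default cap for queues that don't set their own. -->
  <queueMaxAppsDefault>100</queueMaxAppsDefault>
</allocations>
```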
Thanks for your reply. Could you let me know the exact property name for the max-allocation limit? Is it yarn.scheduler.maximum-allocation-mb? I'm using the Fair Scheduler.
yarn.scheduler.maximum-allocation-mb is the maximum allocation for any single container, regardless of queue, user, or priority. This setting operates below the application level.
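It goes in yarn-site.xml; 8192 is, if I recall correctly, the default value:

```xml
<!-- yarn-site.xml: cap on any single container request.
     Requests above this value are rejected by the scheduler
     rather than queued. -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>
```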
In the Fair Scheduler, you can specify "maxResources" for any queue, which limits the queue's total memory/CPU usage. This setting operates above the container level. See http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
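In the allocation file that looks like this (queue name and limits are again placeholders):

```xml
<!-- fair-scheduler.xml: cap this queue's aggregate usage. -->
<allocations>
  <queue name="analytics">
    <!-- Total across all containers in the queue, AM containers
         included, so a flood of AMs can't consume the whole cluster. -->
    <maxResources>20000 mb,10 vcores</maxResources>
  </queue>
</allocations>
```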