How many concurrent jobs are you running, and how much memory do you have in the cluster?
See where the memory is going.
Maybe if you can post a screenshot of your resource manager status I can help you.
Hi, thanks, I got it. I was confused by the property names for
1. single container max memory and
2. single node TOTAL MAX memory for containers.
Now it's fully utilized. :)
Still in trouble.
I have to decrease the resources given to YARN since it is co-located with HBase under heavy writes/reads.
I've created several queues
One queue is "staging"
Each night several Oozie coordinators wake up and materialize import jobs. Each night all these jobs get stuck. What is the right way to run several Oozie jobs in one queue?
The 1st job starts and takes:
1 container for the AM (the MR2 equivalent of the JobTracker)
1 container for the Oozie launcher (a special mapper that triggers the action execution)
XXX containers for the actual mappers
So a single Oozie job takes 2 applications and 2+ containers.
The 2nd job does the same.
Then there is no room left to run more actions, and the two jobs deadlock.
It sounds like your YARN is now very limited in resources. It's strange that you cannot run 2 Oozie jobs if you reduced the AM to 512MB.
On the Oozie side, I never tried it, but I believe if you set oozie.service.CallableQueueService.callable.concurrency to 1 (the default is 3) it will only try to run one job of each type (Pig, Hive, Sqoop, etc.) at a time, thus avoiding your deadlock.
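If you want to try that, the setting goes in oozie-site.xml; a minimal sketch (the value of 1 is the suggestion above, not a recommended default):

```xml
<!-- oozie-site.xml: limit concurrent callables per action type (default is 3) -->
<property>
  <name>oozie.service.CallableQueueService.callable.concurrency</name>
  <value>1</value>
</property>
```

Oozie needs a restart for this to take effect.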
Hi, the limits for the "staging" pool are:
<= 5 running apps
<= 8 cores
<= 16000MB ram
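For reference, those limits correspond to a Fair Scheduler pool definition roughly like this (a sketch; the queue name and values are taken from the numbers above):

```xml
<!-- fair-scheduler.xml: the "staging" pool limits quoted above -->
<allocations>
  <queue name="staging">
    <maxRunningApps>5</maxRunningApps>
    <maxResources>16000 mb, 8 vcores</maxResources>
  </queue>
</allocations>
```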
2-3 apps can run at once, no problem.
The problem is that 8 apps start at the same time. Each of them occupies a long-running Oozie launcher container, and there are no free resources left to run the actions themselves (Pig/Hive/Sqoop Oozie actions). All submitted jobs get stuck.
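The arithmetic behind this deadlock can be sketched as follows, assuming 1GB containers for the AM, the launcher, and the tasks (the real sizes depend on your configuration):

```python
# Rough model of the deadlock: each Oozie action needs an AM container,
# a launcher container, and at least one task container to make progress.
POOL_MB = 16000          # "staging" pool limit from this thread
AM_MB = 1024             # MR ApplicationMaster container (assumed 1 GB)
LAUNCHER_MB = 1024       # Oozie launcher mapper container (assumed 1 GB)
TASK_MB = 1024           # smallest real task container (assumed 1 GB)

def can_progress(concurrent_jobs):
    """True if, after every job grabs its AM and launcher,
    at least one task container still fits in the pool."""
    overhead = concurrent_jobs * (AM_MB + LAUNCHER_MB)
    return POOL_MB - overhead >= TASK_MB

# 8 jobs at once: 8 * 2 GB of overhead exceeds the 16000 MB pool,
# so no task container can ever start -- a deadlock.
# 7 jobs: 14 GB of overhead still leaves room for one task.
print(can_progress(8), can_progress(7))
```

With these assumed sizes, admitting even 7 jobs leaves only one task slot, so the practical limit is lower still.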
There should be a pattern to resolve such a problem.
I can change the materialization times of the Oozie coordinators, but that is not a real solution.
For example: I have 6 jobs importing from an RDBMS. The RDBMS was down; when it came back up, all 6 jobs started importing at the same time.
The deadlock happened again.
Sergey, that's a slightly different issue. While you can benefit from the slow-start setting, your problem is different.
The main difference from MR1 is that you now need an additional "slot" for the Application Master, which is what coordinates the mappers and reducers. By default each AM takes 1GB of memory, so if you run 16 applications (jobs) in parallel, they will grab 16GB of memory from your cluster before doing anything.
Imagine you have a tiny development cluster with just 16GB of memory: you get a deadlock, and nothing can finish until you start killing jobs.
You can do two things:
- Reduce the AM memory (yarn.app.mapreduce.am.resource.mb). In general 512MB is more than enough; suddenly your 16 jobs need 8GB less RAM.
- Limit the number of concurrent jobs so you can never deadlock. You can set that by updating the Fair Scheduler configuration, but in Cloudera Manager you can do it easily: in the "Clusters" menu select "Dynamic Resource Pools", choose edit for the root queue, and in the YARN tab enter a number for "Max Running Apps".
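If you edit the configuration files directly rather than using Cloudera Manager, the two changes above look roughly like this (the maxRunningApps value of 8 is just an example, size it to your cluster):

```xml
<!-- mapred-site.xml: shrink the MapReduce ApplicationMaster container -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>512</value>
</property>

<!-- fair-scheduler.xml: cap concurrent apps cluster-wide at the root queue -->
<allocations>
  <queue name="root">
    <maxRunningApps>8</maxRunningApps>
  </queue>
</allocations>
```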