Created on 07-16-2014 08:47 AM - edited 09-16-2022 02:02 AM
Hi,
we have an 8-node cluster on CDH 5 (5.0.2) with YARN MRv2 in use, and a big problem that is probably due to the configuration.
In addition to Hadoop we also use Impala, so we cannot give all resources to YARN.
Each of our nodes has 128 GB of RAM and 12 cores.
Currently the memory configuration for YARN looks as follows:
mapreduce.map.memory.mb = 8 GiB
mapreduce.reduce.memory.mb = 8 GiB
yarn.app.mapreduce.am.resource.mb = 8 GiB
mapreduce.map.java.opts.max.heap = 6960 MiB
mapreduce.reduce.java.opts.max.heap = 6960 MiB
"Java Heap Size in bytes of NodeManager" = 8 GiB
yarn.nodemanager.resource.memory-mb = 80 GiB
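With a uniform 8 GiB container size, each NodeManager can run at most 10 containers (80 GiB / 8 GiB), and each MR job also needs one 8 GiB container for its ApplicationMaster. For reference, here is roughly what these settings look like in plain mapred-site.xml / yarn-site.xml form (a sketch of what Cloudera Manager generates; the "java.opts.max.heap" values correspond to the -Xmx part of the java.opts):

<!-- mapred-site.xml: per-container sizes and task JVM heaps -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>8192</value>   <!-- 8 GiB map container -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>   <!-- 8 GiB reduce container -->
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>8192</value>   <!-- 8 GiB MR ApplicationMaster container -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx6960m</value>   <!-- task heap kept below the 8 GiB container limit -->
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6960m</value>
</property>

<!-- yarn-site.xml: memory each NodeManager offers to YARN -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>81920</value>   <!-- 80 GiB of the 128 GB; the rest stays free for Impala etc. -->
</property>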
Now we have the problem that when we run multiple applications in parallel, they all stop and none of them finishes.
It looks as if they hang forever. I see no exceptions or errors in /var/log/hadoop-yarn (debug log level).
I would be glad if someone could help. 🙂
Best regards
Created 08-12-2014 10:19 AM
I am experiencing the same problem described above. We have a 4-node cluster using YARN on v5.1.0. I have an Oozie workflow that uses Sqoop to import from MySQL, which is sharded across 10 tables. I therefore have a coordinator that executes the same workflow in 10 simultaneous (parallel) sessions, one pulling from each sharded table.
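For reference, a sketch of the coordinator setup (the app name, paths, and schedule are simplified placeholders, not our exact definitions); the <concurrency> control is what lets the 10 instances run at the same time:

<coordinator-app name="sqoop-shard-import" frequency="${coord:days(1)}"
                 start="2014-08-01T00:00Z" end="2015-08-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <controls>
    <!-- allow all materialized instances to run in parallel -->
    <concurrency>10</concurrency>
  </controls>
  <action>
    <workflow>
      <!-- same workflow app for every shard -->
      <app-path>${nameNode}/user/oozie/apps/sqoop-shard-import</app-path>
    </workflow>
  </action>
</coordinator-app>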
However, sometime after the workflows reach the Sqoop action step, they stop running. The jobs are not failing; rather, they stop processing, even though their status shows "Running" in the Hue workflow dashboard. None of the jobs has had any status update in the syslog for more than 12 hours.
Further, if other, unrelated jobs are submitted, they also appear to hang. I have a job that had been running successfully for several days, executing a DistCp command to import S3 data. This job also hung after I submitted the 10 parallel workflows.
Is there a configuration that must be set to allow the same workflow to be processed in parallel?
Thank you!
Michael Reynolds
Created 08-12-2014 11:38 AM
On a small cluster, sometimes all the resources are occupied by ApplicationMasters (AMs), and no real work gets done. See https://issues.apache.org/jira/browse/YARN-1913. One workaround is to set `maxRunningApps` to a smaller number. See http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
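In the Fair Scheduler allocation file (fair-scheduler.xml) this would look roughly like the following; the queue name and the limit of 5 are just examples:

<?xml version="1.0"?>
<allocations>
  <!-- cap concurrent apps so AMs cannot occupy every container (YARN-1913) -->
  <queue name="default">
    <maxRunningApps>5</maxRunningApps>
  </queue>
  <!-- default cap for queues that do not set their own -->
  <queueMaxAppsDefault>5</queueMaxAppsDefault>
</allocations>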
Created 08-18-2014 03:29 PM
We are still experiencing periodic problems with applications hanging when a number of jobs are submitted in parallel. We have reduced `maxRunningApps`, increased the virtual core count, and also increased `oozie.service.callablequeueservice.threads` to 40. In many cases the applications do not hang; however, this is not consistent.
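For reference, in raw config form these changes correspond roughly to the following (we apply them through Cloudera Manager; the Oozie property name is as in the Oozie documentation, and the vcore value shown is only an example since I have not listed our exact number):

<!-- oozie-site.xml: worker threads for Oozie's internal command queue -->
<property>
  <name>oozie.service.CallableQueueService.threads</name>
  <value>40</value>   <!-- raised from the default of 10 -->
</property>

<!-- yarn-site.xml: vcores each NodeManager offers (example value) -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>24</value>
</property>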
Regarding YARN-1913 (https://issues.apache.org/jira/browse/YARN-1913): is this patch incorporated into CDH 5.1.0, the version we are using? YARN-1913 lists the affected version as 2.3.0 and the fix version as 2.5.0, and the Hadoop version in CDH 5.1.0 is 2.3.0.
Thank you,
Michael Reynolds