
YARN applications hang forever if run in parallel

Explorer

Hi,

we have an 8-node cluster on CDH 5 (5.0.2) with YARN MRv2 in use, and a big problem that is probably due to the configuration.

In addition to Hadoop we also run Impala, so we cannot give all resources to YARN.

Each of our nodes has 128 GB of RAM and 12 cores.

Currently the memory configuration for YARN looks as follows:

mapreduce.map.memory.mb = 8 GiB
mapreduce.reduce.memory.mb = 8 GiB
yarn.app.mapreduce.am.resource.mb = 8 GiB
mapreduce.map.java.opts.max.heap = 6960 MiB
mapreduce.reduce.java.opts.max.heap = 6960 MiB
"Java Heap Size in bytes of NodeManager" = 8 GiB
yarn.nodemanager.resource.memory-mb = 80 GiB

Now we get the problem that if we run multiple applications in parallel, they all stop and none of them finishes. It looks as if they hang forever. I see no exceptions or errors in "/var/log/hadoop-yarn" (debug log level).

I would be glad if someone could help. 🙂

Best regards

1 ACCEPTED SOLUTION

Rising Star

On a small cluster, sometimes all the resources are occupied by ApplicationMasters (AMs) and no real work gets done; see https://issues.apache.org/jira/browse/YARN-1913. One workaround is to set `maxRunningApps` to a smaller number in the Fair Scheduler allocation file; see http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
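
For illustration, a minimal fair-scheduler.xml sketch (the queue name and the limit of 5 are placeholders; pick a cap low enough that AMs alone can never fill the cluster):

<?xml version="1.0"?>
<!-- fair-scheduler.xml (sketch): limit how many applications run at once -->
<allocations>
  <queue name="default">
    <!-- each running app holds a full AM container (8 GiB in the setup above),
         so keep this well below the cluster's total container capacity -->
    <maxRunningApps>5</maxRunningApps>
  </queue>
  <!-- default cap applied to any other queues -->
  <queueMaxAppsDefault>5</queueMaxAppsDefault>
</allocations>

The scheduler reads this file from the path set in yarn.scheduler.fair.allocation.file and reloads it periodically, so the limits can be adjusted without a restart.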


5 REPLIES

Mentor
Can you post more details on what you mean by 'multiple applications' (and how many, exactly), as well as your scheduler configuration?

What behaviour do you notice exactly when you say they all 'stop'? Do you mean their AppMasters run but the actual application containers (i.e. map or reduce tasks) do not run, or do you mean they all just fail?

Contributor

I am experiencing the same problem stated earlier. We have a 4-node cluster using YARN on v5.1.0. I have an Oozie workflow that uses Sqoop to import from MySQL, which is sharded across 10 tables. Therefore, I have a coordinator that executes the same workflow in 10 simultaneous (parallel) sessions, one pulling from each sharded table, as sketched below.
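
A stripped-down sketch of that coordinator (the name, dates, and path are placeholders rather than our exact definition; <concurrency> is the Oozie control element that permits the parallel runs):

<coordinator-app name="shard-import" frequency="${coord:days(1)}"
                 start="2014-08-01T00:00Z" end="2015-08-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <!-- allow up to 10 materialized actions to run at the same time -->
    <concurrency>10</concurrency>
  </controls>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>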

 

However, some time after the workflows reach the Sqoop action step, they stop running. The jobs are not failing; rather, they stop processing, even though their status shows "Running" in the Hue workflow dashboard. None of the jobs has had a status update in the syslog for more than 12 hours.

Furthermore, if other, unrelated jobs are submitted, they also appear to hang. I have a job that had been running successfully for several days, executing a DistCp command to import S3 data; it has also hung since I submitted the 10 parallel workflows.

Is there a configuration that must be set to allow the same workflow to be processed in parallel?

Thank you!

Michael Reynolds


Contributor
Thank you very much for your assistance! It is now working fine.

Michael Reynolds

Contributor

We are still experiencing periodic problems with applications hanging when a number of jobs are submitted in parallel. We have reduced `maxRunningApps`, increased the virtual core count, and also increased `oozie.service.CallableQueueService.threads` to 40. In many cases the applications do not hang; however, this is not consistent.
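
For reference, the Oozie queue setting lives in oozie-site.xml; we set it like this:

<!-- oozie-site.xml: thread pool size of Oozie's internal callable queue -->
<property>
  <name>oozie.service.CallableQueueService.threads</name>
  <value>40</value>
</property>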

 

Regarding YARN-1913 (https://issues.apache.org/jira/browse/YARN-1913): is this patch incorporated into CDH 5.1.0, the version we are using? The JIRA lists the affected version as 2.3.0 and the fix version as 2.5.0, and the Hadoop version in CDH 5.1.0 is 2.3.0.

Thank you,

Michael Reynolds