
Application Failed for YARN exit code 12

Contributor

Hi,

 

I have a YARN application launched via Oozie in yarn-cluster mode that sometimes fails with an unknown error.

 

The stdout and stderr logs from the driver don't show any error (they are cut off in the middle of some INFO messages), but I've found a strange error in the log of the NodeManager running the AM container:

 

2017-XX-XX XX:XX:XX,XXX WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_e14_14XXXXXXXXXXX_XXXXX_01_000001 and
 exit code: 12
ExitCodeException exitCode=12: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:373)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I've searched the documentation for this exit code but it's not included in the standard YARN exit codes:

https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_sg_yarn_container_exec_errors.htm...

 

Does anyone know what this exit code 12 means?

 

2 REPLIES

Champion
I would track down the logs for container container_e14_14XXXXXXXXXXX_XXXXX_01_000001. Those should contain more details on the actual error.

Contributor

In the logs for the ApplicationMaster/Spark driver (which were around 4 GB) I found a StackOverflowError from the Spark reporter thread. I also found the Spark issue https://issues.apache.org/jira/browse/SPARK-18750, which matches my error.

 

The job was launched with dynamicAllocation enabled and requested an insane number of containers (16000, each with 20 GB / 8 cores), and apparently this can cause a StackOverflowError in the Spark thread managing the executors.

 

An easy workaround is to disable dynamicAllocation and use a fixed number of executors. With 10 executors the job runs fine.
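
A minimal sketch of that workaround, assuming the settings can be applied programmatically through SparkConf (when launching via an Oozie spark action, the same keys can be passed in the action's spark-opts instead); the app name is a placeholder and the executor sizing just mirrors the numbers above:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: turn off dynamic allocation and pin a fixed number of executors.
val conf = new SparkConf()
  .setAppName("exit-code-12-workaround")            // hypothetical app name
  .set("spark.dynamicAllocation.enabled", "false")  // stop requesting executors dynamically
  .set("spark.executor.instances", "10")            // the fixed executor count that worked for me
  .set("spark.executor.memory", "20g")
  .set("spark.executor.cores", "8")

val sc = new SparkContext(conf)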