I have a YARN application, launched via Oozie in yarn-cluster mode, that sometimes fails with an unknown error.
The stdout and stderr logs from the driver don't show any error (they are cut off in the middle of some INFO messages), but I've found a strange error in the log of the NodeManager running the AM container:
2017-XX-XX XX:XX:XX,XXX WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_e14_14XXXXXXXXXXX_XXXXX_01_000001 and exit code: 12
ExitCodeException exitCode=12:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
    at org.apache.hadoop.util.Shell.run(Shell.java:504)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:373)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
I've searched the documentation for this exit code, but it's not among the standard YARN exit codes.
Does anyone know what this exit code 12 means?
In the logs for the ApplicationMaster/SparkDriver (which were around 4GB) I found a StackOverflowError from the Spark reporter thread. It matches this Spark issue: https://issues.apache.org/jira/browse/SPARK-18750
The job was launched using dynamicAllocation and requested an insane number of containers (16000, each with 20GB/8 cores), and apparently this can cause a StackOverflowError in the Spark thread that manages the executors.
An easy workaround is to disable dynamicAllocation and use a fixed number of executors. With 10 executors the job runs fine.
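For reference, here is a minimal sketch of the settings behind that workaround, written as a small Python helper that builds the `--conf` flags. The helper name `fixed_executor_opts` is made up for illustration; the two property names (`spark.dynamicAllocation.enabled` and `spark.executor.instances`) are standard Spark configuration keys. I'm assuming the flags would then be passed to spark-submit, or placed in the `<spark-opts>` element of the Oozie Spark action.

```python
def fixed_executor_opts(num_executors):
    """Return spark-submit style --conf flags that pin the executor count.

    Hypothetical helper, not part of Spark or Oozie.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")

    settings = {
        # Turn off dynamic allocation so Spark stops requesting containers on demand.
        "spark.dynamicAllocation.enabled": "false",
        # Ask for a fixed number of executors instead.
        "spark.executor.instances": str(num_executors),
    }

    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# 10 executors is the value that worked for the job described above.
print(" ".join(fixed_executor_opts(10)))
```

The printed string can be appended to whatever options the job already uses. Executor memory and cores are left out on purpose, since the right values depend on the cluster.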