I am trying to find the root cause of a recent Spark application failure in production. While the Spark application is running, I can check the NodeManager's yarn.nodemanager.log-dir property to find the Spark executor container logs.
That directory contains logs for both of the running Spark applications:

drwx--x--- 3 yarn yarn  51 Jul 19 09:04 application_1467068598418_0209
drwx--x--- 5 yarn yarn 141 Jul 19 09:04 application_1467068598418_0210
But when an application is killed or crashes, both application log directories are automatically deleted. I have set all the log retention settings in YARN to very large values, yet these logs are still deleted as soon as the Spark applications crash.
Question: How can I retain these Spark application logs in YARN for debugging when a Spark application crashes for some reason?
YARN uses log aggregation to move logs from the worker nodes into HDFS, and it deletes the local copies from the workers once aggregation finishes. The logs are not lost; they have moved. You can retrieve the aggregated logs with the YARN CLI:
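Log aggregation and its retention period are controlled in yarn-site.xml. A minimal sketch, using the standard Hadoop property names (the values shown are illustrative, not recommendations):

```xml
<!-- yarn-site.xml: keep container logs in HDFS after the app ends -->
<property>
  <!-- turn on aggregation so logs are copied to HDFS instead of only
       living (briefly) on the worker's local disk -->
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <!-- how long aggregated logs stay in HDFS; 604800 s = 7 days -->
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>
<property>
  <!-- HDFS directory where aggregated logs land (this is the default) -->
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/tmp/logs</value>
</property>
```

Note that the retention settings you raised only matter once aggregation is enabled; they govern the aggregated copies in HDFS, not the transient local directories.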
yarn logs -applicationId <application ID>
For example: yarn logs -applicationId application_1467068598418_0210
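If you instead want the raw log directories to survive on the worker nodes themselves for a while after the application exits (useful when aggregation is disabled or you need to inspect other container-local files), the NodeManager supports a deletion delay. A hedged sketch, again with an illustrative value:

```xml
<!-- yarn-site.xml on the worker nodes -->
<property>
  <!-- keep finished containers' local dirs and logs for 600 s (10 min)
       before the DeletionService removes them; intended for debugging,
       so avoid large values on busy production nodes -->
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>
```

Restart the NodeManagers after changing this; the directories under yarn.nodemanager.log-dir will then linger for the configured delay even after a crash.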