Created 05-08-2021 11:27 PM
I am building a log analysis platform to monitor Spark jobs on a YARN cluster, and I want to get a clear idea of how Spark/YARN logging works. I have searched a lot about this, and these are the confusions I have.
Does the directory specified in spark.eventLog.dir or spark.history.fs.logDirectory store all the application master logs, and can we customize those logs through log4j.properties in the Spark conf directory?
By default, each data node writes its executor logs to a folder under /var/log/. With log aggregation enabled, can those executor logs be collected into the spark.eventLog.dir location as well?
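For context, here is roughly the kind of configuration I am referring to (the paths and values are just illustrative examples, not my actual setup):

```properties
# spark-defaults.conf (example values)
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```

```xml
<!-- yarn-site.xml (example): turning on YARN log aggregation -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/app-logs</value>
</property>
```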
I've managed to set up a 3-node virtual Hadoop YARN cluster, with Spark installed on the master node. When I run Spark in client mode, I'm thinking this node becomes the application master node. I'm a beginner in Big Data and appreciate any effort to help me clear up these confusions.
Thanks for using Cloudera Community & we hope to assist you in your Big Data Learning.
To answer your queries, please find the required details below:
(I) When you run a job in client mode (e.g. spark-shell), the driver runs on the local node from which the job is submitted, so the driver logs are printed to the console itself. Since you mentioned YARN mode, the Application Master and the executors are launched in NodeManagers. In cluster mode, the driver is launched inside the Application Master JVM, and the driver logs are captured in the Application Master logs.
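As an illustration (the application jar is a placeholder), the deploy mode is what decides where the driver, and therefore the driver log, ends up:

```shell
# Client mode: driver runs on the submitting node,
# so the driver log appears on this console.
spark-submit --master yarn --deploy-mode client app.jar

# Cluster mode: driver runs inside the Application Master container,
# so the driver log is captured in the AM's container logs.
spark-submit --master yarn --deploy-mode cluster app.jar

# With log aggregation enabled, container logs (AM + executors)
# can be retrieved after the run with:
yarn logs -applicationId <application_id>
```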
(II) Yes, the two directories you mentioned refer to the event logs (spark.eventLog.dir is where running applications write them; spark.history.fs.logDirectory is where the Spark History Server reads them). You haven't mentioned whether you are using an orchestration tool (Ambari, CM); in any case, log4j.properties needs to be edited to customize the logging. The link refers to a topic with a similar ask.
(III) In Spark on YARN mode, there are three sets of logs:
- Event logs: written by the application to spark.eventLog.dir and read by the Spark History Server from spark.history.fs.logDirectory.
- Driver logs: printed to the console in client mode, or captured in the Application Master container logs in cluster mode.
- Executor logs: written to the NodeManager local log directories on each node (e.g. under /var/log/), and collected into HDFS when YARN log aggregation is enabled.
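Since you are building a log analysis platform: the event logs under spark.eventLog.dir are newline-delimited JSON, one Spark listener event per line, so they are straightforward to parse. A minimal sketch (the sample lines below are a hypothetical excerpt, not output from a real cluster):

```python
import json

# Hypothetical excerpt of a Spark event log: one JSON object per line,
# each with an "Event" field naming the listener event type.
sample_log = "\n".join([
    '{"Event":"SparkListenerApplicationStart","App Name":"demo","Timestamp":1620000000000}',
    '{"Event":"SparkListenerJobStart","Job ID":0}',
    '{"Event":"SparkListenerJobEnd","Job ID":0,"Job Result":{"Result":"JobSucceeded"}}',
    '{"Event":"SparkListenerApplicationEnd","Timestamp":1620000100000}',
])

def count_events(lines):
    """Tally event types from an event-log stream."""
    counts = {}
    for line in lines:
        event = json.loads(line).get("Event", "unknown")
        counts[event] = counts.get(event, 0) + 1
    return counts

counts = count_events(sample_log.splitlines())
print(counts)
```

The same loop works unchanged on a real event-log file opened line by line, which is usually enough to extract job durations and failure counts for monitoring.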
Kindly review and let us know if your questions are answered. Otherwise, do post your queries and we shall assist you.