Created 05-08-2021 11:27 PM
I am building a log analysis platform to monitor Spark jobs on a YARN cluster, and I want to get a clear idea of how Spark/YARN logging works. I have searched a lot about this, and these are the confusions I have.
Does the directory specified in spark.eventLog.dir or spark.history.fs.logDirectory store all the application master logs, and can we customize those logs through log4j.properties in the Spark conf directory?
By default, each data node writes its executor logs to a folder under /var/log/. With log aggregation enabled, can those executor logs be collected into the spark.eventLog.dir location as well?
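For context, here is roughly the kind of configuration I am referring to (the paths and values are just illustrative examples, not my actual setup):

```properties
# spark-defaults.conf (example values)
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```

```xml
<!-- yarn-site.xml (example): turning on YARN log aggregation -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/app-logs</value>
</property>
```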
I've managed to set up a 3-node virtual Hadoop YARN cluster, with Spark installed on the master node. When I run Spark in client mode, I'm thinking this node becomes the application master node. I'm a beginner in Big Data and appreciate any effort to help me clear up these confusions.
Thanks for using Cloudera Community & we hope to assist you in your Big Data Learning.
To answer your queries, please find the required details below:
(I) When you run a job in client mode (e.g. spark-shell), the driver runs on the local node from which the job is submitted, so the driver logs are printed to the console itself. Since you mentioned YARN mode, the Application Master and the executors are launched in NodeManagers. In cluster mode, the driver is launched inside the Application Master JVM, and the driver logs are captured in the Application Master logs.
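As an illustration (the application jar is a placeholder), the deploy mode is what decides where the driver, and therefore the driver log, ends up:

```shell
# Client mode: driver runs on the submitting node,
# so the driver log appears on this console.
spark-submit --master yarn --deploy-mode client app.jar

# Cluster mode: driver runs inside the Application Master container,
# so the driver log is captured in the AM's container logs.
spark-submit --master yarn --deploy-mode cluster app.jar

# With log aggregation enabled, container logs (AM + executors)
# can be retrieved after the run with:
yarn logs -applicationId <application_id>
```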
(II) Yes, the two directories you mentioned refer to the event logs (spark.eventLog.dir is where running applications write them; spark.history.fs.logDirectory is where the Spark History Server reads them). You haven't mentioned whether you are using an orchestration tool (Ambari, CM); in any case, log4j.properties needs to be edited to customize the logging. The link refers to a topic with a similar ask.
(III) In Spark on YARN mode, there are three sets of logs:
- Event logs: written by the application to spark.eventLog.dir and read by the Spark History Server from spark.history.fs.logDirectory.
- Driver logs: printed to the console in client mode, or captured in the Application Master container logs in cluster mode.
- Executor logs: written to the NodeManager local log directories on each node (e.g. under /var/log/), and collected into HDFS when YARN log aggregation is enabled.
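Since you are building a log analysis platform: the event logs under spark.eventLog.dir are newline-delimited JSON, one Spark listener event per line, so they are straightforward to parse. A minimal sketch (the sample lines below are a hypothetical excerpt, not output from a real cluster):

```python
import json

# Hypothetical excerpt of a Spark event log: one JSON object per line,
# each with an "Event" field naming the listener event type.
sample_log = "\n".join([
    '{"Event":"SparkListenerApplicationStart","App Name":"demo","Timestamp":1620000000000}',
    '{"Event":"SparkListenerJobStart","Job ID":0}',
    '{"Event":"SparkListenerJobEnd","Job ID":0,"Job Result":{"Result":"JobSucceeded"}}',
    '{"Event":"SparkListenerApplicationEnd","Timestamp":1620000100000}',
])

def count_events(lines):
    """Tally event types from an event-log stream."""
    counts = {}
    for line in lines:
        event = json.loads(line).get("Event", "unknown")
        counts[event] = counts.get(event, 0) + 1
    return counts

counts = count_events(sample_log.splitlines())
print(counts)
```

The same loop works unchanged on a real event-log file opened line by line, which is usually enough to extract job durations and failure counts for monitoring.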
Kindly review and let us know if your questions are answered. Otherwise, do post your queries and we shall assist you.