Hadoop and its ecosystem projects are distributed systems. Hadoop provides foundational services, such as a distributed file system and cluster-wide resource management, that many other projects (e.g., Oozie tasks, MapReduce jobs, Tez, Hive queries, Spark jobs) build on.
With this layered approach, a Hive query or a Spark job leads to a sequence of operations that spans multiple systems. For example, a Spark job may ask YARN for containers to run in, and it may also read from HDFS.
Tracing and debugging problems in distributed systems is hard, since it involves following a call across many systems.
Hadoop addresses this problem with a log-tracing feature called caller context (HDFS-9184 and YARN-4349). It helps users diagnose and understand how specific applications affect parts of the Hadoop system and what problems they may be creating (e.g., overloading the NameNode). As noted in HDFS-9184, for a given HDFS operation it is very helpful to track which upper-level job issued it. The upper-level callers may be specific Oozie tasks, MapReduce jobs, Hive queries, or Spark jobs.
Hadoop ecosystem projects such as MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, HIVE-12254), and Pig (PIG-4714) have implemented their own caller contexts. These systems invoke HDFS/YARN APIs to set their caller contexts, and they also expose an API for upstream applications to pass a caller context in.
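The Hadoop-side API referred to here lives in `org.apache.hadoop.ipc.CallerContext` (introduced by HDFS-9184). The following sketch shows roughly how a project might set its caller context; it assumes a Hadoop 2.8+ classpath, and the context string itself is a hypothetical example (each project chooses its own format):

```scala
import org.apache.hadoop.ipc.CallerContext

// Build a caller context string identifying the upstream job.
// The value here is a made-up placeholder; real projects use their
// own identifiers (query IDs, job IDs, etc.).
val context = new CallerContext.Builder("MY_APP_sampleJob_0001").build()

// Make it the current caller context for this thread; subsequent
// HDFS/YARN RPC calls carry it, so it shows up in the NameNode
// audit log and YARN RM log.
CallerContext.setCurrent(context)
```

Note the context is thread-local, so it must be set on the thread that actually issues the HDFS/YARN calls.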
We have implemented the ability for Spark to provide its caller context by invoking these Hadoop APIs, and we have also exposed an API for Spark applications to set their own caller contexts. Spark combines its own caller context with the caller contexts of its upstream applications and writes the result into the YARN RM log and the HDFS audit log. This makes it possible to trace from a Spark application down to the YARN and HDFS resources it uses.
There are two ways to configure the `spark.log.callerContext` property:
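The first way is to set the property programmatically when building the SparkSession. A minimal sketch, where `infoSpecifiedByUpstreamApp` is a placeholder for whatever identifier the upstream application wants to appear in the logs:

```scala
import org.apache.spark.sql.SparkSession

// Set the caller context in application code. This works in client
// mode, where the driver starts before the YARN client and AM.
val spark = SparkSession.builder()
  .appName("CallerContextExample")
  .config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
  .getOrCreate()
```

The second way is to set the property in `spark-defaults.conf` (or via `--conf` on `spark-submit`), which is required in YARN cluster mode as explained below.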
When running in YARN cluster mode, the driver cannot pass `spark.log.callerContext` to the YARN client and AM, because both have already started before the driver executes `.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")`. So for YARN cluster mode, users should set the `spark.log.callerContext` property in `spark-defaults.conf` instead of in the application's source code.
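For YARN cluster mode, the equivalent entry in `conf/spark-defaults.conf` might look like the following (the value is a placeholder for the upstream application's identifier):

```
spark.log.callerContext    infoSpecifiedByUpstreamApp
```

The same effect can be achieved per-submission with `--conf spark.log.callerContext=infoSpecifiedByUpstreamApp` on the `spark-submit` command line.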
The following example shows the command line used to submit a SparkKMeans application and the corresponding records in YARN RM log and HDFS audit log.