Background
Hadoop and its ecosystem projects are distributed systems. Hadoop provides foundational services, such as a distributed file system and cluster-wide resource management, that many other projects (e.g., Oozie, MR jobs, Tez, Hive queries, Spark jobs) leverage.
With this layered approach, a Hive query or a Spark job leads to a sequence of operations that spans multiple systems. For example, a Spark job may leverage YARN to obtain containers to run in, and it may also read from HDFS.
Tracing and debugging problems in distributed systems is hard because it involves following calls across many systems.
Hadoop addresses this problem with a log-tracing feature called caller context (HDFS-9184 and YARN-4349). It helps users diagnose and understand how specific applications impact parts of the Hadoop system and what problems they may be creating (e.g., overloading the NameNode). As noted in HDFS-9184, for a given HDFS operation it is very helpful to track which upper-level job issued it. The upper-level callers may be specific Oozie tasks, MR jobs, Hive queries, or Spark jobs.
Hadoop ecosystem projects such as MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, HIVE-12254) and Pig (PIG-4714) have implemented their own caller contexts. These systems invoke HDFS/YARN APIs to set their caller contexts, and they also expose an API so that their callers can pass a caller context in.
We have now implemented the same ability in Spark: Spark provides its caller context by invoking the Hadoop APIs, and it exposes an API for Spark applications to set up their own caller contexts. Spark combines its own caller context with the caller context of its upstream application and writes the result into the YARN RM log and the HDFS audit log. This makes it possible to trace a Spark application down to the YARN and HDFS resources it uses.
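Under the hood, these caller contexts rest on Hadoop's org.apache.hadoop.ipc.CallerContext API (Spark invokes it via reflection, so it can still run against Hadoop versions that predate the feature). Below is a minimal sketch of the underlying API, assuming Hadoop 2.8+ where HDFS-9184/YARN-4349 are available; the context string here is made up for illustration:

import org.apache.hadoop.ipc.CallerContext

// The caller context is a per-thread string that Hadoop's RPC layer attaches
// to every HDFS/YARN call issued from that thread; the NameNode and the RM
// then record it in the HDFS audit log and the YARN RM log respectively.
val context = new CallerContext.Builder("SPARK_exampleCallerContext").build()
CallerContext.setCurrent(context)
// From here on, RPCs made by this thread carry the string above in their
// callerContext field.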
There are two ways to configure the `spark.log.callerContext` property:
In spark-defaults.conf:
spark.log.callerContext infoSpecifiedByUpstreamApp
or:
In the app's source code (works only in client mode):
val spark = SparkSession
  .builder
  .appName("SparkKMeans")
  .config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
  .getOrCreate()
When running in YARN cluster mode, the driver cannot pass `spark.log.callerContext` to the YARN client and the AM, because the YARN client and the AM have already started by the time the driver executes `.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")`. So for YARN cluster mode, configure the `spark.log.callerContext` property in spark-defaults.conf instead of in the app's source code.
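Since spark-submit applies `--conf` settings before the YARN client starts, passing the property on the command line should also take effect in both deploy modes, for example (remaining arguments as in the submit command in the next example):

./bin/spark-submit --conf spark.log.callerContext=infoSpecifiedByUpstreamApp ...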
The following example shows the command line used to submit a SparkKMeans application and the corresponding records in the YARN RM log and the HDFS audit log.
./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5
…
2017-02-08 16:04:14,856 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wyang IP=127.0.0.1 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1486594207699_0011 CALLERCONTEXT=SPARK_CLIENT_application_1486594207699_0011_infoSpecifiedByUpstreamApp
…
2017-02-08 16:04:14,806 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/.sparkStaging/application_1486594207699_0011/__spark_conf__.zip dst=null perm=null proto=rpc callerContext=SPARK_CLIENT_application_1486594207699_0011_infoSpecifiedByUpstreamApp
2017-02-08 16:04:14,807 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/.sparkStaging/application_1486594207699_0011/__spark_conf__.zip dst=null perm=null proto=rpc callerContext=SPARK_CLIENT_application_1486594207699_0011_infoSpecifiedByUpstreamApp
…
2017-02-08 16:04:22,725 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/spark-history dst=null perm=null proto=rpc callerContext=SPARK_APPMASTER_application_1486594207699_0011_1_infoSpecifiedByUpstreamApp
2017-02-08 16:04:22,742 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=create src=/spark-history/application_1486594207699_0011_1.lz4.inprogress dst=null perm=wyang:supergroup:rw-r--r-- proto=rpc callerContext=SPARK_APPMASTER_application_1486594207699_0011_1_infoSpecifiedByUpstreamApp
…
2017-02-08 16:04:29,671 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SPARK_TASK_application_1486594207699_0011_1_JId_0_SId_0_0_TId_1_0_infoSpecifiedByUpstreamApp
2017-02-08 16:04:29,672 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SPARK_TASK_application_1486594207699_0011_1_JId_0_SId_0_0_TId_0_0_infoSpecifiedByUpstreamApp
2017-02-08 16:04:29,672 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SPARK_TASK_application_1486594207699_0011_1_JId_0_SId_0_0_TId_2_0_infoSpecifiedByUpstreamApp
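Reading the records above, the context string appears to encode the component that issued the call (CLIENT, APPMASTER, or TASK), the application ID and attempt number, and for tasks the job ID (JId), stage ID and stage attempt (SId), and task ID and task attempt (TId), followed by the upstream application's own context string. This structure makes the audit log easy to filter; for example, to pull out every HDFS operation issued by this application (the audit log file name is illustrative and depends on your log configuration):

grep 'callerContext=SPARK_.*application_1486594207699_0011' hdfs-audit.log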