Background
Hadoop and its ecosystem projects are distributed systems. Hadoop provides foundational services, such as a distributed file system and cluster-wide resource management, that many other projects (e.g., Oozie, MR jobs, Tez, Hive queries, Spark jobs) leverage.
With this layered approach, a Hive query or a Spark job leads to a sequence of operations that spans multiple systems. For example, a Spark job may leverage YARN to obtain containers to run in, and it may also read from HDFS.
Tracing and debugging problems in distributed systems is hard because it involves following calls across many systems.
Hadoop addresses this problem with a log-tracing feature called caller context (HDFS-9184 and YARN-4349). It helps users diagnose and understand how specific applications impact parts of the Hadoop system and what problems they may be creating (e.g., overloading the NameNode). As noted in HDFS-9184, for a given HDFS operation it is very helpful to track which upper-level job issued it. The upper-level callers may be specific Oozie tasks, MR jobs, Hive queries, or Spark jobs.
Hadoop ecosystem projects such as MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, HIVE-12254) and Pig (PIG-4714) have implemented their own caller contexts. These systems invoke HDFS/YARN APIs to set their caller contexts, and they also expose an API so that their callers can pass a caller context in.
We have now implemented the same ability in Spark: Spark provides its caller context by invoking the Hadoop APIs, and it exposes an API for Spark applications to set up their own caller contexts. Spark combines its own caller context with the caller context of its upstream application and writes the result into the YARN RM log and the HDFS audit log. This makes it possible to trace a Spark application down to the YARN and HDFS resources it uses.
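Under the hood, these caller contexts rest on Hadoop's org.apache.hadoop.ipc.CallerContext API (Spark invokes it via reflection, so it can still run against Hadoop versions that predate the feature). Below is a minimal sketch of the underlying API, assuming Hadoop 2.8+ where HDFS-9184/YARN-4349 are available; the context string here is made up for illustration:

import org.apache.hadoop.ipc.CallerContext

// The caller context is a per-thread string that Hadoop's RPC layer attaches
// to every HDFS/YARN call issued from that thread; the NameNode and the RM
// then record it in the HDFS audit log and the YARN RM log respectively.
val context = new CallerContext.Builder("SPARK_exampleCallerContext").build()
CallerContext.setCurrent(context)
// From here on, RPCs made by this thread carry the string above in their
// callerContext field.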
There are two ways to configure the `spark.log.callerContext` property:
In spark-defaults.conf:
spark.log.callerContext infoSpecifiedByUpstreamApp
or:
In the app's source code (works only in client mode):
val spark = SparkSession
  .builder
  .appName("SparkKMeans")
  .config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
  .getOrCreate()
When running in YARN cluster mode, the driver cannot pass `spark.log.callerContext` to the YARN client and the AM, because the YARN client and the AM have already started by the time the driver executes `.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")`. So for YARN cluster mode, configure the `spark.log.callerContext` property in spark-defaults.conf instead of in the app's source code.
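Since spark-submit applies `--conf` settings before the YARN client starts, passing the property on the command line should also take effect in both deploy modes, for example (remaining arguments as in the submit command in the next example):

./bin/spark-submit --conf spark.log.callerContext=infoSpecifiedByUpstreamApp ...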
The following example shows the command line used to submit a SparkKMeans application and the corresponding records in the YARN RM log and the HDFS audit log.
./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5
…
2017-02-08 16:04:14,856 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wyang IP=127.0.0.1 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1486594207699_0011 CALLERCONTEXT=SPARK_CLIENT_application_1486594207699_0011_infoSpecifiedByUpstreamApp
…
2017-02-08 16:04:14,806 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/.sparkStaging/application_1486594207699_0011/__spark_conf__.zip dst=null perm=null proto=rpc callerContext=SPARK_CLIENT_application_1486594207699_0011_infoSpecifiedByUpstreamApp
2017-02-08 16:04:14,807 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/.sparkStaging/application_1486594207699_0011/__spark_conf__.zip dst=null perm=null proto=rpc callerContext=SPARK_CLIENT_application_1486594207699_0011_infoSpecifiedByUpstreamApp
…
2017-02-08 16:04:22,725 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/spark-history dst=null perm=null proto=rpc callerContext=SPARK_APPMASTER_application_1486594207699_0011_1_infoSpecifiedByUpstreamApp
2017-02-08 16:04:22,742 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=create src=/spark-history/application_1486594207699_0011_1.lz4.inprogress dst=null perm=wyang:supergroup:rw-r--r-- proto=rpc callerContext=SPARK_APPMASTER_application_1486594207699_0011_1_infoSpecifiedByUpstreamApp
…
2017-02-08 16:04:29,671 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SPARK_TASK_application_1486594207699_0011_1_JId_0_SId_0_0_TId_1_0_infoSpecifiedByUpstreamApp
2017-02-08 16:04:29,672 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SPARK_TASK_application_1486594207699_0011_1_JId_0_SId_0_0_TId_0_0_infoSpecifiedByUpstreamApp
2017-02-08 16:04:29,672 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SPARK_TASK_application_1486594207699_0011_1_JId_0_SId_0_0_TId_2_0_infoSpecifiedByUpstreamApp
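Reading the records above, the context string appears to encode the component that issued the call (CLIENT, APPMASTER, or TASK), the application ID and attempt number, and for tasks the job ID (JId), stage ID and stage attempt (SId), and task ID and task attempt (TId), followed by the upstream application's own context string. This structure makes the audit log easy to filter; for example, to pull out every HDFS operation issued by this application (the audit log file name is illustrative and depends on your log configuration):

grep 'callerContext=SPARK_.*application_1486594207699_0011' hdfs-audit.log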