This article explains feasible and efficient ways to troubleshoot performance or perform root-cause analysis on any Spark streaming application, whose logs usually tend to grow past a gigabyte in size. It does not cover yarn-client mode, because yarn-cluster is the recommended mode for streaming applications, for reasons that are not discussed in this article.
Spark streaming applications usually run for long periods of time before facing issues that may cause them to be shut down. In other cases, the application is not shut down but suffers performance degradation during certain peak hours. In either case, the number and size of the logs keep growing over time, making them really difficult to analyze once they grow past a gigabyte.
It's well known that Spark, like many other applications, uses the log4j facility to handle logs for both the driver and the executors. It is therefore recommended to tune the log4j configuration file to use the rolling file appender, which creates a log file, rotates it when a size limit is reached, and keeps a number of backup logs as historical information that can later be used for analysis.
Updating the file in the Spark configuration directory is not recommended, as it has a cluster-wide effect. Instead, we can use it as a template to create our own log4j file, which is used only by our streaming application and does not affect other jobs.
As an example, in this article, a file is created from scratch to meet the following conditions:
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=100MB
log4j.appender.rolling.maxBackupIndex=10
log4j.appender.rolling.file=${}/${}-driver.log
log4j.appender.rolling.encoding=UTF-8${vm.logging.level}

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=100MB
log4j.appender.rolling.maxBackupIndex=10
log4j.appender.rolling.file=${}/${}-executor.log
log4j.appender.rolling.encoding=UTF-8${vm.logging.level}
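To get a feel for what maxFileSize=100MB and maxBackupIndex=10 mean for disk usage, the rough upper bound per appender can be worked out as the active file plus the retained backups. The sketch below is only an estimate: the function name is made up for illustration, and a log file can slightly exceed the size limit before it is rolled.

```shell
#!/bin/sh
# Rough upper bound, in MB, on the disk space one RollingFileAppender can use:
# the active log file plus maxBackupIndex rotated backups.
# max_log_footprint_mb is a hypothetical helper name, not part of Spark or log4j.
max_log_footprint_mb() {
    max_file_size_mb=$1   # value of log4j.appender.rolling.maxFileSize, in MB
    max_backup_index=$2   # value of log4j.appender.rolling.maxBackupIndex
    echo $(( max_file_size_mb * (max_backup_index + 1) ))
}

# Values taken from the properties shown in this article.
max_log_footprint_mb 100 10
```

Each driver or executor JVM configures its own appender, so this bound applies per process and not to the application as a whole.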
spark-submit --master yarn --deploy-mode cluster --num-executors 3 \
  --conf " \
  -Dvm.logging.level=DEBUG" \
  --conf " \
  -Dvm.logging.level=DEBUG" \
  --files key.conf,test.keytab,, \
  --jars spark-streaming_2.11-, \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:, org.apache.spark:spark-streaming_2.11: \
  --class org.apache.spark.examples.streaming.KafkaWordCount \
  /usr/hdp/ \
  node2.fqdn,node3.fqdn,node4.fqdn \
  my-consumer-group receiver 2 PLAINTEXTSASL

spark-submit --master yarn --deploy-mode cluster \
  --num-executors 3 \
  --files, \
  --conf " -Dvm.logging.level=DEBUG" \
  --conf " -Dvm.logging.level=DEBUG" \
  --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark2-client/examples/jars/spark-examples_*.jar 1000
After running the Spark streaming application, the following information will be listed in NodeManager nodes where an executor is launched:
This makes it easier to find and collect the necessary executor logs. The Resource Manager UI also lists the current log and any previous (backup) files.
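To collect those files from a NodeManager host, something along these lines may help. It is a minimal sketch: the function name is hypothetical, the directory argument must be whatever location YARN uses for container logs on your cluster, and it relies on RollingFileAppender naming its backups by appending .1, .2, and so on to the configured file name.

```shell
#!/bin/sh
# List the active executor log and its rotated backups below a directory.
# list_executor_logs is a hypothetical helper; the directory is supplied by
# the caller because the container log location depends on the cluster setup.
list_executor_logs() {
    log_dir=$1
    # Matches <name>-executor.log as well as <name>-executor.log.1, .2, ...
    find "$log_dir" -type f -name '*-executor.log*' | sort
}
```

For example, `list_executor_logs /path/to/container/logs` prints one file per line, which can then be passed to tar or copied off the node for analysis.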
Created on 05-28-2020 12:04 AM
In which directory should I put the files?
Created on 06-02-2020 07:22 AM
The files can be anywhere in your filesystem; just make sure to reference them from the right location in the --files argument:
--files key.conf,test.keytab,/path/to/,/path/to/
If you have a workspace in your home directory, the files can safely be located in your current path: when using the spark-submit client, --files looks for both in the current working directory (CWD) unless otherwise specified.
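Since relative entries are resolved against the directory you run spark-submit from, it can be worth checking the list before submitting. The helper below is a hypothetical sketch, not part of Spark: it prints every entry of a comma-separated --files value that does not exist locally and skips entries that carry a URI scheme.

```shell
#!/bin/sh
# Print each entry of a comma-separated --files list that is missing locally.
# missing_local_files is a hypothetical helper name. Relative paths are
# checked against the current working directory.
missing_local_files() {
    files_list=$1
    old_ifs=$IFS
    IFS=,
    for f in $files_list; do
        # Skip empty entries produced by consecutive commas.
        [ -n "$f" ] || continue
        case $f in
            # Entries with a scheme (for example hdfs://) are not local paths.
            *://*) continue ;;
        esac
        [ -e "$f" ] || echo "$f"
    done
    IFS=$old_ifs
}
```

Running `missing_local_files "key.conf,test.keytab"` from the workspace directory prints nothing when both files are present.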
Created on 07-15-2021 07:40 AM
For cluster mode, the file can also be put in an HDFS location and referenced from there in the --files argument of the spark-submit command.
--files hdfs://namenode:8020/