Support Questions

leo_adnan · ‎04-08-2016

When I submit a Spark job using the command below,

spark-submit --num-executors 10 --executor-cores 5 --executor-memory 2G --master yarn-cluster --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true --class com.example.SparkJob target/scala-2.10/spark-poc-assembly-0.1.jar 10.0.201.6 hdfs:///user/aahmed/example.csv

It prints the following messages on the console. I want to see the org.apache.spark INFO-level messages. How and where can I configure this?

16/04/08 15:09:50 INFO Client: Application report for application_1460098549233_0013 (state: RUNNING)

16/04/08 15:09:51 INFO Client: Application report for application_1460098549233_0013 (state: RUNNING)

16/04/08 15:09:52 INFO Client: Application report for application_1460098549233_0013 (state: RUNNING)

16/04/08 15:09:53 INFO Client: Application report for application_1460098549233_0013 (state: RUNNING)

16/04/08 15:09:54 INFO Client: Application report for application_1460098549233_0013 (state: RUNNING)

sball · ‎04-08-2016

To configure log levels, add

--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties" 
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties"

This assumes you have a file called log4j-spark.properties on the classpath (usually in the resources directory of the project you use to build the jar). This log4j configuration then controls the verbosity of Spark's logging.
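If the properties file is not bundled in the jar, one commonly used alternative is to ship it with spark-submit's --files option. On YARN this copies the file into each container's working directory, so the relative name in -Dlog4j.configuration can resolve for the executors and, in yarn-cluster mode, for the driver. A minimal sketch, assuming the file sits in the current directory (that local path is hypothetical; the class, jar, and arguments are the ones from the question). It only prints the command so it can be reviewed first:

```shell
# Hypothetical local path to the log4j file; adjust to your layout.
LOG4J_FILE="./log4j-spark.properties"
LOG4J_OPT="-Dlog4j.configuration=log4j-spark.properties"

# Build the argument list. --files asks YARN to copy the file into
# each container's working directory.
set -- \
  --master yarn-cluster \
  --files "$LOG4J_FILE" \
  --conf "spark.driver.extraJavaOptions=$LOG4J_OPT" \
  --conf "spark.executor.extraJavaOptions=$LOG4J_OPT" \
  --class com.example.SparkJob \
  target/scala-2.10/spark-poc-assembly-0.1.jar \
  10.0.201.6 hdfs:///user/aahmed/example.csv

# Dry run: print the command instead of executing it.
SUBMIT_CMD="spark-submit $*"
echo "$SUBMIT_CMD"
```

To run it for real, replace the final echo with spark-submit "$@".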

I usually use something derived from the spark default, with some customisation like:

# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n


# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR


# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.sql=WARN

# Logging for this application
log4j.logger.com.myproject=INFO

Something else to note here is that in yarn-cluster mode, all your important logs (especially the executor logs) are collected by YARN log aggregation (if it is enabled on the cluster) when the application finishes. You can fetch them with

yarn logs -applicationId <application>

This shows all the logs at the levels set in your configuration.
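The aggregated output can be large, so it may help to filter it by level. A small sketch, assuming the log lines follow the ConversionPattern above (date, time, then level); filter_level is a hypothetical helper name, not part of YARN:

```shell
# filter_level LEVEL: keep only stdin lines whose third field is LEVEL.
# Assumes the "yy/MM/dd HH:mm:ss LEVEL Logger: message" layout.
filter_level() {
  awk -v level="$1" '$3 == level'
}

# Demo on two sample lines (the second one is made up for illustration).
printf '%s\n' \
  '16/04/08 15:09:50 INFO Client: Application report for application_1460098549233_0013 (state: RUNNING)' \
  '16/04/08 15:09:51 WARN Example: made-up warning line' |
  filter_level INFO
```

On a cluster, the same helper could be used as: yarn logs -applicationId <application> | filter_level INFO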

vshukla · ‎04-08-2016

You can also switch to yarn-client mode to see more logs printed directly onto the console. Remember to switch back to yarn-cluster mode after you are done debugging.

schausson · ‎02-20-2017

Hi Simmon,

Thanks for the tip. I have an additional question: I guess this configuration works in "yarn-cluster" mode (when the driver and executors run under YARN's control on the cluster nodes)?

My problem is that I run spark-submit in "yarn-client" mode, so my driver is not managed by YARN, and the driver's logs go to the console of the server where I ran the "spark-submit" command. As this is a long-running job (several days), I would like to redirect the driver's logs to a dedicated file through the log4j configuration, but I couldn't get it to work with this configuration. Any idea how to achieve this?

Thanks again

sball · ‎02-20-2017

@Sebastian Carroll These options will work in both yarn-client and yarn-cluster mode. What you will need to do is ensure you have an appropriate file appender in the log4j configuration.
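As a starting point, a file appender for the driver could look like the sketch below, which writes a log4j 1.x properties file that uses a RollingFileAppender. The file name log4j-driver.properties, the output path /tmp/spark-driver.log, and the size limits are assumptions for illustration; the layout pattern is the one from the earlier answer.

```shell
# Write a hypothetical log4j 1.x config that logs to a rolling file.
cat > log4j-driver.properties <<'EOF'
# Root logger at WARN, sent to the rolling file appender
log4j.rootCategory=WARN, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
# Assumed path; pick a directory the driver user can write to
log4j.appender.file.File=/tmp/spark-driver.log
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Spark's own messages at INFO, as asked in the original question
log4j.logger.org.apache.spark=INFO
EOF
```

The driver can then be pointed at this file with -Dlog4j.configuration, using a file: URL if the file is not on the classpath.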

That said, if you have a job that runs for multiple days, you are far better off using yarn-cluster mode so that the driver is safely located on the cluster, rather than relying on a single node with a yarn-client hooked to it.

schausson · ‎02-20-2017

Well, I couldn't make it work...

I tried several options:

I successfully got the driver's logs into a dedicated log file by adding the following option to my "spark-submit" command line: --driver-java-options "-Dlog4j.configuration=file:///local/home/.../log4j.properties"

I couldn't get the same result with your suggestion: --conf "spark.driver.extraJavaOptions=...

For the executors' logs, I tried your suggestion as well: --conf "spark.executor.extraJavaOptions=... but I didn't notice any change in the logging. I guess this is a classpath issue, but I couldn't find any relevant example in the documentation 😞

If I use --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties", where should I put this log4j.properties file? In the "root" folder of the fat jar that I pass to the spark-submit command? Somewhere else?

Note that I also tried:

--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/home/.../log4j.properties" to point to an external file (not in the jar file), but that failed too...

Any idea what is wrong in my configuration?

Thanks for your help

schausson · ‎02-20-2017

Thanks a lot, I will try this configuration again in yarn-client mode.

To answer your question: yes, I would really prefer to use yarn-cluster mode for my job, but I can't make it work at the moment. To summarize: my job needs to access HBase for read/write operations, and my cluster is secured with Kerberos. When I launch my job in yarn-cluster mode, I hit an authentication problem when reaching HBase. I found a workaround in yarn-client mode, but I have no idea yet how to solve this issue in yarn-cluster mode.
