
How to do logging in Spark Applications without using actions in logger statements?

Explorer

I am trying to capture logs for my application before and after a Spark transformation statement. Because transformations are lazily evaluated, the log statements are printed before the transformation is actually evaluated. Is there a way to capture logs without calling a Spark action in the log statements, so as to avoid unnecessary CPU consumption?

1 ACCEPTED SOLUTION


Spark's own logging code lives in its Logging trait, which lazily evaluates expressions like

logInfo(s"status $value") 

Sadly, that's private to the Spark code, so you can't use it from outside. See [SPARK-13928](https://issues.apache.org/jira/browse/SPARK-13928) for the discussion, and know that I don't really agree with the decision.

When I was moving some code from org.apache.spark to a different package, I ended up having to copy and paste the entire Spark logging class into my own code. Not ideal, but it works: CloudLogging.scala

Bear in mind that underneath, Spark uses SLF4J and whatever backs it, such as log4j; you can use SLF4J directly for its lazy evaluation of log.info("status {}", value). However, Spark's lazy string evaluation is easier to use, and I believe it is even lazy about evaluating functions inside the strings (e.g. s"users = ${users.count()}"), so it can be more efficient.
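To make the difference concrete, here is a minimal sketch in plain Scala with no Spark or SLF4J on the classpath. Everything in it (ByNameDemo, infoEager, infoLazy, expensiveCount) is a hypothetical stand-in: infoEager mimics the shape of a log.info("status {}", value) call, infoLazy mimics a by-name logInfo, and expensiveCount plays the role of something like users.count(). It assumes the INFO level is switched off.

```scala
// Hypothetical stand-in for a logger whose INFO level is switched off.
object ByNameDemo {
  val infoEnabled = false
  var evaluations = 0

  // Pretend this is an expensive call such as users.count().
  def expensiveCount(): Long = { evaluations += 1; 42L }

  // Ordinary parameter: the argument is computed before the method body runs.
  // Only the string formatting is skipped when the level is disabled.
  def infoEager(template: String, arg: Any): Unit =
    if (infoEnabled) println(template.replace("{}", String.valueOf(arg)))

  // By-name parameter: msg is only computed if the body actually references it.
  def infoLazy(msg: => String): Unit =
    if (infoEnabled) println(msg)

  def main(args: Array[String]): Unit = {
    infoEager("users = {}", expensiveCount())
    val afterEager = evaluations
    infoLazy(s"users = ${expensiveCount()}")
    val afterLazy = evaluations
    // prints "after eager call: 1, after by-name call: 1"
    println(s"after eager call: $afterEager, after by-name call: $afterLazy")
  }
}
```

The counter does not move on the second call: with INFO disabled, the interpolated string, and the expensive call inside it, is never evaluated.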

The CloudLogging class I've linked to shows how Spark binds to SLF4J; feel free to grab and use it.
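If you would rather not copy the whole class, a much smaller trait in the same spirit might look like the sketch below. This is not the CloudLogging code and not Spark's own trait; AppLogging and Example are hypothetical names, and it assumes slf4j-api is on the classpath (it normally is in a Spark application).

```scala
import org.slf4j.{Logger, LoggerFactory}

// Hypothetical trait: by-name log methods on top of an SLF4J logger.
trait AppLogging {
  // Transient and lazy, so the logger is created on first use and is not
  // dragged along if the enclosing object is ever serialized.
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  // msg is a by-name parameter: it is built only if the level is enabled.
  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logDebug(msg: => String): Unit =
    if (log.isDebugEnabled) log.debug(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isWarnEnabled) log.warn(msg)

  protected def logError(msg: => String, e: Throwable): Unit =
    if (log.isErrorEnabled) log.error(msg, e)
}

object Example extends AppLogging {
  def main(args: Array[String]): Unit = {
    val value = 42
    // The interpolated string is only constructed if INFO is enabled.
    logInfo(s"status $value")
  }
}
```

Note that this only makes the construction of the message lazy. It does not delay the log call itself until a Spark action runs.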


6 REPLIES

Hi Puneet: I'm not 100% certain I understand your question, but let me suggest:

If you have a DataFrame or RDD (resilient distributed dataset in memory), and you want to see the before/after state for a given transformation, you could run a relatively low-cost action like take() to print a few elements from your DataFrame, optionally after a sample() (which is itself a transformation, not an action). take() is relatively cheap because it only returns a few elements to the driver. Full documentation for DataFrame.take() is here:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

Excerpt here:

DataFrame class:
def take(n: Int): Array[Row]
Returns the first n rows in the DataFrame.
Running take requires moving data into the application's driver process, and doing so with a very large 'n' can crash the driver process with OutOfMemoryError.
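As a rough illustration of that suggestion, the sketch below previews a DataFrame before and after a filter. It assumes Spark 2.x or later (SparkSession) and a local master purely for demonstration; the column name n, the sample data and the object name TakePreview are made up. Each take() call is still an action and triggers its own job.

```scala
import org.apache.spark.sql.SparkSession

object TakePreview {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for illustration.
    val spark = SparkSession.builder()
      .appName("take-preview")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val before = Seq(1, 2, 3, 4, 5).toDF("n")
    // Transformation: nothing is computed at this point.
    val after = before.filter($"n" % 2 === 1)

    // take(2) is an action; it brings at most two rows back to the driver.
    println(s"before: ${before.take(2).mkString(", ")}")
    println(s"after:  ${after.take(2).mkString(", ")}")

    spark.stop()
  }
}
```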

Explorer

Thanks for the input. Yes, that is a solution, but as I mentioned I don't want to call any action. What I am hoping for is something like SparkContext.getLogger().info("message"), which would be lazily evaluated when the action is finally called.

Super Guru

As long as logging is on, a lot will show up in the history and in the logs.

For Spark Job setup:

sparkConf.set("spark.eventLog.enabled","true")

Then check the Spark History Server

You can also use old-fashioned Java logging:

import org.apache.log4j.{Level, Logger}

// Logger for your own application messages
val logger: Logger = Logger.getLogger("My.Example.Code.Rules")
// Quieten Spark's own loggers
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
// Keep INFO for the application logger
logger.setLevel(Level.INFO)

You can set the Spark loggers to INFO as well, but expect a lot of noise.

Explorer

Yes, relying on Spark's logs is a solution, but it takes away the freedom to log custom messages. What I am hoping for is something like SparkContext.getLogger().info("message"), which would be lazily evaluated when the action is finally called.

New Contributor

Hi Puneet - Were you able to solve this problem? I have a similar requirement but am not sure how to enable lazy evaluation for logging. I am trying to stay away from triggering actions like .first or .take because my files are huge.

I found this link - http://stackoverflow.com/questions/29208844/apache-spark-logging-within-scala. But it does not seem to work with my code.
