Support Questions
How to do logging in Spark Applications without using actions in logger statements?

Contributor

I am trying to capture logs for my application before and after a Spark transformation statement. Because transformations are evaluated lazily, the log statements get printed before the transformation is actually evaluated. Is there a way to capture logs without calling a Spark action in the log statements, so as to avoid unnecessary CPU consumption?

1 ACCEPTED SOLUTION


The Spark logging code is Spark's Logging trait, which does lazy evaluation of expressions like

logInfo(s"status $value") 

Sadly, that trait is private to the Spark code, so you can't use it outside Spark itself. See [SPARK-13928](https://issues.apache.org/jira/browse/SPARK-13928) for the discussion, and know that I don't really agree with the decision.

When I was moving some code from org.apache.spark to a different package, I ended up having to copy and paste the entire Spark logging class into my own code. Not ideal, but it works: CloudLogging.scala

Bear in mind that underneath, Spark uses SLF4J and whatever backs it, such as Log4j; you can use SLF4J directly to get lazy evaluation of log.info("status {}", value). However, Spark's lazy string evaluation is easier to use and, I believe, is even lazy about evaluating functions inside the strings (e.g. s"users = ${users.count()}"), so it can be more efficient.

The CloudLogging class I've linked to shows how Spark binds to SLF4J; feel free to grab and use it.
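
If you can't pull in CloudLogging.scala, here is a minimal sketch of the same idea against the plain SLF4J API (the trait and object names are made up for illustration): the message is a by-name parameter, so the string, and any function calls interpolated into it, are only evaluated when the log level is enabled.

import org.slf4j.{Logger, LoggerFactory}

// A stand-in for Spark's private Logging trait: msg is a by-name parameter,
// so it is only evaluated when the corresponding log level is enabled.
trait LazyLogging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)
  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logDebug(msg: => String): Unit = if (log.isDebugEnabled) log.debug(msg)
}

object LazyLoggingExample extends LazyLogging {
  def main(args: Array[String]): Unit = {
    def expensive(): Int = { Thread.sleep(1000); 42 }
    // expensive() only runs if INFO is enabled for this logger
    logInfo(s"value = ${expensive()}")
  }
}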


6 REPLIES


Hi Puneet: I'm not 100% certain I understand your question, but let me suggest:

If you have a DataFrame or RDD (resilient distributed dataset) in memory and you want to see the before/after state for a given transformation, you could run a relatively low-cost action like take() or sample() to print a few elements from your DataFrame; these only return a few elements to the driver. Full documentation for DataFrame.take() is here:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

Excerpt here:

DataFrame class:
def take(n: Int): Array[Row]
Returns the first n rows in the DataFrame.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with an OutOfMemoryError.
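
To make that concrete, here is a rough sketch (the DataFrame contents and column names are made up; assumes the Spark 2.x SparkSession API) of using take() to peek at a few rows before and after a transformation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object TakeLoggingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("take-logging").getOrCreate()
    import spark.implicits._

    val before = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")
    // take(n) ships only n rows to the driver, but it is still an action
    // and triggers evaluation of everything up to this point.
    println(s"before: ${before.take(2).mkString(", ")}")

    val after = before.filter(col("value") > 1)
    println(s"after: ${after.take(2).mkString(", ")}")

    spark.stop()
  }
}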

Contributor

Thanks for the input. Yes, that is a solution, but as I mentioned, I don't want to call any action. What I am expecting is something like SparkContext.getLogger().info("message") that would be lazily evaluated when the action is finally called.

Master Guru

As long as logging is on, a lot will show up in the history server and in the logs.

For Spark Job setup:

sparkConf.set("spark.eventLog.enabled","true")

Then check the Spark History Server

You can also use old-fashioned Java logging:

import org.apache.log4j.{Level, Logger}

// Create a logger for your own code and quiet down Spark's internal logging
val logger: Logger = Logger.getLogger("My.Example.Code.Rules")
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
logger.setLevel(Level.INFO)

You can set it to INFO, but expect a lot of noise.
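
As a rough usage sketch (the DataFrame here is just an illustration), you can guard expensive messages with isInfoEnabled so the logger above does not pay for building the string, or for triggering a job, when INFO is switched off:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Log4jGuardExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("log4j-guard").getOrCreate()
    val logger: Logger = Logger.getLogger("My.Example.Code.Rules")
    logger.setLevel(Level.INFO)

    val df = spark.range(1000)
    // The guard keeps the count() action (and the string building) from
    // running at all when INFO logging is disabled for this logger.
    if (logger.isInfoEnabled) {
      logger.info(s"row count = ${df.count()}")
    }

    spark.stop()
  }
}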

Contributor

Yes, relying on Spark's own logs is a solution, but it takes away the freedom to log custom messages. What I am expecting is something like SparkContext.getLogger().info("message") that would be lazily evaluated when the action is finally called.

New Contributor

Hi Puneet - Were you able to solve this problem? I have a similar requirement but am not sure how to enable lazy evaluation for logging purposes. And I am trying to stay away from inducing actions like .first or .take, as my files are huge.

I found this link - http://stackoverflow.com/questions/29208844/apache-spark-logging-within-scala - but it does not seem to work with my code.
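
For what it's worth, one pattern that often comes up for this situation (a sketch of the general approach, not necessarily what the linked answer does) is to log from inside the transformation itself, obtaining the logger inside the closure so nothing needs to be serialized; the log lines then appear in the executor logs only when an action finally evaluates that transformation:

import org.apache.log4j.LogManager
import org.apache.spark.sql.SparkSession

object ExecutorSideLogging {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("executor-side-logging").getOrCreate()
    val data = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

    // Nothing is logged yet: mapPartitionsWithIndex is a transformation.
    val doubled = data.mapPartitionsWithIndex { (idx, it) =>
      val log = LogManager.getLogger("ExecutorSideLogging")
      log.info(s"processing partition $idx")
      it.map(_ * 2)
    }

    // The log statements above run here, on the executors, at action time.
    doubled.count()
    spark.stop()
  }
}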
