Created 05-13-2016 01:50 PM
I am trying to capture the logs for my application before and after a Spark transformation statement. Because Spark evaluates transformations lazily, the logs get printed before the transformation is actually evaluated. Is there a way to capture logs without calling any Spark action in the log statements, so as to avoid unnecessary CPU consumption?
Created 05-13-2016 06:34 PM
Hi Puneet: I'm not 100% certain I understand your question, but let me suggest:
If you have a DataFrame or RDD (resilient distributed dataset) and you want to see the before/after state for a given transformation, you could run a relatively low-cost action like take() or sample() to print a few elements from your DataFrame. These actions only return a few elements to the driver. Full documentation for DataFrame.take() is here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Excerpt here:
DataFrame class:
def take(n: Int): Array[Row]
Returns the first n rows in the DataFrame. Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
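For example, here is a minimal sketch of that approach (the DataFrame df and its "status" column are made up for illustration, not taken from the original post):

// Illustrative only: assumes an existing DataFrame named df with a "status" column.
df.take(5).foreach(println)                      // pulls at most 5 rows back to the driver

val filtered = df.filter("status = 'active'")    // a hypothetical transformation
filtered.take(5).foreach(println)                // again only a few rows, so the cost stays small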
Created 05-14-2016 03:15 PM
Thanks for the input. Yes, that is a solution, but as I mentioned, I don't want to call any action. What I am expecting is something like SparkContext.getLogger().info("message") that would be lazily evaluated only when an action is finally called.
Created 05-13-2016 06:45 PM
As long as logging is on, a lot will show up in the history and in the logs.
For Spark Job setup:
sparkConf.set("spark.eventLog.enabled","true")
Then check the Spark History Server
You can also use old-fashioned Java logging:
import org.apache.log4j.{Level, Logger}

val logger: Logger = Logger.getLogger("My.Example.Code.Rules")
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
logger.setLevel(Level.INFO)
You can set it to INFO, but expect a lot of junk.
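You can then log your own messages with that logger; just bear in mind they run eagerly, at the point the statement executes, not when the transformation is later evaluated (the message text and the df DataFrame below are made-up examples):

logger.info("Defining the filter transformation")   // logged immediately, before any action runs
val filtered = df.filter("status = 'active'")       // hypothetical transformation on an assumed df
logger.info("Transformation defined, not yet executed")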
Created 05-14-2016 03:17 PM
Yes, relying on Spark's own logs is a solution, but it takes away the freedom to log custom messages. What I am expecting is something like SparkContext.getLogger().info("message") that would be lazily evaluated only when an action is finally called.
Created 12-14-2016 02:39 AM
Hi Puneet - Were you able to solve this problem? I have a similar requirement but am not sure how to enable lazy evaluation for logging purposes. And I am trying to stay away from inducing actions like .first or .take, as my files are huge.
I found this link - http://stackoverflow.com/questions/29208844/apache-spark-logging-within-scala. But it doesn't seem to work with my code.
Created 12-14-2016 02:48 PM
The Spark logging code is Spark's Logging trait, which does lazy evaluation of expressions like
logInfo(s"status $value")
Sadly, that's private to the Spark code, so you can't use it from outside. See [SPARK-13928](https://issues.apache.org/jira/browse/SPARK-13928) for the discussion, and know that I don't really agree with the decision.
When I was moving some code from org.apache.spark to a different package, I ended up having to copy and paste the entire Spark logging class into my own code. Not ideal, but it works: CloudLogging.scala
Bear in mind that underneath, Spark uses SLF4J and whatever backs it, such as log4j; you can use SLF4J directly for its lazy evaluation of log.info("status {}", value). However, the Spark lazy string evaluation is easier to use, and I believe it is even lazy about evaluating functions inside the strings (e.g. s"users = ${users.count()}"), so it can be more efficient.
The CloudLogging class I've linked to shows how Spark binds to SLF4J; feel free to grab and use it.
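If you don't want to carry the whole class around, a minimal sketch of the same idea is a small trait that takes the message as a by-name parameter and only builds the string when the level is enabled. This is a hand-rolled approximation of what Spark's Logging trait does, not the trait itself:

import org.slf4j.{Logger, LoggerFactory}

// Sketch of a lazy-logging trait, loosely modelled on Spark's internal Logging trait.
trait LazyLogging {
  // Logger named after the concrete class that mixes this trait in.
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)

  // msg is a by-name parameter, so the interpolated string (and anything
  // expensive inside it) is only evaluated when INFO is actually enabled.
  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)

  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
}

// Usage: the count() inside the string is skipped entirely when INFO is disabled.
// class MyJob extends LazyLogging { ... logInfo(s"users = ${users.count()}") ... }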