Created 06-19-2017 07:55 AM
Hi
In Spark you can log events to a file, and the History Server later reads this information so it can be viewed in the browser. The events that are logged are listed here:
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.scheduler.SparkListener
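To check which event types a particular application's event log actually contains, the "Event" field of each line can be tallied. The following is only a minimal sketch: it assumes the log is uncompressed (spark.eventLog.compress left off) and stored as one JSON object per line, and count_event_types is a made-up helper name, not part of Spark.

```python
import json
from collections import Counter


def count_event_types(event_log_path):
    """Tally the "Event" field of every JSON line in a Spark event log.

    Assumes an uncompressed log with one JSON object per line.
    """
    counts = Counter()
    with open(event_log_path, encoding="utf-8") as log:
        for line in log:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            event = json.loads(line)
            # Lines without an "Event" key are grouped under a placeholder.
            counts[event.get("Event", "<missing>")] += 1
    return counts
```

Running count_event_types on a local copy of an application's event log and printing the result should list the SparkListener* event names it holds. Given the behaviour described in this thread, no streaming batch events would be expected among them.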
When using Spark Streaming, an additional tab becomes available in the Web UI, which gives information about the micro-batches, processing time, number of records, etc. Unfortunately, this information is not available in the logs, nor does the History Server display it when replaying the logs. I have not found any configuration to enable logging of the streaming events. How can I capture this information and, let's say, write it to a log file?
Thanks
Arsalan
Created 06-20-2017 05:13 AM
Hi @Arsalan Siddiqi,
This is the support KB article, which describes shipping the logs to HDFS so that the event log information is captured at the HDFS level:
1. Create the log directory in HDFS to store Spark job events, and set ownership and permissions. For example (create /spark/applicationHistory/ as any HDFS superuser):
# hdfs dfs -mkdir /spark
# hdfs dfs -mkdir /spark/applicationHistory
# hdfs dfs -chown -R spark:spark /spark
# hdfs dfs -chmod 1777 /spark/applicationHistory
2. Add the following properties to spark-defaults.conf by using Ambari (Ambari UI -> Spark -> Configs -> Custom spark-defaults -> Add Property):
spark.history.fs.logDirectory=hdfs://<namenode_host>:<namenode_port>/spark/applicationHistory
spark.eventLog.dir=hdfs://<namenode_host>:<namenode_port>/spark/applicationHistory
spark.eventLog.enabled=true
Or (if HDFS HA is enabled):
spark.history.fs.logDirectory=hdfs://<name_service_id>/spark/applicationHistory
spark.eventLog.dir=hdfs://<name_service_id>/spark/applicationHistory
spark.eventLog.enabled=true
3. Update spark.history.provider in spark-defaults.conf using Ambari (Ambari UI -> Spark -> Configs -> Advanced spark-defaults):
spark.history.provider = org.apache.spark.deploy.history.FsHistoryProvider
4. Restart Spark History Server.
Created 06-20-2017 07:59 AM
hi @bkosaraju
Thanks for the reply. I do have the History Server configured and running, and it captures events to the specified directory (also, I am not using HDP; I am using Spark standalone and Spark from IntelliJ). The issue is that the streaming events, i.e. the per-batch details, are not captured in the History Server.
As a workaround, I have overridden the streaming listener's onBatchSubmitted callback and added code to write to a log file.
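A listener of this kind could look roughly like the sketch below. It is written against PySpark purely for illustration, since the thread does not say which language the application uses. The helper names, the JSON-lines output format and the record field names are all assumptions. It hooks onBatchCompleted rather than onBatchSubmitted, on the assumption that processing delays are only meaningful once a batch has finished.

```python
import json


def batch_info_to_record(info):
    """Turn the batch info handed to a streaming listener into a plain dict.

    `info` is assumed to expose the accessor methods called below, as the
    object returned by batchCompleted.batchInfo() does in PySpark.
    """
    return {
        "batchTimeMs": info.batchTime().milliseconds(),
        "numRecords": info.numRecords(),
        "schedulingDelayMs": info.schedulingDelay(),
        "processingDelayMs": info.processingDelay(),
        "totalDelayMs": info.totalDelay(),
    }


def append_record(path, record):
    """Append one record to `path` as a single JSON line."""
    with open(path, "a", encoding="utf-8") as out:
        out.write(json.dumps(record) + "\n")


def make_batch_log_listener(path):
    """Build a listener that logs every completed batch to `path`."""
    # Imported lazily so the two helpers above work without PySpark installed.
    from pyspark.streaming.listener import StreamingListener

    class BatchLogListener(StreamingListener):
        def onBatchCompleted(self, batchCompleted):
            append_record(path, batch_info_to_record(batchCompleted.batchInfo()))

    return BatchLogListener()
```

It would be registered on the StreamingContext before starting it, for example ssc.addStreamingListener(make_batch_log_listener("/tmp/streaming-batches.jsonl")), where the output path is only a placeholder.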