Created 03-07-2016 09:23 PM
For the past month or so, all Spark jobs are either not appearing in the Spark History UI or showing as incomplete. YARN is correctly reporting all jobs, but Spark claims there are more steps yet to be run.
A little background: At one point the logs started filling with errors from the Spark history service about a non-existent file. I ended up stopping the Spark history server and deleting everything in the directory it was yelling about, then restarting. I suspect I damaged something in the process and could use some advice on reinitializing the service.
Created 03-14-2016 04:46 PM
There's some documentation on options you can set, though that doc is for the next iteration of the timeline server, which is slightly different. I should write up a proper tutorial (and slides) on the subject.
One thing you can do is set the Spark context option:
spark.history.yarn.diagnostics true
This adds more detail to the key-value list of items shown on the history page, including some status on whether the history server thinks the YARN ATS server is working.
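For example, a minimal sketch assuming the history server picks up its settings from the spark-defaults.conf in its SPARK_CONF_DIR: add the line there and restart the history server.

# spark-defaults.conf used by the history server (sketch)
spark.history.yarn.diagnostics true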
What you've got looks very much like a YARN timeline server problem rather than a Spark-side one. The test to verify this is simple: put this in your browser
http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO;
- lots of JSON: all is well
- 500 error: bad
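If you would rather test from a shell, an equivalent check with curl (a sketch; substitute your own timeline server host and port) would be something like the following, where the first line of output shows the HTTP status:

curl -s -i "http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO"

A JSON body back means the timeline store is readable; an HTTP 500 means it isn't.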
From the error,
16/03/14 11:25:04 WARN YarnHistoryProvider: Failed to list entities from http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/
java.io.IOException: Bad GET request: status code 500 against http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO;
{"exception":"WebApplicationException","message":"java.io.IOException: org.iq80.leveldb.DBException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst: No such file or directo
  at org.apache.spark.deploy.history.yarn.rest.JerseyBinding$.translateException(JerseyBinding.scala:208)
  at org.apache.spark.deploy.history.yarn.rest.JerseyBinding$.translateException(JerseyBinding.scala:91)
This is not good: it's a log of the response which came back from YARN; the file /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst isn't there.
A quick search for "leveldb SST file not found YARN" turns up YARN-2873, related to work-preserving NodeManager restart; the comments there indicate that the problem is related to leveldb file loss.
1. I think it's wise to move that history store off /tmp; set the yarn-site property yarn.timeline-service.leveldb-timeline-store.path to something other than ${hadoop.tmp.dir}/yarn/timeline (see the yarn-site.xml sketch after this list). Trying that and restarting the YARN application timeline service will give you a fresh start, with a clean directory of persisted data.
2. I'm going to contact some of my colleagues who work on the YARN side of things to see if they can provide better insight into what's gone wrong. I fear they may not be able to come up with an answer beyond "something happened in leveldb", but we can clearly improve how ATS handles this.
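To illustrate point 1, here is a sketch of the yarn-site.xml change; /hadoop/yarn/timeline is only an example value, so pick any local directory that survives reboots and is writable by the yarn user:

<property>
  <name>yarn.timeline-service.leveldb-timeline-store.path</name>
  <value>/hadoop/yarn/timeline</value>
</property>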
Created 03-14-2016 06:13 PM
I am seeing that same 500 error when working directly from the browser. Moving the timeline storage path is not a problem. I've read some suggestions about moving it to HDFS, but I'm not sure what other ramifications that may have, so I'll stick with machine-local storage for now.
Not sure if you saw one of my earlier posts where I mentioned that the Spark daemon log is filling with errors whose meaning is not clear (see the beginning of this thread). Perhaps that will go away when I relocate the timeline store directory.
Created 03-14-2016 06:49 PM
I moved the YARN timeline directory to /hadoop/yarn and restarted. I'm no longer seeing the 500 error from the Spark History UI, but it continues to list completed Spark jobs as 'incomplete', telling me that there are hundreds of tasks remaining to be run. The YARN history UI does correctly report that the job is complete. (Screenshot attached: incomplete.png)
The developer who owns the application tells me that it appears to be returning proper results.
Created 03-15-2016 09:58 AM
Every copy of Spark you start is "an application"; within the app are jobs and stages.
At the YARN level, there are only applications; those are what you see on the main page of the history server, and they are the things which are automatically switched from incomplete to complete when the application finishes, fails, or drops off the list of known applications.
Within the app, there are the jobs. If these are being considered incomplete, then it means the history saved by the YARN app isn't complete; that's what that screenshot shows. The app may have finished, but the jobs inside aren't seen as finished. Some likely causes:
1. The app was generating events faster than it could post them. Try setting a bigger batch size for posting events (see the spark-defaults.conf sketch after this list):
spark.hadoop.yarn.timeline.batch.size 500
2. The final set of events weren't posted when the application finished, because the application shut down first. Increase the wait time on shutdown.
spark.hadoop.yarn.timeline.shutdown.waittime 60s
3. Something happened on event playback. That's something we can look at next: issue some curl commands directly against the ATS server to list all Spark apps and grab the entire history of one, which can then be played back to see what's inside.
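To illustrate points 1 and 2, a sketch of how those two settings might look in spark-defaults.conf (the values 500 and 60s are just the examples above; tune them to your workload):

spark.hadoop.yarn.timeline.batch.size        500
spark.hadoop.yarn.timeline.shutdown.waittime 60s

The same properties can equally be passed per job with spark-submit --conf key=value.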
Created 03-15-2016 04:59 PM
I made those two changes and restarted Spark. A job submitted with '--master yarn-client' still behaves as before, with the history server not correctly tracking the job. A job submitted with '--master yarn-cluster' does get picked up as a completed job in history, but when I drill in there is absolutely no information available about the job. The Environment tab is populated, but not with anything obviously job-specific. The Executors tab has the following (screenshot attached), which is suspiciously devoid of any actual activity. 'Stages', 'Storage' and 'Jobs' are completely blank.
I understand in the abstract what you're asking for in terms of querying the ATS server, but it's going to take me some time to determine the required web-service calls and put that together. It's something I probably need to know about, but won't have the time to dig in for a day or so.
Thanks for your help to this point! I'll try to get the rest of the information later this week.
Created 03-15-2016 07:05 PM
Your problems motivated me to write something on troubleshooting. This is for the latest version of the code, which currently only builds inside Spark 2.0-SNAPSHOT against Hadoop-2.8.0-SNAPSHOT, so not all the content is relevant (the publish-via-HDFS feature is new), but the URLs you need should be at the bottom.
Created 03-15-2016 07:07 PM
BTW, this is the URL, with the hostname:port of your server plugged in
http://timelineserver:59587/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO
For a specific attempt, you ask for it underneath that path:
http://timelineserver:59587/ws/v1/timeline/spark_event_v01/attempt_0001_0001
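If it's useful, here is a curl sketch of those two queries; substitute your real timeline server host:port, and replace attempt_0001_0001 with an attempt ID taken from the listing (piping through python -m json.tool just pretty-prints the response):

# list all Spark application entities known to the timeline server
curl -s "http://timelineserver:59587/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO" | python -m json.tool

# dump the full event history of one attempt
curl -s "http://timelineserver:59587/ws/v1/timeline/spark_event_v01/attempt_0001_0001" | python -m json.tool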
Created 03-16-2016 01:27 PM
Fantastic! That's a great example of useful and practical documentation. I'll let you know what I turn up from making the REST calls.