Spark UI thinks job is incomplete

Rising Star

For the past month or so, all Spark jobs are either not appearing in the Spark History UI or showing as incomplete. YARN is correctly reporting all jobs, but Spark claims there are more steps yet to be run.

A little background: At one point the logs started filling with errors from the Spark history service about a non-existent file. I ended up stopping the Spark history server and deleting everything in the directory it was yelling about, then restarting. I suspect I damaged something in the process and could use some advice on reinitializing the service.

17 REPLIES

Rising Star

More information: Around the time the history server stopped working correctly, a cascade of exceptions appeared in the spark logs:

2016-02-04 14:55:09,035 WARN timeline.TimelineDataManager (TimelineDataManager.java:doPostEntities(366)) - Skip the timeline entity: { id: tez_container_e07_1453990729709_0165_01_000043, type: TEZ_CONTAINER_ID }

org.apache.hadoop.yarn.exceptions.YarnException: The domain of the timeline entity { id: tez_container_e07_1453990729709_0165_01_000043, type: TEZ_CONTAINER_ID } is not allowed to be changed from Tez_ATS_application_1453990729709_0165 to Tez_ATS_application_1453990729709_0165_wanghai_20160204145330_86a58f3a-0891-4c24-bf0f-0375575077da:1

Does that shed any light on the underlying problem? The log contains > 50 MB of such messages.

The Spark History Server uses the Timeline Server under the covers. Can you check in Ambari whether ATS is running and working correctly? Other tools, such as the Tez view, also use the Timeline Server, so you could try one of those to see whether the Timeline Server is working in general.
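If you'd rather check from the command line, here's a quick probe of ATS; a sketch, assuming the default timeline port 8188, with <timeline-host> as a placeholder:

curl -i http://<timeline-host>:8188/ws/v1/timeline
# HTTP 200 with a small JSON "About" payload means ATS is up and answering;
# connection refused or a 500 points at the timeline service itself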

Rising Star

The Tez view appears to be working correctly. That endless cascade of exceptions from the history server must be pointing to something specific, but I unfortunately do not know how to interpret it. One of our users mentioned to me that the lingering jobs in the Spark UI all use a Python method called 'toPandas', while the few that do get properly marked as complete do not use it. Is that a useful clue?

The Spark "incomplete" history continues to pile up dozens of jobs that are reported on the console (and by YARN) as being finished.

@Steven Hirsch - let me look at this.

1. Which version of HDP, OS, etc. are you running?

2. Are there any logs in the Spark history server?

3. If you restart the spark history server, do the jobs come up as incomplete?

The Spark history server in Spark <= 1.6 doesn't detect updates to incomplete jobs once they've been clicked on and loaded (SPARK-7889; this will be fixed in Spark 2), but the UI listing complete/incomplete apps should work. If the YARN history server is used as the back end for Spark histories, then the code there will check with YARN to see if the application is still running. If the filesystem-based log mechanism is used, the Spark history server code doesn't ask YARN about application state. Instead it just plays back the file until it gets to the end: if there isn't a logged "application ended" event there, the app will languish as "incomplete" forever.
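For anyone on the filesystem-based provider, one quick way to check a specific application; a sketch, assuming uncompressed event logs under /spark-history (the actual directory is whatever spark.eventLog.dir points at):

# count "application ended" events in one app's event log;
# 0 means the history server will list the app as incomplete forever
hdfs dfs -cat /spark-history/<application_id> | grep -c SparkListenerApplicationEnd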

Rising Star

I'm starting to get concerned about this issue. We have run about 50 jobs in Spark that return results without any exceptional conditions, and which the YARN UI reports as complete. All of them are languishing in the Spark UI incomplete job listing with over 150 steps (it claims) left to go. The offending operation is either 'toPandas' or 'treeAggregate at GradientDescent.scala:189'. I do not see any sign that these processes are actually alive. Why are they not being reported as done?

Mentor

Can you look at the YARN container level, drill down to each job's tasks, and see what information you get? If this causes you a lot of trouble, consider opening a ticket with support.

Rising Star

Thanks, but we do not have a support agreement. We'll just have to live with it. I've provided all the information I have.

Mentor

Go through the Spark user guide for HDP 2.4; there are a lot of properties to review for the history server. Without a support contract, documentation is your best friend: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_spark-guide/content/ch_introduction-spark...

Rising Star

@stevel

Hi. We are using HDP-2.3.2.0-2950, with all nodes running Centos 6.7.

Not sure I know how to answer the question about logs. For starters, it's not easy to understand where these would be. If I assume the history server to be the machine that I connect to for the Spark History UI, and I assume that job-related logs would be under /tmp, then there's nothing relevant on that box. If I look on the namenode I can see /tmp/hadoop/yarn/timeline with populated subdirectories. Are those what you are referring to?

I restarted the history server and now things are utterly non-functional. The Spark History UI shows nothing under either complete or incomplete and displays an error:

Last Operation Failure: java.io.IOException: Bad GET request: status code 500 against http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO; {"exception":"WebApplicationException","message":"java.io.IOException: org.iq80.leveldb.DBException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst: No such file or directo

Indeed, there is no file by that particular name, but there are dozens of other .sst files present. What is causing it to look for that specific file and, further, why is it giving up completely after not finding it?

We are using the YARN history service as the backend.

FYI: After restarting the history server, I'm getting this in the daemon logs on the history server host:

spark-spark-orgapachesparkdeployhistoryhistoryserv.txt

It looks very unhappy. All of this had been working fine as recently as late January, and I have not (knowingly) made any changes whatsoever to the Spark history configuration.

Please let me know if you need any further information. I've looked through the Hortonworks courses on Hadoop management, but haven't seen any syllabus that claims to cover troubleshooting at a sufficiently low level. If that's not the case, can you advise which of them would provide enough background to be able to help in a case such as this?

There's some documentation on options you can set, though that doc is for the next iteration of the timeline server, which is slightly different. I should write up a proper tutorial and slides on the subject.

One thing you can do is set the spark context option:

spark.history.yarn.diagnostics true

This adds more detail to the key-value list of items shown on the history page, with some status on "does the history server think the Yarn ATS server is working".
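One way to set it, assuming the usual HDP config location and that your history server picks options up from spark-defaults.conf:

echo "spark.history.yarn.diagnostics true" >> /etc/spark/conf/spark-defaults.conf
# then restart the Spark history server for it to take effect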

What you've got looks very much like a YARN Timeline Server problem and not a Spark-side one. The test to verify this is simple: put this in your browser

http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO

- Lots of JSON: all is well.

- A 500 error: bad.
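The same test from the command line, using curl with -i so the status code is visible directly:

curl -i "http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO"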

From the error,

16/03/14 11:25:04 WARN YarnHistoryProvider: Failed to list entities from http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/
java.io.IOException: Bad GET request: status code 500 against http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO; {"exception":"WebApplicationException","message":"java.io.IOException: org.iq80.leveldb.DBException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst: No such file or directo
	at org.apache.spark.deploy.history.yarn.rest.JerseyBinding$.translateException(JerseyBinding.scala:208)
	at org.apache.spark.deploy.history.yarn.rest.JerseyBinding$.translateException(JerseyBinding.scala:91)

This is not good: it's logging the response which came from YARN; the file /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst isn't there.

A quick scan for "leveldb SST file not found YARN" turns up YARN-2873, related to work-preserving NodeManager restart; the comments there indicate that the problem is related to leveldb file loss.

1. I think it's wise to move that history store off /tmp; set the yarn-site property

yarn.timeline-service.leveldb-timeline-store.path

to something other than ${hadoop.tmp.dir}/yarn/timeline. Doing that and restarting the YARN Application Timeline Service will give you a fresh start in the directories of persisted data (see the sketch after this list).

2. I'm going to contact some of my colleagues who work on the YARN side of things to see if they can provide better insight into what's gone wrong. I fear they may not be able to come up with an answer other than "something happened in levelDB", but we can clearly improve how ATS handles this.
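A minimal sketch of the relocation in item 1, assuming /hadoop/yarn/timeline as the new location and that ATS runs as the yarn user (the property itself is set in Ambari or yarn-site.xml):

# stop the YARN Application Timeline Service first (e.g. in Ambari), then:
mkdir -p /hadoop/yarn/timeline
chown -R yarn:hadoop /hadoop/yarn/timeline
# point yarn.timeline-service.leveldb-timeline-store.path at the new directory
# and restart ATS; it will create a fresh leveldb store there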

Rising Star

@stevel

I am seeing that same 500 error when working directly from the browser. Moving the timeline storage path is not a problem. I've read some suggestions about moving it to HDFS, but I'm not sure what other ramifications that may have, so I'll stick with machine-local storage for now.

Not sure if you saw one of my earlier posts where I mentioned that the Spark daemon log is filling with errors whose meaning is not clear (see beginning of thread above). Perhaps that will go away when I relocate the log directory.

Rising Star

@stevel

I moved the yarn timeline directory to /hadoop/yarn and restarted. I'm no longer seeing the 500 error from the Spark History UI, but it continues to list completed Spark jobs as 'incomplete', telling me that there are hundreds of tasks remaining to be run. The YARN history UI does correctly report that the job is complete.

incomplete.png

The developer who owns the application tells me that it appears to be returning proper results.

Every copy of spark you start is "an application"; within the app are jobs and stages.

At the YARN level, there are only applications - those you see on the main page of the history server; those are the things which are automatically switched from incomplete to complete if the application finishes, fails, or isn't found on the list of known applications.
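You can see that application-level view straight from the YARN CLI, independent of whatever the Spark UI thinks:

# lists the applications YARN itself considers finished
yarn application -list -appStates FINISHED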

Within the app, there are the jobs. If these are being considered incomplete, then it means the history saved by the YARN app isn't complete. That's what that screenshot shows: the app may have finished, but the jobs inside aren't seen as such. Possible causes:

1. The app was generating events faster than it could post them. Try setting a bigger batch size for posting events (see the sketch after this list):

spark.hadoop.yarn.timeline.batch.size 500

2. The final set of events weren't posted when the application finished, because the application shut down first. Increase the wait time on shutdown.

spark.hadoop.yarn.timeline.shutdown.waittime 60s

3. Something happened on event playback. That's something we can look at next ... issue some curl commands directly against the ATS server to list all Spark apps and grab the entire history of one, which can then be played back to see what's inside.
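A sketch combining the settings from items 1 and 2, appended to spark-defaults.conf on the submitting machine (the path and the values are starting points, not tuned numbers):

cat >> /etc/spark/conf/spark-defaults.conf <<'EOF'
spark.hadoop.yarn.timeline.batch.size 500
spark.hadoop.yarn.timeline.shutdown.waittime 60s
EOF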

Rising Star

@stevel

I made those two changes and restarted Spark. A job submitted with '--master yarn-client' still behaves as before, with the history server not correctly tracking the job. A job submitted with '--master yarn-cluster' does get picked up as a completed job in history, but when I drill in there is absolutely no information available relative to the job. The environment tab is populated, but not with anything obviously job-specific. The 'executors' tab has the following:

executors.png

which is suspiciously devoid of any actual activity. 'Stages', 'Storage' and 'Jobs' are completely blank.

I understand in the abstract what you're asking for in terms of querying the ATS server, but it's going to take me some time to determine the required web-service calls and put that together. It's something I probably need to know about, but won't have the time to dig in for a day or so.

Thanks for your help to this point! I'll try to get the rest of the information later this week.

Your problems motivated me to write something on troubleshooting. This is for the latest version of the code, which currently only builds inside Spark 2.0-SNAPSHOT against Hadoop-2.8.0-SNAPSHOT, so not all the content is relevant (the publish-via-HDFS feature is new), but the URLs you need should be at the bottom.

BTW, this is the URL, with the hostname:port of your server plugged in:

http://timelineserver:59587/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO

For an attempt, you ask for it underneath:

http://timelineserver:59587/ws/v1/timeline/spark_event_v01/attempt_0001_0001
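For example, pulling one attempt's history with curl and pretty-printing it (python -m json.tool is just one formatter; jq works equally well):

curl "http://timelineserver:59587/ws/v1/timeline/spark_event_v01/attempt_0001_0001" | python -m json.tool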

Rising Star

@stevel

Fantastic! That's a great example of useful and practical documentation. I'll let you know what I turn up from making the REST calls.
