Created 03-07-2016 09:23 PM
For the past month or so, all Spark jobs are either not appearing in the Spark History UI or showing as incomplete. YARN is correctly reporting all jobs, but Spark claims there are more steps yet to be run.
A little background: At one point the logs started filling with errors from the Spark history service about a non-existent file. I ended up stopping the Spark history server and deleting everything in the directory it was yelling about, then restarting. I suspect I damaged something in the process and could use some advice on reinitializing the service.
Created 03-07-2016 09:37 PM
More information: Around the time the history server stopped working correctly, a cascade of exceptions appeared in the spark logs:
2016-02-04 14:55:09,035 WARN timeline.TimelineDataManager (TimelineDataManager.java:doPostEntities(366)) - Skip the timeline entity: { id: tez_container_e07_1453990729709_0165_01_000043, type: TEZ_CONTAINER_ID }
org.apache.hadoop.yarn.exceptions.YarnException: The domain of the timeline entity { id: tez_container_e07_1453990729709_0165_01_000043, type: TEZ_CONTAINER_ID } is not allowed to be changed from Tez_ATS_application_1453990729709_0165 to Tez_ATS_application_1453990729709_0165_wanghai_20160204145330_86a58f3a-0891-4c24-bf0f-0375575077da:1
Does that shed any light on the underlying problem? The log contains > 50 MB of such messages.
Created 03-08-2016 12:29 PM
The Spark History Server uses the Timeline Server (ATS) under the covers. Can you check in Ambari whether ATS is running and working correctly? Other tools also use the Timeline Server, the Tez view for example, so you could try that one to see whether the Timeline Server is working in general.
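If you would rather check the Timeline Server directly than go through a view, you can also hit its REST endpoint. Below is a minimal Python sketch, not an official tool: it assumes the Timeline Server web service is on port 8188 under /ws/v1/timeline, and the host name in the usage note is a placeholder. The function names are made up for this example.

```python
import urllib.error
import urllib.request


def timeline_url(host, port=8188, entity_type=None):
    """Build a Timeline Server REST URL (port and path are assumptions)."""
    url = "http://{}:{}/ws/v1/timeline".format(host, port)
    if entity_type:
        url += "/" + entity_type
    return url


def probe(url, timeout=10):
    """Return (HTTP status or None, first 500 bytes of the body or the error text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read(500).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code, err.read(500).decode("utf-8", "replace")
    except (urllib.error.URLError, OSError) as err:
        # No usable answer at all: connection refused, DNS failure, timeout.
        return None, str(err)
```

To use it, call something like probe(timeline_url("your-ats-host")) and then probe(timeline_url("your-ats-host", entity_type="spark_event_v01")). A status of None means the service could not be reached; a 500 means it is up but failing internally, and the returned body text usually says why.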
Created 03-08-2016 02:17 PM
The Tez view appears to be working correctly. That endless cascade of exceptions from the history server must be pointing to something specific, but I unfortunately do not know how to interpret it. One of our users mentioned to me that the lingering jobs in the Spark UI are all using a Python method called 'toPandas', while the few that do get properly noted as complete do not. Is that a useful clue?
The Spark "incomplete" history continues to pile up dozens of jobs that are reported on the console (and by YARN) as being finished.
Created 03-14-2016 03:11 PM
@Steven Hirsch - let me look at this.
1. Which version of HDP, OS, etc.?
2. Are there any logs from the Spark history server?
3. If you restart the Spark history server, do the jobs come up as incomplete?
The Spark history server in Spark <= 1.6 doesn't detect updates to incomplete jobs once they've been clicked on and loaded (SPARK-7889; will be fixed in Spark 2), but the UI listing of complete/incomplete apps should work. If the YARN history server is used as the back end for Spark histories, then the code there will check with YARN to see if the application is still running. If the filesystem-based log mechanism is used, then the Spark history server code doesn't ask YARN about application state. Instead it just plays back the file until it gets to the end: if there isn't a logged "application ended" event there, it will languish as "incomplete" forever.
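For the filesystem-based case, a rough way to see why an application is stuck as incomplete is to look for the end event in its event log yourself. The sketch below is in Python and assumes the event log is in the usual one-JSON-object-per-line format with an "Event" field, where the end of an application is recorded as SparkListenerApplicationEnd. The function name and the sample lines are made up for illustration, and this is not the history server's own code.

```python
import json


def has_application_end(lines):
    """Return True if any line is a SparkListenerApplicationEnd event."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # A truncated or partially written line; skip it.
            continue
        if isinstance(event, dict) and event.get("Event") == "SparkListenerApplicationEnd":
            return True
    return False


# Hypothetical, heavily shortened sample logs for illustration only.
sample_complete = [
    '{"Event":"SparkListenerApplicationStart","App Name":"demo"}',
    '{"Event":"SparkListenerJobEnd","Job ID":0}',
    '{"Event":"SparkListenerApplicationEnd","Timestamp":1457450000000}',
]
sample_incomplete = sample_complete[:2]

print(has_application_end(sample_complete))
print(has_application_end(sample_incomplete))
```

With a real log you would pass an open file object instead of the sample lists. If the function returns False for an application that YARN reports as finished, the end event never made it into the log, which matches the "incomplete forever" behaviour described above.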
Created 03-08-2016 10:34 PM
I'm starting to get concerned about this issue. We have run about 50 jobs in Spark that return results without any exceptional conditions, and which the YARN UI reports as complete. All of them are languishing in the Spark UI incomplete job listing with over 150 steps (it claims) left to go. The offending operation is either 'toPandas' or 'treeAggregate at GradientDescent.scala:189'. I do not see any sign that these processes are actually alive. Why are they not being reported as done?
Created 03-08-2016 10:36 PM
Can you look at the YARN container level, drill down to each job's tasks, and see what information you get? If this causes you a lot of trouble, consider opening a ticket with support.
Created 03-09-2016 02:57 PM
Thanks, but we do not have a support agreement. We'll just have to live with it. I've provided all the information I have.
Created 03-09-2016 04:11 PM
Go through the Spark user guide for HDP 2.4; there are a lot of history server properties to review. Without a support contract, documentation is your best friend: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_spark-guide/content/ch_introduction-spark...
Created 03-14-2016 04:05 PM
Hi. We are using HDP-2.3.2.0-2950, with all nodes running CentOS 6.7.
Not sure I know how to answer the question about logs. For starters, it's not easy to understand where these would be. If I assume the history server to be the machine that I connect to for the Spark History UI, and I assume that job-related logs would be under /tmp, then there's nothing relevant on that box. If I look on the namenode I can see /tmp/hadoop/yarn/timeline with populated subdirectories. Are those what you are referring to?
I restarted the history server and now things are utterly non-functional. The Spark History UI shows nothing under either complete or incomplete and displays an error:
Last Operation Failure: java.io.IOException: Bad GET request: status code 500 against http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO; {"exception":"WebApplicationException","message":"java.io.IOException: org.iq80.leveldb.DBException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst: No such file or directo
Indeed, there is no file by that particular name, but there are dozens of other .sst files present. What is causing it to look for that specific file and, further, why is it giving up completely after not finding it?
We are using the YARN history service as the backend.
FYI: After restarting the history server, I'm getting this in the daemon logs on the history server host:
spark-spark-orgapachesparkdeployhistoryhistoryserv.txt
It looks very unhappy. All of this had been working fine as recently as late January, and I have not (knowingly) made any changes whatsoever to the Spark history configuration.
Please let me know if you need any further information. I've looked through the Hortonworks courses on Hadoop management, but haven't seen any syllabus that claims to cover troubleshooting at a sufficiently low level. If that's not the case, can you advise which of them would provide enough background to be able to help in a case such as this?