Member since
09-26-2015
135
Posts
85
Kudos Received
26
Solutions
About
Steve's a Hadoop committer, mostly working on cloud integration
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 3304 | 02-27-2018 04:47 PM |
| | 5815 | 03-03-2017 10:04 PM |
| | 3443 | 02-16-2017 10:18 AM |
| | 1816 | 01-20-2017 02:15 PM |
| | 11756 | 01-20-2017 02:02 PM |
11-07-2016
03:38 PM
All that appears to show is that while the Spark driver was waiting for the job to finish, it was interrupted. That generally means someone or something interrupted the process: a signal API call, a kill on the command line, or a Ctrl-C. There's no other information in that trace, I'm afraid.
11-07-2016
03:28 PM
You shouldn't be seeing this on HDP 2.5; everything needed to talk to S3A is already on Spark's classpath (we have done a lot of work on S3A performance for this release). Is the job actually failing, or is it just warning you that it couldn't create the S3A filesystem and carrying on?
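One way to check the classpath claim is to look for the AWS jars on it. A minimal sketch, where `filter_s3a` is a helper name I made up and the classpath string below is invented; on a real node you would feed it the output of `hadoop classpath`:

```shell
# Filter a Java classpath string down to its s3a/aws entries.
filter_s3a() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -i -e aws -e s3a
}

# example with a made-up classpath string:
filter_s3a "/usr/hdp/lib/hadoop-aws-2.7.3.jar:/usr/hdp/lib/guava.jar"
# → /usr/hdp/lib/hadoop-aws-2.7.3.jar

# on a cluster you would run: filter_s3a "$(hadoop classpath)"
```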
04-23-2016
12:24 PM
1 Kudo
Interesting. Spark 1.6 added a way to start plugin classes in the driver on YARN clusters; adding one to set up the Prometheus listener should be straightforward. Once implemented, all you'd need to do to start it is add it to the classpath and list the classname in the right Spark configuration option.
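As a sketch of the wiring, assuming the option in question is the Spark-on-YARN extension-service setting `spark.yarn.services` from that era of Spark, and with a purely hypothetical class and jar path:

```
# hypothetical Prometheus service class, registered via the
# Spark-on-YARN extension-service option; the jar path is a placeholder
spark.yarn.services          org.example.prometheus.PrometheusService
spark.driver.extraClassPath  /opt/prometheus/prometheus-service.jar
```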
03-19-2016
12:53 PM
2 Kudos
You can search for a file across the entire filesystem. This won't find libraries which have copied the contents of the JAR in (such as spark-assembly), but it will find self-contained Tachyon releases:

find / -name "tachyon*.jar" -print 2>/dev/null
03-18-2016
11:55 AM
1 Kudo
Have you got a tachyon JAR somewhere? That error is caused by an old version getting on the classpath; see: https://issues.apache.org/jira/browse/SPARK-8385?focusedCommentId=14643652&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14643652
03-15-2016
07:07 PM
1 Kudo
BTW, this is the URL, with the hostname:port of your server plugged in:

http://timelineserver:59587/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO

For an attempt, you ask for it underneath:

http://timelineserver:59587/ws/v1/timeline/spark_event_v01/attempt_0001_0001
03-15-2016
07:05 PM
1 Kudo
Your problems motivated me to write something on troubleshooting. It's for the latest version of the code, which currently only builds inside Spark 2.0-SNAPSHOT against Hadoop 2.8.0-SNAPSHOT, so not all the content is relevant (the publish-via-HDFS feature is new), but the URLs you need should be at the bottom.
03-15-2016
09:58 AM
3 Kudos
Every copy of Spark you start is "an application"; within the app are jobs and stages. At the YARN level there are only applications: those are what you see on the main page of the history server, and they are automatically switched from incomplete to complete when the application finishes, fails, or isn't found on the list of known applications. Within the app there are the jobs. If those are being considered incomplete, it means the history saved by the YARN app isn't complete; that's what that screenshot shows. The app may have finished, but the jobs inside aren't seen as such.
1. The app was generating events faster than it could post them. Try setting a bigger batch size for posting events: spark.hadoop.yarn.timeline.batch.size 500
2. The final set of events weren't posted when the application finished, because the application shut down first. Increase the wait time on shutdown: spark.hadoop.yarn.timeline.shutdown.waittime 60s
3. Something happened on event playback. That's something we can look at next: issue some curl commands directly against the ATS server to list all Spark apps and grab the entire history of one, which can then be played back to see what's inside.
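For step 3, a sketch of those curl queries; "atsserver:8188" is a placeholder host and port (8188 is the default ATS web port; your server may use another), and the attempt ID is invented:

```shell
# base URL of the YARN application timeline server REST API
ATS="http://atsserver:8188/ws/v1/timeline"

# list all Spark application entities
LIST_URL="$ATS/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO"

# grab the entire history of one application attempt
APP_URL="$ATS/spark_event_v01/attempt_0001_0001"

# run the queries; harmless no-ops if the host is unreachable
curl -s --max-time 5 "$LIST_URL" || true
curl -s --max-time 5 "$APP_URL" || true
```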
03-14-2016
04:46 PM
2 Kudos
There's some documentation on the options you can set, though that doc is for the next iteration of the timeline server, which is slightly different. I should write up a proper tutorial on the subject, plus slides. One thing you can do is set the Spark context option:

spark.history.yarn.diagnostics true

This adds more detail to the key-value list of items shown on the history page, with some status on "does the history server think the YARN ATS server is working".

What you've got looks very much like a YARN timeline server problem, not a Spark-side one. The test to verify this is simple; put this in your browser:

http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO

- lots of JSON: all is well
- 500+ error: bad

From the error:

16/03/14 11:25:04 WARN YarnHistoryProvider: Failed to list entities from http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/
java.io.IOException: Bad GET request: status code 500 against http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO; {"exception":"WebApplicationException","message":"java.io.IOException: org.iq80.leveldb.DBException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst: No such file or directo
at org.apache.spark.deploy.history.yarn.rest.JerseyBinding$.translateException(JerseyBinding.scala:208)
at org.apache.spark.deploy.history.yarn.rest.JerseyBinding$.translateException(JerseyBinding.scala:91)

This is not good: it's logging the response which came from YARN, and the file /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst isn't there. A quick search for "leveldb SST file not found YARN" turns up YARN-2873, related to work-preserving node manager restart; it indicates the problem is related to leveldb file loss.

1. I think it's wise to move that history store off /tmp: set the yarn-site property yarn.timeline-service.leveldb-timeline-store.path to something other than ${hadoop.tmp.dir}/yarn/timeline. Doing that and restarting the YARN application timeline service will give you a fresh start in the directories of persisted data.
2. I'm going to contact some of my colleagues who work on the YARN side of things to see if they can provide better insight into what's gone wrong. I fear they may not be able to come up with an answer other than "something happened in levelDB", but we can clearly improve how ATS handles this.
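A sketch of the yarn-site.xml change from point 1; the path here is a placeholder, not a recommendation — any durable local directory outside /tmp will do:

```xml
<!-- move the timeline leveldb store off /tmp so it survives tmp cleanup;
     /var/hadoop/yarn/timeline is a placeholder path -->
<property>
  <name>yarn.timeline-service.leveldb-timeline-store.path</name>
  <value>/var/hadoop/yarn/timeline</value>
</property>
```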
03-14-2016
03:30 PM
There's no WS-* code, hence no need for the WS-* stuff. OAuth? Maybe some time in the future. Note also: SASL, SPNEGO