Member since
09-26-2015
135
Posts
85
Kudos Received
26
Solutions
About
Steve's a Hadoop committer, mostly working on cloud integration
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 3304 | 02-27-2018 04:47 PM |
| | 5815 | 03-03-2017 10:04 PM |
| | 3443 | 02-16-2017 10:18 AM |
| | 1816 | 01-20-2017 02:15 PM |
| | 11756 | 01-20-2017 02:02 PM |
11-07-2016
03:38 PM
All that appears to show is that while the Spark driver was waiting for the job to finish, it was interrupted. That generally means someone or something interrupted the process: a signal API call, a kill on the command line, or a Ctrl-C. There's no other information in that trace, I'm afraid.
11-07-2016
03:28 PM
You shouldn't be seeing this on HDP 2.5; everything needed to talk to S3A is already on Spark's classpath (we have done a lot of work on S3A performance for this release). Is the job actually failing, or is it just warning you that it couldn't create the S3A filesystem and carrying on?
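One way to check the classpath claim is to look for the AWS jars on it. A minimal sketch, where `filter_s3a` is a helper name I made up and the classpath string below is invented; on a real node you would feed it the output of `hadoop classpath`:

```shell
# Filter a Java classpath string down to its s3a/aws entries.
filter_s3a() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -i -e aws -e s3a
}

# example with a made-up classpath string:
filter_s3a "/usr/hdp/lib/hadoop-aws-2.7.3.jar:/usr/hdp/lib/guava.jar"
# → /usr/hdp/lib/hadoop-aws-2.7.3.jar

# on a cluster you would run: filter_s3a "$(hadoop classpath)"
```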
04-23-2016
12:24 PM
1 Kudo
Interesting. Spark 1.6 added a way to start plugin classes in the driver on YARN clusters; adding one to set up the Prometheus listener should be straightforward. Once implemented, all you'd need to do to start it is add it to the classpath and list the classname in the right Spark configuration option.
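As a sketch of the wiring, assuming the option in question is the Spark-on-YARN extension-service setting `spark.yarn.services` from that era of Spark, and with a purely hypothetical class and jar path:

```
# hypothetical Prometheus service class, registered via the
# Spark-on-YARN extension-service option; the jar path is a placeholder
spark.yarn.services          org.example.prometheus.PrometheusService
spark.driver.extraClassPath  /opt/prometheus/prometheus-service.jar
```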
03-19-2016
12:53 PM
2 Kudos
You can search for a file across the entire filesystem. This won't find libraries which have copied the contents of the JAR in (such as spark-assembly), but it will find self-contained Tachyon releases:

find / -name "tachyon*.jar" -print 2>/dev/null
03-18-2016
11:55 AM
1 Kudo
Have you got a tachyon JAR somewhere? That error is caused by an old version getting on the classpath; see: https://issues.apache.org/jira/browse/SPARK-8385?focusedCommentId=14643652&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14643652
03-15-2016
07:07 PM
1 Kudo
BTW, this is the URL, with the hostname:port of your server plugged in:

http://timelineserver:59587/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO

For an attempt, you ask for it underneath:

http://timelineserver:59587/ws/v1/timeline/spark_event_v01/attempt_0001_0001
03-15-2016
07:05 PM
1 Kudo
Your problems motivated me to write something on troubleshooting. It's for the latest version of the code, which currently only builds inside Spark 2.0-SNAPSHOT against Hadoop 2.8.0-SNAPSHOT, so not all the content is relevant (the publish-via-HDFS feature is new), but the URLs you need should be at the bottom.
03-15-2016
09:58 AM
3 Kudos
Every copy of Spark you start is "an application"; within the app are jobs and stages. At the YARN level there are only applications: those are what you see on the main page of the history server, and they are automatically switched from incomplete to complete when the application finishes, fails, or isn't found on the list of known applications. Within the app there are the jobs. If those are being considered incomplete, it means the history saved by the YARN app isn't complete; that's what that screenshot shows. The app may have finished, but the jobs inside aren't seen as such.
1. The app was generating events faster than it could post them. Try setting a bigger batch size for posting events: spark.hadoop.yarn.timeline.batch.size 500
2. The final set of events weren't posted when the application finished, because the application shut down first. Increase the wait time on shutdown: spark.hadoop.yarn.timeline.shutdown.waittime 60s
3. Something happened on event playback. That's something we can look at next: issue some curl commands directly against the ATS server to list all Spark apps and grab the entire history of one, which can then be played back to see what's inside.
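For step 3, a sketch of those curl queries; "atsserver:8188" is a placeholder host and port (8188 is the default ATS web port; your server may use another), and the attempt ID is invented:

```shell
# base URL of the YARN application timeline server REST API
ATS="http://atsserver:8188/ws/v1/timeline"

# list all Spark application entities
LIST_URL="$ATS/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO"

# grab the entire history of one application attempt
APP_URL="$ATS/spark_event_v01/attempt_0001_0001"

# run the queries; harmless no-ops if the host is unreachable
curl -s --max-time 5 "$LIST_URL" || true
curl -s --max-time 5 "$APP_URL" || true
```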
03-14-2016
04:46 PM
2 Kudos
There's some documentation on the options you can set, though that doc is for the next iteration of the timeline server, which is slightly different. I should write up a proper tutorial on the subject, plus slides. One thing you can do is set the Spark context option:

spark.history.yarn.diagnostics true

This adds more detail to the key-value list of items shown on the history page, with some status on "does the history server think the YARN ATS server is working".

What you've got looks very much like a YARN timeline server problem, not a Spark-side one. The test to verify this is simple; put this in your browser:

http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO

- lots of JSON: all is well
- 500+ error: bad

From the error:

16/03/14 11:25:04 WARN YarnHistoryProvider: Failed to list entities from http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/
java.io.IOException: Bad GET request: status code 500 against http://bigfoot6.watson.ibm.com:8188/ws/v1/timeline/spark_event_v01?fields=PRIMARYFILTERS,OTHERINFO; {"exception":"WebApplicationException","message":"java.io.IOException: org.iq80.leveldb.DBException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst: No such file or directo
at org.apache.spark.deploy.history.yarn.rest.JerseyBinding$.translateException(JerseyBinding.scala:208)
at org.apache.spark.deploy.history.yarn.rest.JerseyBinding$.translateException(JerseyBinding.scala:91)

This is not good: it's logging the response which came from YARN, and the file /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/005567.sst isn't there. A quick search for "leveldb SST file not found YARN" turns up YARN-2873, related to work-preserving node manager restart; it indicates the problem is related to leveldb file loss.

1. I think it's wise to move that history store off /tmp: set the yarn-site property yarn.timeline-service.leveldb-timeline-store.path to something other than ${hadoop.tmp.dir}/yarn/timeline. Doing that and restarting the YARN application timeline service will give you a fresh start in the directories of persisted data.
2. I'm going to contact some of my colleagues who work on the YARN side of things to see if they can provide better insight into what's gone wrong. I fear they may not be able to come up with an answer other than "something happened in levelDB", but we can clearly improve how ATS handles this.
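A sketch of the yarn-site.xml change from point 1; the path here is a placeholder, not a recommendation — any durable local directory outside /tmp will do:

```xml
<!-- move the timeline leveldb store off /tmp so it survives tmp cleanup;
     /var/hadoop/yarn/timeline is a placeholder path -->
<property>
  <name>yarn.timeline-service.leveldb-timeline-store.path</name>
  <value>/var/hadoop/yarn/timeline</value>
</property>
```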
03-14-2016
03:30 PM
There's no WS-* code, hence no need for the WS-* stuff. OAuth? Maybe some time in the future. Note also: SASL, SPNEGO