Created 10-11-2016 10:20 AM
Hi community,
I have a Spark job (running on YARN) that failed with the following error:
"stage cancelled because SparkContext was shut down"
After the job failed, a slowdown was noticed on subsequent jobs.
Do you have any idea what the reason could be?
How can I link a Spark job number to the YARN application ID?
Where can I find the logs of the failed job?
Thank you
Created 10-11-2016 03:55 PM
I have some new information:
The job was killed by a developer because it had been running for 12 hours.
I found that the same task of the job hangs (stays in RUNNING state until the whole job is killed) on the same node.
The task sometimes hangs and other times succeeds.
What could make a task hang like that?
thank you
Created 10-11-2016 08:23 PM
Is it hanging or just waiting in the queue to run?
Created 10-12-2016 08:08 AM
Hello,
I'll check that on the next run.
Thank you
Created 10-12-2016 10:58 PM
Can you paste the full stack trace and the code you are trying to run?
You can find the Spark job's YARN application in the YARN Resource Manager UI. Go to Ambari -> YARN -> Quick Links -> Resource Manager UI.
The SparkContext can shut down for many different reasons, including code errors.
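To link the Spark job to YARN programmatically, a minimal sketch (assuming a spark-shell or driver where a SparkContext named sc already exists) of printing the YARN application ID so you can find the same application in the Resource Manager UI or pull its logs with the YARN CLI:

// Hedged sketch: print the YARN application ID from the driver.
// Assumes an existing SparkContext named sc (spark-shell or your main class).
val appId = sc.applicationId          // on YARN this is e.g. "application_<cluster-ts>_<nnnn>"
println(s"YARN application ID: $appId")
// The same ID can then be used on the command line: yarn logs -applicationId <appId>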
Created 10-13-2016 12:34 AM
Check your YARN resources; I suspect the job does not have enough resources to run. Spark may have to wait for Zeppelin and other YARN apps to finish. Note: the Zeppelin YARN app keeps running.
How did you do the spark-submit? --master yarn is needed. Is this a Scala job?
Check out: https://community.hortonworks.com/content/idea/29810/spark-configuration-best-practices.html
Check the Spark logs, the YARN UI, and the Spark History UI.
Created on 10-14-2016 08:56 AM - edited 08-19-2019 03:16 AM
Hi all,
Thank you for your replies.
Submit command:
spark-submit --master yarn-client --properties-file ${MY_CONF_DIR}/prediction.properties \
  --driver-memory 6G \
  --executor-memory 10G \
  --num-executors 5 \
  --executor-cores 13 \
  --class com.comp.bdf.nat.applications.$1 \
  --jars ${MY_CLASSPATH} \
  ${MY_LIB_DIR}/prediction.jar $PHASE "$ARG_COMPL" "${PARAMETERS[@]}"
No YARN problems were detected. Here are some screenshots that I got from the Spark UI: as I mentioned before, the same stage (773) stays in RUNNING state, and always on the same node.
Note that this node was recently added to the cluster; could it be a version mismatch problem?
When I click on the link "save at AppOutils.scala:506" I get the following stack trace:
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
com.vsct.sncf.nat.outils.AppOutils$anonfun$sauverDf$1$anonfun$apply$1.apply(AppOutils.scala:506)
com.vsct.sncf.nat.outils.AppOutils$anonfun$sauverDf$1$anonfun$apply$1.apply(AppOutils.scala:498)
com.vsct.sncf.nat.outils.AppOutils$.remplacerDf(AppOutils.scala:483)
com.vsct.sncf.nat.applications.CreerPrediction$.lancer(CreerPrediction.scala:97)
com.vsct.sncf.nat.applications.ApplicationNAT.main(ApplicationNAT.scala:78)
com.vsct.sncf.nat.applications.CreerPrediction.main(CreerPrediction.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:731)
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Created 10-14-2016 05:20 PM
Sounds like a coding error. How are you ending your code? How big is the data? It seems it can't process all of it.
It could be an issue with your Parquet file; maybe try saving to another format such as ORC, Avro, JSON, or a Hive table (see the sketch below).
Can you post the save source code from the DataFrameWriter call around AppOutils.scala:506?
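A minimal sketch of what switching the output format could look like (the DataFrame name df, the output path, and the use of overwrite mode are assumptions, since the code around AppOutils.scala:506 has not been posted):

// Hedged example only: swap the Parquet save for ORC to rule out a Parquet-specific issue.
// df, outputPath and the overwrite mode are assumptions for illustration.
df.write
  .mode("overwrite")                 // assumption: replace any previous output
  .format("orc")                     // instead of .format("parquet") / .parquet(...)
  .save(outputPath)

// Or, equivalently, with the dedicated helper available since Spark 1.5:
// df.write.mode("overwrite").orc(outputPath)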
Try this deploy mode and allocate more CPU and memory:
--master yarn --deploy-mode cluster
Created 04-06-2017 12:07 PM
This problem has been happening on our side for many months as well, both with Spark 1 and Spark 2, and both while running jobs in the shell and in Python notebooks. It is very easy to reproduce: just open a notebook and let it run for a couple of hours, or do some simple DataFrame operations in an infinite loop (a minimal sketch follows below).
There seems to be something fundamentally wrong with the timeout configurations in the core of Spark. We will open a case for that, as no matter what configurations we have tried, the problem persists.
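For reference, a minimal sketch of the kind of reproduction loop described above (written for a Spark 1.6-style spark-shell on YARN where sc and sqlContext already exist; the DataFrame size and sleep interval are arbitrary assumptions):

// Hedged reproduction sketch: run simple DataFrame operations in an infinite loop
// and wait for the "SparkContext was shut down" failure to appear after a few hours.
import org.apache.spark.sql.functions._

var iteration = 0L
while (true) {
  val df = sqlContext.range(0, 1000000)          // small synthetic DataFrame
  val rows = df.withColumn("bucket", col("id") % 10)
               .groupBy("bucket")
               .count()
               .collect()
  iteration += 1
  println(s"iteration $iteration -> ${rows.length} buckets")
  Thread.sleep(10000)                            // arbitrary pause between iterations
}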