
Spark job stage cancelled because SparkContext was shut down

Contributor

Hi community,

I have a Spark job (running on YARN) that failed with the following error:

"stage cancelled because SparkContext was shut down"

After the job failed, slowness was noticed in the following jobs.

Do you have any idea what the reason could be?

How can I link the Spark job number to the YARN applicationId?

Where can I find the logs of the failed job?

Thank you

8 REPLIES

Contributor

I have new elements:

The job was killed by a developer because it had been running for 12 hours.

I found that the same task of the job keeps hanging (in RUNNING state until the whole job is killed), always on the same node.

The task sometimes hangs and other times succeeds.

What could make a task hang that way?

Thank you

Super Guru

@mohamed sabri marnaoui

Is it hanging or just waiting in the queue to run?

Contributor

Hello,

I'll check that on the next run.

Thank you

Guru
@mohamed sabri marnaoui

Can you paste the full stack trace and the code you are trying to run?

You can find the Spark job in the YARN Resource Manager UI. Go to Ambari -> Yarn -> Quick Links -> Resource Manager UI.

SparkContext can shut down for many different reasons, including code errors.
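To link a Spark job to its YARN applicationId (and find the logs of the failed job), here is a minimal sketch, assuming Spark 1.5+ on YARN; the object and app name are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object AppIdExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("app-id-example"))
    // On YARN this returns the ID the ResourceManager knows,
    // e.g. "application_1489000000000_0042". The same ID can be passed to
    // "yarn logs -applicationId <id>" to fetch the logs of a finished job.
    println(s"YARN applicationId = ${sc.applicationId}")
    sc.stop()
  }
}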

Master Guru

Check your YARN resources; I think the job does not have enough resources to run. Spark may have to wait for Zeppelin and other YARN apps to finish. Note: the Zeppelin YARN app keeps running.

How did you do the spark-submit? --master yarn is needed. Is this a Scala job?

Check out: https://community.hortonworks.com/content/idea/29810/spark-configuration-best-practices.html

Check the Spark logs, the YARN UI and the Spark History UI.

Contributor

Hi All

Thank you for your replies:

submit command:

spark-submit --master yarn-client \
  --properties-file ${MY_CONF_DIR}/prediction.properties \
  --driver-memory 6G \
  --executor-memory 10G \
  --num-executors 5 \
  --executor-cores 13 \
  --class com.comp.bdf.nat.applications.$1 \
  --jars ${MY_CLASSPATH} \
  ${MY_LIB_DIR}/prediction.jar $PHASE "$ARG_COMPL" "${PARAMETERS[@]}"

No YARN problems were detected. Here are some screenshots that I got from the Spark UI: as I mentioned before, the same stage (773) stays in RUNNING state, always on the same node.

Note that this node was recently added to the cluster; could it be a problem of mismatched versions?

[Screenshot: 8562-daky0.png]

org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
com.vsct.sncf.nat.outils.AppOutils$anonfun$sauverDf$1$anonfun$apply$1.apply(AppOutils.scala:506)
com.vsct.sncf.nat.outils.AppOutils$anonfun$sauverDf$1$anonfun$apply$1.apply(AppOutils.scala:498)
com.vsct.sncf.nat.outils.AppOutils$.remplacerDf(AppOutils.scala:483)
com.vsct.sncf.nat.applications.CreerPrediction$.lancer(CreerPrediction.scala:97)
com.vsct.sncf.nat.applications.ApplicationNAT.main(ApplicationNAT.scala:78)
com.vsct.sncf.nat.applications.CreerPrediction.main(CreerPrediction.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:731)
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

When I click on the link "save at AppOutils.scala:506":

[Screenshot: 8563-xcnib.png]

Master Guru

Sounds like a coding error. How are you ending your code? How big is the data? It seems Spark can't process all of it.

It could be an issue with your Parquet file; maybe try saving to another format such as ORC, Avro, JSON, or a Hive table (see the sketch below).
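For example, a minimal sketch of switching the save format, assuming the Spark 1.x DataFrame API from your stack trace (the paths and the DataFrame itself are hypothetical):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)                    // ORC requires a HiveContext on Spark 1.x
val df = sqlContext.read.parquet("/data/predictions")   // hypothetical input path

// Re-save the same data as ORC (or "json") to rule out a Parquet-specific
// problem; Avro would additionally need the spark-avro package.
df.write
  .mode("overwrite")
  .format("orc")
  .save("/data/predictions_orc")                        // hypothetical output path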

Can you post the save source code from the DataFrameWriter around AppOutils.scala:506?

Also try this deploy mode and allocate more CPU and more memory:

--master yarn --deploy-mode cluster

New Contributor

This problem has been happening on our side for many months as well, both with Spark 1 and Spark 2, and both while running jobs in the shell and in Python notebooks. It is very easy to reproduce: just open a notebook and let it run for a couple of hours, or do some simple DataFrame operations in an infinite loop.

There seems to be something fundamentally wrong with the timeout configurations in the core of Spark. We will open a case for that, as the problem persists no matter what configurations we have tried.
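For reference, a hypothetical repro sketch along those lines in spark-shell (which predefines sc and sqlContext on Spark 1.x; the sizes are made up); per the behavior described above, the SparkContext eventually shuts down on its own after a few hours:

var i = 0L
while (true) {
  val df = sqlContext.range(0, 1000000)            // trivial DataFrame work
  println(s"iteration $i, count = ${df.count()}")  // forces a job on each pass
  i += 1
}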