
Spark memory issue


I have a 3-node cluster and am trying to run a Spark job on it.

I am running the following command to launch the class file:

java -cp .:spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar:spark-csv_2.10-1.4.0.jar:commons-csv-1.1.jar SparkMainV4 "spark://xyz.abc.com:7077" "WD" "spark.executor.memory;6g,spark.shuffle.consolidateFile;false,spark.driver.memory;5g,spark.akka.frameSize;2047,spark.locality.wait;600,spark.network.timeout;600,spark.sql.shuffle.partitions;500"

but I am getting this error:

ERROR TaskSchedulerImpl: Lost executor 1 on xyz.abc.com: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 67 (saveAsTextFile at package.scala:179) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 36
at org.apache.spark.MapOutputTracker$anonfun$org$apache$spark$MapOutputTracker$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
at org.apache.spark.MapOutputTracker$anonfun$org$apache$spark$MapOutputTracker$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
at scala.collection.TraversableLike$WithFilter$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$convertMapStatuses(MapOutputTracker.scala:538)

5 REPLIES

Expert Contributor

This happens during long pauses in long-running jobs over a large data set. As the logs show, an executor fails during a shuffle step and never reports its output; in the reduce step that output cannot be found where expected, and rather than rerunning the failed execution, Spark aborts the job. Try reducing the parallelism to executors x cores.
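Something along these lines is what I mean (untested sketch against the Spark 1.6 API; the executor and core counts and the HDFS paths are placeholders, not values from your cluster):

import org.apache.spark.{SparkConf, SparkContext}

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val executors = 3        // placeholder: executors in the cluster
    val coresPerExecutor = 4 // placeholder: cores per executor
    val targetPartitions = executors * coresPerExecutor

    val conf = new SparkConf()
      .setAppName("ParallelismSketch")
      // cap default RDD parallelism and SQL shuffle partitions at executors x cores
      .set("spark.default.parallelism", targetPartitions.toString)
      .set("spark.sql.shuffle.partitions", targetPartitions.toString)

    val sc = new SparkContext(conf)
    // repartition before the shuffle-heavy stage so no single task holds too much data
    sc.textFile("hdfs:///path/to/input")
      .repartition(targetPartitions)
      .saveAsTextFile("hdfs:///path/to/output")
    sc.stop()
  }
}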


We are not setting parallelism explicitly; could you please tell me where I can reduce the cores?

And this same code was working fine yesterday.

Expert Contributor

Does your Spark job fail?

These messages can be caused by Spark dynamic allocation, possibly the release of an executor.

Maybe resources are not free on YARN and the containers time out.

Is there any other error message in the log?
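If dynamic allocation turns out to be the cause, the knobs involved look roughly like this (illustrative values only, not recommendations; check which of these are actually set in your environment):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // if executors are being released by dynamic allocation, pin them instead...
  .set("spark.dynamicAllocation.enabled", "false")
  // ...or keep idle executors (and their shuffle output) around longer
  // .set("spark.dynamicAllocation.executorIdleTimeout", "300s")
  // give slow or busy containers more time before they are treated as lost
  .set("spark.network.timeout", "600s")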


Yes, the Spark job failed. We are trying to coalesce the output files, but we keep getting the error above.
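The coalesce step is roughly along these lines (simplified sketch in the Spark 1.6 API; the paths and the partition count are placeholders, not our actual job):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CoalesceSketch"))

// coalesce(1) funnels every partition through a single task and can exhaust that
// executor's memory; a small-but-plural target keeps the final write parallel
sc.textFile("hdfs:///path/to/input")
  .coalesce(12) // placeholder target
  .saveAsTextFile("hdfs:///path/to/output")

sc.stop()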


There is no error like a timeout, but I increased the RAM to 64 GB and it works now.
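For anyone hitting the same thing, the change amounts to giving Spark more memory to work with, e.g. (the split below is just an example, not the exact values we ended up with):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "24g") // example: more heap per executor
  .set("spark.driver.memory", "8g")    // example: more heap for the driver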