Created 07-13-2016 11:24 AM
I have a 3-node cluster and am trying to run a job.
I am running the following command to run the class file:
java -cp .:spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar:spark-csv_2.10-1.4.0.jar:commons-csv-1.1.jar SparkMainV4 "spark://xyz.abc.com:7077" "WD" "spark.executor.memory;6g,spark.shuffle.consolidateFile;false,spark.driver.memory;5g,spark.akka.frameSize;2047,spark.locality.wait;600,spark.network.timeout;600,spark.sql.shuffle.partitions;500"
but I am getting this error:
ERROR TaskSchedulerImpl: Lost executor 1 on xyz.abc.com: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 67 (saveAsTextFile at package.scala:179) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 36
    at org.apache.spark.MapOutputTracker$anonfun$org$apache$spark$MapOutputTracker$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
    at org.apache.spark.MapOutputTracker$anonfun$org$apache$spark$MapOutputTracker$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
    at scala.collection.TraversableLike$WithFilter$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$convertMapStatuses(MapOutputTracker.scala:538)
Created 07-13-2016 11:52 AM
This typically happens on long-running jobs over a large data set. As per the logs, an executor fails during a shuffle step and never reports its map output; in the reduce step that output can't be found where expected, and rather than successfully rerunning the failed stage, Spark gives up after the maximum number of attempts. Try reducing the parallelism to executors x cores.
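To make that concrete, here is a minimal sketch of capping parallelism at executors x cores through SparkConf. The executor and core counts are assumptions for illustration; substitute your actual numbers:

import org.apache.spark.{SparkConf, SparkContext}

// Assumed sizing: one executor per node on the 3-node cluster, 4 cores each.
val numExecutors = 3
val coresPerExecutor = 4
val targetParallelism = (numExecutors * coresPerExecutor).toString

val conf = new SparkConf()
  .setAppName("SparkMainV4")
  .set("spark.default.parallelism", targetParallelism)     // parallelism used for RDD shuffles
  .set("spark.sql.shuffle.partitions", targetParallelism)  // Spark SQL shuffle partitions (500 in the command above)

val sc = new SparkContext(conf)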
Created 07-13-2016 11:55 AM
We are not setting parallelism explicitly; could you please tell me where I can reduce the cores?
And this same code was working fine yesterday.
Created 07-13-2016 12:45 PM
Does your Spark job fail?
These messages can be caused by Spark dynamic allocation, possibly the release of an executor.
Maybe resources are not free on YARN and the containers time out.
Are there any other error messages in the log?
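If dynamic allocation turns out to be the cause, here is a minimal sketch of keeping executors (and their shuffle output) alive, assuming your job currently runs with dynamic allocation enabled; the timeout value is illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Option 1: turn dynamic allocation off so executors (and their shuffle output)
  // are never released mid-job.
  .set("spark.dynamicAllocation.enabled", "false")
  // Option 2 (instead of option 1): keep it on, hold idle executors longer, and let the
  // external shuffle service serve map output after an executor is released.
  // .set("spark.dynamicAllocation.executorIdleTimeout", "300s")
  // .set("spark.shuffle.service.enabled", "true")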
Created 07-13-2016 01:06 PM
Yes, the Spark job failed. We are trying to coalesce the output file, but we're getting the error.
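For reference, a minimal sketch of the coalesce-then-save pattern (the stack trace points at a saveAsTextFile call); the input/output paths and partition count are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SparkMainV4"))
val results = sc.textFile("hdfs:///data/input")   // hypothetical input path

// coalesce(1) funnels every record through a single task to produce one output file,
// which concentrates memory and shuffle pressure on one executor; a larger partition
// count (or a repartition) spreads that load across the cluster.
results.coalesce(1).saveAsTextFile("hdfs:///data/output")   // hypothetical output path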
Created 07-22-2016 06:56 AM
There is no error like a timeout, but I increased the RAM to 64 GB and it works now.
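In case it helps others hitting this, a minimal sketch of giving Spark more of that RAM, assuming the application builds its SparkConf from the settings string shown in the original command; the values are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "16g")   // raised from the 6g used in the original command; illustrative value

// The driver heap is set by the JVM that launches the class (e.g. java -Xmx8g -cp ... SparkMainV4,
// illustrative value); spark.driver.memory has no effect once the driver JVM is already running.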