Created 12-20-2018 11:16 AM
I am running my Spark job on an EMR cluster with 6g executor memory, 5g driver memory, and 1g memory overhead.
But my task is failing with the error below while writing into HDFS using a Spark session. I am storing the file in ORC format with Snappy compression.
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 (TID 3170, "server_IP", executor 23): ExecutorLostFailure (executor 23 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.2 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Could you please give some suggestions?
Created 12-23-2018 06:34 PM
Consider boosting spark.yarn.executor.memoryOverhead so that the container limit rises from 6.6 GB to something above the 8.2 GB actually used, for example by adding "--conf spark.yarn.executor.memoryOverhead=10g" to the spark-submit command. You could also work around this by increasing the number of partitions (repartitioning) and the number of executors.
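For context, the 6.6 GB in the error is the container size YARN enforces: executor memory plus the memory overhead, which defaults to max(384 MiB, 10% of executor memory) when the overhead property is not set. A minimal sketch of that arithmetic (the exact limit also depends on YARN's minimum-allocation rounding, so treat this as approximate):

```python
# Sketch of how YARN arrives at the 6.6 GB limit in the error above,
# assuming Spark's default executor memory overhead formula:
# overhead = max(384 MiB, 10% of executor memory).

def default_overhead_mib(executor_memory_mib, factor=0.10, floor_mib=384):
    """Default memory overhead when the property is not set explicitly."""
    return max(floor_mib, int(executor_memory_mib * factor))

def yarn_container_limit_gb(executor_memory_gb):
    """Approximate physical-memory limit YARN enforces on the container."""
    executor_mib = executor_memory_gb * 1024
    total_mib = executor_mib + default_overhead_mib(executor_mib)
    return round(total_mib / 1024, 1)

# --executor-memory 6g  ->  6 GiB + 614 MiB overhead ~= 6.6 GB limit
print(yarn_container_limit_gb(6))  # 6.6
```

So the task is using about 8.2 GB while the container is only sized for roughly 6.6 GB, which is why YARN kills it.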
Created 12-24-2018 06:08 AM
Hi dbompart, thanks for your suggestion.
I have tried the Spark job with spark.yarn.executor.memoryOverhead=10g, but it still fails with the same issue:
ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.4 GB of 6.6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
I have also tried the workaround: I increased the partition count from 66 (what it was before repartitioning the DF) to 200 using repartition(). It still doesn't work, and it takes more time than before (66 partitions) because of the extra shuffle.
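One detail worth noting: the limit in the new error is still 6.6 GB, which is what the default overhead would give for a 6g executor. If the 10g overhead had taken effect, the container limit should have been around 16 GB. That suggests the setting may not actually be applied (for example, in Spark 2.3+ the property was renamed to spark.executor.memoryOverhead, so the old name may be ignored depending on version). A quick sanity check of the expected limits, assuming the limit is simply executor memory plus overhead:

```python
# Expected YARN container limit = executor memory + overhead (a sketch;
# actual rounding depends on the YARN minimum-allocation settings).

def container_limit_gb(executor_memory_gb, overhead_gb):
    return executor_memory_gb + overhead_gb

print(container_limit_gb(6, 10))   # ~16 GB: what overhead=10g should give
print(container_limit_gb(6, 0.6))  # ~6.6 GB: the limit still seen in the error
```

If the error keeps reporting 6.6 GB after setting a 10g overhead, it is worth checking in the Spark UI's Environment tab whether the property was actually picked up.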
Could you please help here...
Created 12-24-2018 06:44 PM
Sure, can you share your spark-submit command with the arguments as well? Mask any sensitive information please.
Created 12-25-2018 02:16 AM
spark-submit --master yarn --deploy-mode client --driver-memory 5g --executor-memory 6g --conf "spark.yarn.executor.memoryOverhead=10g" --class myclass myjar.jar param1 param1 param3 param4 param5
Created 12-26-2018 06:58 AM
Hi Mani, use --executor-memory 10g instead of 6g, and remove the memoryOverhead config property.
Created 12-27-2018 09:35 AM
Thank you for your help. That option didn't help either: when I ran the job with --executor-memory 10g, it failed with the same error, only the sizes changed (11.8 GB of 10 GB physical memory used).
spark-submit --master yarn --deploy-mode client --driver-memory 5g --executor-memory 10g --class myclass myjar.jar param1 param1 param3 param4 param5
So I tried with 15g of executor memory.
spark-submit --master yarn --deploy-mode client --driver-memory 5g --executor-memory 15g --class myclass myjar.jar param1 param1 param3 param4 param5
But tasks took much longer (a count that took 11 minutes with 10g of executor memory took 1.2 hours), and eventually a task failed with the error below.
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Slave lost
Created 12-27-2018 06:58 PM
Hi Mani, you might also want to increase the number of executors then, and you may be able to lower the memory size. Try with:
spark-submit --master yarn --deploy-mode client --driver-memory 5g --num-executors 6 --executor-memory 8g --class myclass myjar.jar param1 param1 param3 param4 param5
Tuning this requires a lot of other information: input data size, application use case, data source details, available cluster resources, etc. Keep tuning --num-executors, --executor-memory, and --executor-cores (5 cores per executor is usually a good number).
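As a rough starting point, executor counts and sizes can be derived from the node shape. The sketch below uses hypothetical node sizes (3 nodes, 32 GB / 16 cores each) purely for illustration; plug in your cluster's real numbers, and note that the reserved amounts for OS daemons and the driver/AM are assumptions, not fixed rules:

```python
# A rough sizing sketch for --num-executors / --executor-memory /
# --executor-cores, using hypothetical node sizes (adjust to your cluster).

def size_executors(node_mem_gb, node_cores, num_nodes,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Fit executors onto the cluster, reserving ~1 core and ~1 GB per node
    for OS daemons, and leaving headroom for the memory overhead."""
    usable_cores = node_cores - 1
    usable_mem_gb = node_mem_gb - 1
    executors_per_node = usable_cores // cores_per_executor
    total_executors = executors_per_node * num_nodes - 1  # -1 for driver/AM
    mem_per_executor = usable_mem_gb / max(executors_per_node, 1)
    # Keep ~10% of the container free for the off-heap overhead.
    executor_memory_gb = int(mem_per_executor * (1 - overhead_fraction))
    return total_executors, executor_memory_gb, cores_per_executor

# e.g. 3 nodes with 32 GB / 16 cores each (hypothetical):
print(size_executors(32, 16, 3))  # (8, 9, 5)
```

That would suggest something like --num-executors 8 --executor-memory 9g --executor-cores 5 for that hypothetical cluster; the real answer still depends on the data volume and skew.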
Created 01-07-2019 10:06 AM
Thanks dbompart, and sorry for the late reply.
I have tried different options (number of cores, number of executors, executor memory, overhead memory), but it is still the same issue.
When I repartition before the action, it takes more time, and shuffle read/write grows to 50 GB (actual data size 8.9 GB).
Will keep trying...
Created 10-11-2019 03:09 AM
Are you still getting the same error even after increasing the overhead memory? Could you please share the error messages after increasing the overhead / executor / driver memory?