Support Questions


Facing Executor Lost issue while running my Spark job in yarn-cluster mode

New Contributor

I am facing an Executor Lost issue while running my Spark job in yarn-cluster mode, with the error below:

ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_e11_1475122993207_0126_01_000002 on host: "". Exit status: 137. Diagnostics: Container killed on request. Exit code is 137. Killed by external signal.

The memory parameters used while running the job are as follows:

--master yarn --deploy-mode cluster --executor-cores 4 --num-executors 3 --executor-memory 18G --driver-memory 3g --conf spark.yarn.executor.memoryOverhead=5120

This allocates up to 70 GB of the 78 GB of memory on my server for the job.
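For reference, a rough back-of-the-envelope sketch (my own arithmetic, not from the original post) of what YARN is asked to allocate under these flags; the exact figure also depends on yarn.scheduler.minimum-allocation-mb rounding and the driver's own memory overhead:

// Rough sizing of the YARN containers requested by the flags above (sketch only).
val executorMemoryGb   = 18   // --executor-memory 18G
val executorOverheadGb = 5    // spark.yarn.executor.memoryOverhead=5120 (MB)
val numExecutors       = 3    // --num-executors 3
val driverMemoryGb     = 3    // --driver-memory 3g

val perExecutorContainerGb = executorMemoryGb + executorOverheadGb      // 23 GB per executor container
val totalGb = numExecutors * perExecutorContainerGb + driverMemoryGb    // ~72 GB requested from YARN
println(s"per executor container: $perExecutorContainerGb GB, total requested: ~$totalGb GB")

That works out to roughly the 70 GB mentioned above (69 GB for the executor containers alone, plus the driver container).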

The YARN memory utilization reaches 90% while the job is running. For the executors, the memory limit observed in jvisualvm is approximately 19.3 GB. As soon as executor memory usage reaches about 16.1 GB, the executor lost issue starts occurring. The shuffle rate is also high.

This is a clear indication that the executor is lost because the OS is killing it for running out of memory: exit code 137 corresponds to 128 + 9 (SIGKILL), which typically means the kernel OOM killer or YARN's memory check terminated the container.

Can you please suggest what the possible reason for this behavior could be? This behavior is not observed every time the job is run; it only occurs occasionally.

How can I ensure that this issue won't occur again on subsequent runs of the job with the same configuration?


2 REPLIES

Contributor

Just checking: were you able to find the RCA (root cause analysis)? Thanks.

New Contributor

I faced a similar issue with the following error:

18/12/10 11:33:12 ERROR YarnScheduler: Lost executor 2 on server1: Container marked as failed: container_e06_1544075636158_0018_01_000003 on host: server1. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143. Killed by external signal

The code was working fine with the default partition count of 19, but when the partitionBy method was introduced and the partition count for the new HashPartitioner was increased to 38, it threw the above error, and the job kept running with no progress.
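For context, a minimal Scala sketch of the kind of change described here; the input path, key extraction, and object name are illustrative assumptions, not the poster's actual code:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object PartitionBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitionBy-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Illustrative key-value RDD; the original data source is not shown in the post.
    val pairs = sc.textFile("hdfs:///data/input")          // hypothetical path
      .map(line => (line.split(",")(0), line))

    // With the default number of partitions (19 in the post) the job ran fine;
    // adding an explicit partitionBy with a larger HashPartitioner (38) is the
    // change that coincided with the exit code 143 / executor lost errors.
    val repartitioned = pairs.partitionBy(new HashPartitioner(38))

    repartitioned.saveAsTextFile("hdfs:///data/output")    // hypothetical path
    spark.stop()
  }
}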