I am facing an executor-lost issue while running my Spark job in yarn-cluster mode, with the error below:
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_e11_1475122993207_0126_01_000002 on host: "". Exit status: 137. Diagnostics: Container killed on request. Exit code is 137. Killed by external signal.
The memory parameters used while running the job allocate up to 70 GB of the 78 GB of memory on my server.
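The exact parameters aren't shown here, but for illustration, a yarn-cluster submission of roughly the following shape would reserve about 70 GB. All values below are hypothetical, not the actual submitted configuration:

```scala
import org.apache.spark.SparkConf

// Hypothetical sizing: 6 executors x (10 GB heap + 1 GB overhead)
// + 4 GB driver = 70 GB reserved from YARN.
val conf = new SparkConf()
  .setMaster("yarn")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "6")
  .set("spark.executor.memory", "10g")
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB in Spark 1.x/2.x
  .set("spark.driver.memory", "4g")
```

Note that YARN accounts for heap plus overhead per container, so the total reserved can exceed the sum of the `spark.executor.memory` values alone.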
YARN memory utilization reaches 90% while the job runs. For the executors, the memory limit observed in jvisualvm is approximately 19.3 GB, and the executor-lost issue starts occurring as soon as executor memory usage reaches 16.1 GB. The shuffle rate is also high.
This clearly indicates that the executor is lost because it ran out of memory at the OS level: exit code 137 is 128 + 9 (SIGKILL), the signal delivered when the kernel's OOM killer or YARN's container memory monitor forcibly kills a process that has exceeded its memory limit.
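When a container is killed for exceeding its physical memory limit, one commonly adjusted knob is the off-heap overhead YARN grants on top of the executor heap, since shuffle-heavy jobs use significant off-heap memory. A minimal sketch, assuming Spark 1.x/2.x property names and hypothetical sizes:

```scala
import org.apache.spark.SparkConf

// Sketch: keep (heap + overhead) under the YARN container limit while
// leaving more room for off-heap allocations used during shuffle.
// Values are illustrative, not a recommendation for this specific cluster.
val conf = new SparkConf()
  .set("spark.executor.memory", "14g")              // slightly smaller heap
  .set("spark.yarn.executor.memoryOverhead", "3072") // raised from the ~10% default (MB)
```

The trade-off is that shrinking the heap while raising the overhead keeps the container's total footprint the same but shifts headroom toward the off-heap usage that typically triggers the 137 kill.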
Can you please suggest possible reasons for this behavior? It does not occur every time the job is run, only occasionally.
How can I ensure that this issue will not occur again on subsequent runs with the same configuration?
18/12/10 11:33:12 ERROR YarnScheduler: Lost executor 2 on server1: Container marked as failed: container_e06_1544075636158_0018_01_000003 on host: server1. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
Killed by external signal
The code was working fine with the default partition count of 19, but when the partitionBy method was introduced with a new HashPartitioner of 38 partitions, it threw the above error, and the job kept running without making progress.
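For reference, Spark's HashPartitioner assigns each key to a partition via a non-negative modulo of its `hashCode`, so changing the partition count from 19 to 38 re-maps every key and forces a full shuffle; if the key distribution is skewed, a few partitions can receive most of the data and blow past executor memory. A minimal plain-Scala sketch mirroring that logic (not Spark's actual class):

```scala
// Sketch of how Spark's HashPartitioner maps a key to a partition:
// partition = key.hashCode mod numPartitions, forced non-negative.
object PartitionSketch {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  def getPartition(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    // The same key lands in different partitions once the count changes,
    // so every record moves during the repartition.
    println(getPartition(97, 19)) // 97 % 19 = 2
    println(getPartition(97, 38)) // 97 % 38 = 21
  }
}
```

If many keys share a hash value, they all land in one partition regardless of how many partitions exist, which is why raising the count alone does not relieve skew.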