Created 05-25-2018 12:53 AM
Container exited with a non-zero exit code 137
Killed by external signal
This error randomly kills Hive and Sqoop jobs.
Is there anyone here who is willing to help? I have been trying to get an answer, but no luck so far.
As for logs: I have checked the container logs, the ResourceManager logs, and the service-specific logs, and there is really nothing that points out why this error would be happening.
I am using m4.4xlarge instances from AWS.
yarn.nodemanager.resource.memory-mb: 50 GiB
Java Heap Size of ResourceManager in Bytes: 2 GB
yarn.scheduler.maximum-allocation-mb: 25 GB
Java Heap Size of NodeManager in Bytes: 2 GB
yarn.nodemanager.resource.cpu-vcores: 14
yarn.scheduler.maximum-allocation-vcores: 8
The yarn.nodemanager.resource.cpu-vcores and yarn.scheduler.maximum-allocation-vcores values differ because I have NodeManager groups, and some of the instances are m4.2xlarge, which have 8 CPUs available for the NodeManager. Therefore I have taken the minimum of the two for yarn.scheduler.maximum-allocation-vcores: 8.
Please suggest if there is something off in my configuration. This error happens randomly, even when there are not a lot of jobs running.
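For reference, here is roughly what those settings correspond to as a yarn-site.xml snippet. The MB conversions are approximate, and if the cluster is managed through Cloudera Manager these values would normally be set there rather than in the file directly; this is just an illustration of the values above.

```xml
<configuration>
  <!-- Illustrative equivalent of the settings listed above (conversions approximate). -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>51200</value>   <!-- 50 GiB of memory per NodeManager -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>25600</value>   <!-- 25 GB ceiling for a single container -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>14</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>8</value>
  </property>
</configuration>
```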
Created 05-25-2018 05:29 AM
Exit code 137 indicates a resource issue; in most cases it is RAM (137 = 128 + 9, i.e. the process was terminated with SIGKILL).
You can try setting the yarn.scheduler.minimum-allocation-mb property to ensure a minimum amount of RAM is available before YARN starts the job.
If this doesn't help, try dmesg to see the kernel messages, which should indicate why your job gets killed.
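For illustration only, the property would look like this in yarn-site.xml; the 2048 MB value is just an assumed example, not a recommendation for your cluster:

```xml
<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <!-- Assumed example value: the smallest container YARN will hand out, in MB. -->
    <value>2048</value>
  </property>
</configuration>
```

When you look at dmesg, filtering for "Killed process" or "Out of memory" lines makes any kernel OOM kills easy to spot.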
Created 05-25-2018 01:22 PM
@Harald Berghoff: Thank you for your response. I feel like I am in deep trouble and really need some help here. I have checked dmesg and it has not recorded any killed processes. We have all our jobs scheduled through Oozie and we depend heavily on scheduled jobs. RAM on the worker nodes, right? My worker nodes have 64 GB RAM and I can see free memory on the nodes. From the ResourceManager I can see vCores getting used up before memory. The cluster has 225 GB of memory and 54 vCores. For the hosts I am using m4.4xlarge instances. I can share my YARN configuration if you would like. Is there a way I can get some professional help here? I am okay with paid support for this issue.
Created 05-26-2018 06:09 AM
Yes, the resource shortage is on the worker machine where the container is executed. If you don't have OOM kills from the kernel on the worker machine (they would be reported via dmesg), there are of course other possible causes; it could be the JVM settings as well.
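One common JVM-related pattern (a general sketch with assumed values, not something taken from your logs) is a task heap (-Xmx) set too close to, or above, the container size, so the process overruns the limit YARN enforces and gets killed. In mapred-site.xml the pairing looks roughly like this:

```xml
<configuration>
  <!-- Assumed example values: keep the JVM heap (-Xmx) around 75-80% of the
       container size so heap plus JVM overhead stays inside what YARN enforces. -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>        <!-- container size requested from YARN, in MB -->
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3276m</value>   <!-- task JVM heap, leaving headroom for off-heap memory -->
  </property>
</configuration>
```

The same ratio applies to the reduce-side properties and to whatever container sizes your Hive jobs request.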
Do you know which jobs get killed? Always Hive jobs, or always Spark jobs?
Created 05-28-2018 06:25 AM
I am not using Spark. Both Hive and Sqoop jobs were getting killed. I increased the number of attempts to 5 and the Sqoop jobs are fine now, but the Hive jobs are still getting stuck. Also, instead of the 137 error, all my NodeManagers are now running into an unexpected exit error. I can see about 181 timed-waiting threads in the ResourceManager, but JVM heap memory usage seems fine.
Created 03-24-2019 02:06 PM
Hi Harald, I am facing the same issue with Spark jobs, where executors are getting killed with exit status 137. Please let me know what the probable cause could be. I can't find any kill message in dmesg.