container exit code 137


Container exited with a non-zero exit code 137

Killed by external signal

This error randomly kills Hive and Sqoop jobs.

Is there anyone here who is willing to help? I have been trying to get an answer, but with no luck so far.

As for logs: I have checked the container logs, the ResourceManager logs, and the service-specific logs, and there is really nothing that points to why this error is happening.

I am using m4.4xlarge instances from AWS.

yarn.nodemanager.resource.memory-mb: 50 GiB

Java Heap Size of ResourceManager in Bytes: 2 GB

yarn.scheduler.maximum-allocation-mb: 25 GB

Java Heap Size of NodeManager in Bytes: 2 GB

yarn.nodemanager.resource.cpu-vcores: 14

yarn.scheduler.maximum-allocation-vcores: 8

The yarn.nodemanager.resource.cpu-vcores and yarn.scheduler.maximum-allocation-vcores values are different because I have NodeManager groups, and some of the instances are m4.2xlarge, which have 8 vCPUs available for the NodeManager.

Therefore I have taken the minimum of the two for yarn.scheduler.maximum-allocation-vcores: 8.
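For reference, this is roughly how the values above map to yarn-site.xml (only a sketch of the relevant properties; the memory values are converted to MB, so 50 GiB becomes 51200 and 25 GB becomes 25600):

<configuration>
  <!-- NodeManager resources on the m4.4xlarge workers (values as listed above, in MB) -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>51200</value>   <!-- 50 GiB offered to containers per node -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>14</value>
  </property>
  <!-- Per-container ceilings enforced by the scheduler -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>25600</value>   <!-- 25 GB maximum per container -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>8</value>       <!-- minimum of the two node-group CPU counts -->
  </property>
</configuration>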

Please suggest if there is something off in my configuration. The error happens randomly, even when there are not many jobs running.

5 REPLIES

Super Collaborator

Exit code 137 indicates a resource issue; in most cases it is RAM. 137 corresponds to the process being killed with SIGKILL (128 + 9), which matches the "Killed by external signal" message.

You can try setting the yarn.scheduler.minimum-allocation-mb property to ensure a minimum amount of RAM is available before YARN starts the job.
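For example (the 2048 MB value is only illustrative; pick a minimum that matches your typical container sizes):

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>  <!-- illustrative: the RM rounds every container request up to at least 2 GB -->
</property>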

If this doesn't help, try dmesg to see the kernel messages, which should indicate why your job gets killed.
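For example, on the worker node that ran the failed container (the grep pattern just matches the usual strings the kernel OOM killer writes to the log):

# Look for OOM-killer activity in the kernel log (run on the worker node)
dmesg -T | grep -iE 'killed process|out of memory'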

https://github.com/moby/moby/issues/22211


@Harald Berghoff: Thank you for your response. I am stuck and really need some help here. I have checked dmesg and it has not recorded any killed processes. All our jobs are scheduled through Oozie, and we depend heavily on those scheduled jobs.

You mean RAM on the worker nodes, right? My worker nodes have 64 GB of RAM and I can see free memory on the nodes. From the ResourceManager I can see vCores getting used up before memory. The cluster has 225 GB of memory and 54 vCores, and the hosts are m4.4xlarge instances. I can share my YARN configuration if you would like.

Is there a way I can get some professional help with this? I am okay with paid support for the issue.

Super Collaborator

Yes, the resource shortage is on the worker machine where the container is executed. If you don't see OOM kills from the kernel on the worker machine (they would be reported via dmesg), there are of course other possible causes. It could be the JVM settings as well.
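To illustrate the JVM-settings point with the MapReduce case (Hive and Sqoop both launch MR containers here): the heap passed via the *.java.opts properties must leave headroom inside the container size, otherwise YARN can kill the container even though the node itself still has free RAM. The numbers below are only an example of a consistent pairing (heap around 80% of the container), not a recommendation for your cluster:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>        <!-- example container size requested for each map task -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3276m</value>   <!-- example heap, roughly 80% of the 4 GB container -->
</property>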

Do you know which jobs get killed? Always Hive jobs, or always Spark jobs?


I am not using Spark. Both Hive and Sqoop jobs were getting killed. I increased the number of attempts to 5, and the Sqoop jobs are fine now, but the Hive jobs are still getting stuck. Also, instead of the 137 error, all my NodeManagers are now running into an unexpected exit error. I can see about 181 timed-waiting threads in the ResourceManager, but JVM heap memory usage seems fine.
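For context, the retry increase was done on the Oozie actions. It looks roughly like this in the workflow definition (the action name, schema version, and parameters below are placeholders, not the exact file):

<!-- Sketch: Oozie action with automatic retries; names and values are placeholders -->
<action name="sqoop-import" retry-max="5" retry-interval="1">
  <sqoop xmlns="uri:oozie:sqoop-action:0.4">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <command>import --connect ${jdbcUrl} --table ${sourceTable} --target-dir ${targetDir}</command>
  </sqoop>
  <ok to="end"/>
  <error to="fail"/>
</action>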

New Contributor

Hi Harald, I am facing the same issue with Spark jobs, where executors are getting killed with exit status 137. Please let me know what the probable cause could be. I can't find any kill message in dmesg.