When I run short jobs, the container takes more time to load than the actual job.
For example, it sometimes takes over 60 seconds to start my process, because a new container is generated for each core in the computer.
Is it possible to configure the NodeManager not to kill the container, and to reuse it when the same CPU/RAM is requested?
I found out that a container is reused only while its job is still active, and in my case it is used between 3 and 10 times.
Can I force it to stay until its resources (CPU/RAM) are needed for a different CPU/RAM requirement by other jobs?
This indicates that your cluster might not have enough resources, or that you might be running some unwanted services on your cluster. Either increase the resources of your cluster nodes (for example RAM), or remove unwanted services from the cluster, so that the containers can start a bit faster.
The server node has 32 GB of RAM, and it only accepts spark-submit jobs (it does not act as a client/worker).
Each worker node is one of two server types:
16 cores with 64 GB, or 48 cores with 196 GB.
The worker nodes have only Metrics Monitor / NodeManager installed.
All the configuration is at the defaults.
When running a large job I don't mind the one-minute holdup, but a short job should finish in under 1 minute (for example, 500 jobs, each taking 30 seconds on one core, should finish in under 1 minute when there is enough CPU/RAM to allocate).
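To put rough numbers on that expectation, here is a minimal Python sketch. It assumes a simplified model that is not taken from YARN or Spark: every job uses exactly one core, jobs run in waves of `total_cores` at a time, and every job pays the same container startup delay. The function name and the core counts are hypothetical, since the thread does not say how many workers the cluster has.

```python
import math


def estimated_wall_time(num_jobs, job_seconds, total_cores, startup_seconds=0.0):
    """Estimate wall-clock seconds for one-core jobs that run in waves.

    Simplified model (an assumption): each wave runs up to total_cores jobs
    in parallel, and every job pays startup_seconds before its work begins.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    waves = math.ceil(num_jobs / total_cores)
    return waves * (job_seconds + startup_seconds)


# Hypothetical cluster sizes: 512 cores (enough for all 500 jobs at once)
# and 64 cores (one 16-core worker plus one 48-core worker).
for cores in (512, 64):
    ideal = estimated_wall_time(500, 30, cores)
    delayed = estimated_wall_time(500, 30, cores, startup_seconds=60)
    print(cores, ideal, delayed)
```

Under this model, 500 jobs of 30 seconds fit in a single 30-second wave only when at least 500 cores are free at once, and a 60-second startup delay per container turns that into 90 seconds, so the startup cost dominates the run time of short jobs.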
I think the problem is the delay before the job actually starts: I can see the process start (by running top in a shell on the worker) 30-60 seconds after the submit is received. During that time I see some Java tasks, mainly related to creating the container.