
Containers are killed after timeout



Hi all,

I'm using HDP 2.4.2 and I'm running into issues with simple tests like TestDFSIO and TeraGen.

If I execute the test with a high number of containers, some of them are killed after 300 seconds, according to the mapreduce.task.timeout property.

For example, running this command:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapreduce.job.queuename=ETL_APP -Dmapred.map.tasks=500 10000000000 /benchmarks/teragen_1T

results in 64 map tasks killed by timeout. In YARN I have 6.68 TB of RAM and 1330 vCores, and only a few other applications are running, so no containers of the TeraGen job are in a waiting state (the minimum container size is 2048 MB).
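
(For reference, the queue headroom and the running applications can be double-checked with something like the following while the job runs; ETL_APP is the queue from the command above:)

# rough sanity check that the queue has free capacity and nothing big is competing
yarn queue -status ETL_APP
yarn application -list -appStates RUNNING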

The strange thing is that, if I look at the Application Master UI, I can see that those containers are in state "RUNNING", but the status column shows "NEW", until they are killed by the NodeManager.

The attached file nm-container-id-container-e244-1515410092608-62086.txt contains the NodeManager log for one of those containers. You can see that the container request arrived at 2018-01-11 11:24:41,445 but the container was killed at 11:30:13,532, without anything else being logged in between. The next attempt ran fine within a few minutes.
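
(In case it helps to reproduce, the per-container log was pulled roughly like this after log aggregation; the application ID, container ID and node address below are placeholders, not the real values:)

# dump the aggregated NodeManager log for a single container
yarn logs -applicationId <applicationId> -containerId <containerId> -nodeAddress <nodeAddress> > nm-container.txt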

I then did another test, setting the property mapreduce.task.timeout to 600000 milliseconds, and all containers started before the timeout, so the problem is not the container itself but how long it takes to start.
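
(The second test was essentially the same command with the timeout overridden on the command line, something like the line below; the property could equally be set in mapred-site.xml:)

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapreduce.job.queuename=ETL_APP -Dmapreduce.task.timeout=600000 -Dmapred.map.tasks=500 10000000000 /benchmarks/teragen_1T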

Does anyone know why some containers take so long to start?

Thank you very much,

Davide
