I'm using HDP 2.4.2 and I'm running into issues with simple tests like TestDFSIO and TeraGen.
If I execute the test with a high number of containers, some of them are killed after 300 seconds, in accordance with the mapreduce.task.timeout property.
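For context, this is the relevant cluster-wide setting in mapred-site.xml (the value shown is an assumption based on the 300-second kills I'm seeing; the stock Hadoop default is higher):

```xml
<!-- mapred-site.xml -->
<property>
  <!-- Milliseconds before a task that reports no progress is killed.
       300000 ms = 300 s, matching the timeout observed on this cluster. -->
  <name>mapreduce.task.timeout</name>
  <value>300000</value>
</property>
```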
For example, running this command:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapreduce.job.queuename=ETL_APP -Dmapred.map.tasks=500 10000000000 /benchmarks/teragen_1T
results in 64 map tasks being killed by timeout. On YARN I have 6.68 TB of RAM and 1330 vCores, and only a few other applications are running, so no containers of the TeraGen job are in a waiting state (the minimum container size is 2048 MB).
The strange thing is that, if I look at the Application Master UI, I can see those containers in state "RUNNING", but the status column shows "NEW" until they are killed by the NodeManager.
The attached file nm-container-id-container-e244-1515410092608-62086.txt contains the NodeManager log for one of those containers. You can see that the container request arrived at 2018-01-11 11:24:41,445, but the container was killed at 11:30:13,532 without logging anything else in between. The next attempt ran fine in a few minutes.
I then ran another test, setting mapreduce.task.timeout to 600000 milliseconds, and all containers started before the timeout. So the problem is not the container itself, but how long it takes to start.
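For reference, this is how I passed the higher timeout for the second test — the same TeraGen invocation as above, with only the per-job timeout override added:

```shell
# Same benchmark as before, but with mapreduce.task.timeout raised to
# 600000 ms (10 minutes) for this job only, overriding the cluster default.
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen \
  -Dmapreduce.job.queuename=ETL_APP \
  -Dmapreduce.task.timeout=600000 \
  -Dmapred.map.tasks=500 \
  10000000000 /benchmarks/teragen_1T
```

With this override every map task attempt started before the deadline, which is why I believe the slowness is in container startup rather than in the tasks themselves.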
Does anyone know why some containers take such a long time to start?