Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Killed tasks on crashed node don't have retry

Hi guys,

One of the NodeManagers crashed. When I looked at the application master, I saw that one task had been killed, but the application master didn't start another attempt in its place. I also see the same task, with the same ID, in a pending state. When I try to kill the task using mapred job -kill-task, the CLI reports that it was killed, but the UI still shows it as pending.

The mapreduce.task.timeout is 10 minutes, but the task was hanging for 30 minutes.

What am I missing? How can I force the task to start on another node without killing the whole job? I'm using CDH 5.5.4.

Cloudera Employee
Posts: 251
Registered: ‎01-16-2014

Re: Killed tasks on crashed node don't have retry

The fact that an NM crashes does not mean that the containers on the node also crash. In CDH 5.5 and later you also have work recovery on the NM turned on. That means that an NM can be restarted without the containers being taken down, and after a restart the NM will pick up the containers that are there. Status might not update until the NM is started again, because the container communicates with the NM for that.

Did you restart the NM after it crashed?

Can you also explain where you saw that the task was hanging?

Wilfred

Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Killed tasks on crashed node don't have retry

The NM crashed and we didn't bring it back up, since it needed on-site smart hands, so we decided to wait.
I can't restart the NM since the node is down and not reachable.

In the YARN UI, when I clicked the application master URL, which shows the status of the mappers and reducers, I saw that one mapper was in pending status, and that task was on the NM that crashed.

@Wilfred Suppose we run shutdown -h now on the whole server. How will the tasks that were running on that node's NM be managed, assuming the node remains down?

Is there a way to force these tasks to move to another NM without killing the whole job? As I stated, when I tried to kill this specific task, the CLI showed the task as killed, but in the application master I still saw a pending mapper.
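
For reference, the kill described here is issued against a task attempt ID. Below is a minimal Python sketch that only builds the command line; it assumes the usual attempt_<clusterTimestamp>_<jobSeq>_<m|r>_<taskSeq>_<attemptSeq> layout, and the ID shown is made up for illustration. The helper name kill_task_command is hypothetical. The mapred job documentation describes -kill-task as not counting against the failed-attempt limit and -fail-task as counting against it.

```python
import re
from typing import List

# Assumed layout of a MapReduce task attempt ID, e.g.
# attempt_1512990000000_0001_m_000003_0
ATTEMPT_ID = re.compile(r"^attempt_\d+_\d{4,}_[mr]_\d{6}_\d+$")


def kill_task_command(attempt_id: str, count_as_failure: bool = False) -> List[str]:
    """Build (but do not run) a 'mapred job' command for one task attempt."""
    if not ATTEMPT_ID.match(attempt_id):
        raise ValueError(f"not a task attempt ID: {attempt_id!r}")
    # -kill-task: attempt is not counted against the failure limit.
    # -fail-task: attempt is counted against the failure limit.
    flag = "-fail-task" if count_as_failure else "-kill-task"
    return ["mapred", "job", flag, attempt_id]


# Hypothetical attempt ID, for illustration only.
cmd = kill_task_command("attempt_1512990000000_0001_m_000003_0")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run on a gateway host that has the cluster's client configuration. Whether the AM then schedules a new attempt still depends on it noticing that the old container is gone.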

Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Killed tasks on crashed node don't have retry

Hi all,

Can someone please advise on the expected flow for containers on a crashed node? Should the containers start on another node, or should the whole application be killed?

Suppose I want to take the node down for a few hours. What are the right steps to guarantee that the containers will start on another node and not stay in a pending state until the node comes back?

Cloudera Employee
Posts: 251
Registered: ‎01-16-2014

Re: Killed tasks on crashed node don't have retry

If you kill the running container, i.e. the Java process, then the AM will time out the container, mark it as failed and start it on another node.

Wilfred

Cloudera Employee
Posts: 251
Registered: ‎01-16-2014

Re: Killed tasks on crashed node don't have retry

For normal maintenance: you decommission the node, or just the NM on the node, which removes it from the cluster and also makes sure the RM is updated and the containers are shut down.
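
On a cluster that is not managed through Cloudera Manager, decommissioning an NM usually comes down to adding the host to the file named by yarn.resourcemanager.nodes.exclude-path and asking the RM to re-read it. The Python sketch below illustrates those two steps under that assumption; the helper names, the file name and the hostname are all hypothetical, and on a CM-managed cluster the Decommission action in CM is the place to do this instead.

```python
import tempfile
from pathlib import Path
from typing import List


def add_to_exclude_file(exclude_path: Path, hostname: str) -> List[str]:
    """Append hostname to the exclude file unless it is already listed.

    Returns the list of hosts in the file after the update.
    """
    hosts = exclude_path.read_text().split() if exclude_path.exists() else []
    if hostname not in hosts:
        hosts.append(hostname)
        exclude_path.write_text("\n".join(hosts) + "\n")
    return hosts


def refresh_nodes_command() -> List[str]:
    """Command that makes the RM re-read its include/exclude files."""
    return ["yarn", "rmadmin", "-refreshNodes"]


if __name__ == "__main__":
    # Demo against a throwaway file; the real path is whatever
    # yarn.resourcemanager.nodes.exclude-path points to on the RM host.
    with tempfile.TemporaryDirectory() as tmp:
        exclude = Path(tmp) / "yarn.exclude"  # hypothetical file name
        print(add_to_exclude_file(exclude, "node07.example.com"))
        print(" ".join(refresh_nodes_command()))
```

The command is only built here, not executed; running it requires the yarn client and admin rights on the cluster.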

Wilfred

Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Killed tasks on crashed node don't have retry

How can I kill a process on a node that I cannot reach? The node is unreachable. The problem is that in the application master, under the task type, I see the task as pending, while under the attempt section I see it as killed. Why wasn't the task killed by the resource manager / application master after the 10-minute timeout?

Cloudera Employee
Posts: 251
Registered: ‎01-16-2014

Re: Killed tasks on crashed node don't have retry

If the node is completely off the network, the delay before the loss of the node and the container is noticed is far longer. We fixed that via HADOOP-11252. However, we did not turn that on by default: the timeout is infinite and we fall back to the TCP timeouts, which can be really long in these cases.
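
For reference, upstream HADOOP-11252 exposes this RPC client timeout as ipc.client.rpc-timeout.ms, where 0 means no timeout. The Python sketch below only renders a core-site.xml property fragment for it. The 60-second value is an arbitrary illustration and not a recommendation, the helper name hadoop_property is hypothetical, and the property name should be checked against the CDH release in use.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name: str, value: str) -> str:
    """Render a single Hadoop <property> element as an XML string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Assumption: 60 seconds is an example value only.
RPC_TIMEOUT_MS = 60 * 1000

fragment = hadoop_property("ipc.client.rpc-timeout.ms", str(RPC_TIMEOUT_MS))
print(fragment)
```

A finite value here bounds how long an RPC client waits on an unreachable peer, instead of leaving it to the operating system's TCP timeouts.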

Wilfred

Expert Contributor
Posts: 277
Registered: ‎01-25-2017

Re: Killed tasks on crashed node don't have retry

@Wilfred Hi Wilfred,

As you can see from the screenshots, the attempt was discovered as lost, and the elapsed time keeps increasing even though the attempt is killed.

You can also see from the tasks that there is no progress. When I try to run the kill command, it treats it as killing the attempt.

Screen Shot 2017-12-11 at 1.16.48 PM.png
Screen Shot 2017-12-11 at 1.17.11 PM.png
Screen Shot 2017-12-11 at 1.21.25 PM.png
Screen Shot 2017-12-11 at 1.21.37 PM.png
