Created 06-16-2016 04:04 PM
We are encountering a situation where a Storm worker dies during processing due to a sporadic problem on the node. Unfortunately, the worker process is repeatedly relaunched with the same topology on the same node where the problem occurred instead of being launched elsewhere. There are plenty of free slots, so we would expect the system to realise that the topology is failing there and to place the topology components elsewhere. Any suggestions on what to look at?
Created 06-17-2016 02:48 PM
When a worker dies, it is restarted by the supervisor on the same node. It is only reassigned to another machine when it fails repeatedly on startup and is unable to heartbeat to Nimbus.
http://storm.apache.org/releases/current/Fault-tolerance.html
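For reference, the timeouts that govern this behavior are ordinary storm.yaml settings. A minimal sketch of the relevant keys (these are standard Storm settings, but the values below are illustrative, not recommendations):

    # Seconds without task heartbeats before Nimbus reassigns the task
    nimbus.task.timeout.secs: 30
    # Seconds without worker heartbeats before the supervisor restarts the worker
    supervisor.worker.timeout.secs: 30
    # Seconds a newly launched worker gets to start up before it counts as failed
    supervisor.worker.start.timeout.secs: 120

As long as the worker keeps coming back up and heartbeating within these windows, Nimbus has no reason to move it.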
Created 06-17-2016 03:05 PM
Thanks for the response. I understand this behavior, but am wondering how to get out of the situation, as the worker simply sits there restarting. Is there anything that can be done to stop or delay the worker's heartbeats? It would seem the restarts happen fast enough to keep Nimbus from redirecting the worker to a different node where it could run properly.
Created 06-18-2016 12:27 PM
One way to stop that would be to shut down the supervisor on that node.
I am also wondering why you don't want your worker to be restarted on the same node.
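That said, if the node problem is detectable at startup, one workaround is to fail fast in prepare() so the worker dies while launching; per the fault-tolerance page above, it is repeated startup failures without heartbeats that get Nimbus to reassign the worker elsewhere. A rough Java sketch, where checkNodeHealth() is a hypothetical helper you would write for your own checks (free disk space, file integrity, local process dependencies):

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class FailFastBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            // Hypothetical node-local health check. Throwing here kills the
            // worker during startup instead of mid-processing, which is the
            // failure mode that can trigger reassignment.
            if (!checkNodeHealth()) {
                throw new RuntimeException("Node failed local health check; aborting worker startup");
            }
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            // ... normal processing ...
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // no output streams in this sketch
        }

        // Placeholder for whatever node-specific validation applies in your case.
        private boolean checkNodeHealth() {
            return true;
        }
    }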
Created 06-22-2016 12:02 AM
Thanks for the response. There are a number of reasons why you wouldn't want a restart on the same node. In our case, a file on the node was corrupted and repeatedly caused problems. You could also hit this issue if the node ran out of disk space or had other process dependencies that were failing for some reason.