
Dying storm worker restarting on same node

New Contributor

We are encountering a situation where a Storm worker dies during processing due to a sporadic problem on the node. Unfortunately, the worker process is repeatedly relaunched with the same topology on the same node where the problem occurred, instead of being launched elsewhere. There are plenty of free slots, so we would expect the system to realise that the topology is failing there and to place the topology components elsewhere. Any suggestions on what to look at?

4 REPLIES

Re: Dying storm worker restarting on same node

Explorer

When a worker dies, it is restarted by the supervisor on the same node. Only when the failures happen on startup and the worker is unable to heartbeat to Nimbus will it be reassigned to another machine.

http://storm.apache.org/releases/current/Fault-tolerance.html
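For reference, the behaviour described above is driven by a handful of timeouts in storm.yaml. A rough sketch of the relevant settings (the values shown are the usual defaults; verify them against your own cluster):

    supervisor.worker.timeout.secs: 30          # supervisor restarts a worker that stops heartbeating for this long
    supervisor.worker.start.timeout.secs: 120   # extra grace period for a worker that is still starting up
    nimbus.task.timeout.secs: 30                # Nimbus reassigns executors that stop heartbeating to it
    nimbus.supervisor.timeout.secs: 60          # Nimbus treats a supervisor as dead and moves its slots elsewhere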

Re: Dying storm worker restarting on same node

New Contributor

Thanks for the response. I understand this behavior, but am wondering how to get out of the situation, as the worker simply sits there restarting. Is there anything that can be done to stop or delay the heartbeat? The restart seems to happen fast enough to keep Nimbus from redirecting the worker to a different node where it could run properly.
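For what it's worth, one option we are considering (assuming the topology itself is fine) is forcing Nimbus to produce a new assignment from the CLI, although a rebalance alone may still put the worker back on the same node unless the supervisor there is stopped first. Something like the following, where the topology name and jar are placeholders:

    # ask Nimbus to reschedule after a 30-second wait
    storm rebalance my-topology -w 30

    # or kill and resubmit the topology entirely
    storm kill my-topology -w 30
    storm jar my-topology.jar com.example.MyTopology my-topology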


Re: Dying storm worker restarting on same node

Explorer

One way to stop that would be to shut down the supervisor on that node.

I am also wondering why you don't want your worker to be restarted on the same node.

  • Is there a problem with the node? In that case you shouldn't have a supervisor running on it.
  • Does the topology jar have all the necessary libraries and configuration files?
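If the node really is unhealthy, a rough sketch of taking it out of rotation so Nimbus reassigns its slots (the exact way the supervisor daemon is stopped depends on how Storm is managed on your cluster, e.g. Ambari on HDP vs. plain init/systemd scripts):

    # On the problem node, stop the supervisor however it is launched on your cluster
    # (Ambari UI/API, a systemd/init service, or by killing the "storm supervisor" process).
    kill <supervisor-pid>

    # After nimbus.supervisor.timeout.secs, Nimbus considers that node dead and
    # reassigns its workers to free slots on the remaining supervisors.
    # You can watch topology status from any Storm client node:
    storm list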

Re: Dying storm worker restarting on same node

New Contributor

Thanks for the response. There are a number of reasons why you wouldn't want a restart on the same node. In our case, a file was corrupted and was repeatedly causing problems. You could also hit the issue if you ran out of disk space or had other process dependencies that were failing for some reason.
