Created 08-30-2016 12:07 AM
Say there were 3 extra slots in the cluster and a supervisor with 4 slots (supervisor.slots.ports) goes down. What happens? Does Storm automatically increase the number of executor threads in the worker processes on the other supervisors?
Created 08-30-2016 12:19 PM
Before I answer the question specifically, let me address (based on my research) the fault tolerance for each of the components within Storm:
1) Nimbus - A stateless daemon that sits on the master node, deploys the job (topology), and keeps track of it. There are two scenarios. First, if Nimbus goes down after you submit the topology, it has no adverse effect on the running topology, because the topology runs on the worker nodes (not on the master node). If you have kept the Nimbus process under supervision it will be restarted, and since it is fail-fast it will retrieve the metadata of all the active topologies on the cluster from Zookeeper and start tracking them again. Second, if Nimbus goes down before you submit the topology, you will simply have to restart it.
2) Supervisor - This daemon is responsible for keeping track of the worker processes (JVM processes) on its node and for coordinating their state with Nimbus through Zookeeper. If this daemon goes down, your worker processes are not affected and will keep running unless they crash themselves. Once the supervisor comes back (thanks to supervisord or monit), it will collect the state from Zookeeper and resume tracking the worker processes. If instead a timeout occurs, Nimbus will reschedule that topology's work on a different worker node.
3) Worker processes (JVM processes) - These container processes actually execute your topology's components (spouts + bolts); see the short topology sketch after this list. If one goes down, the supervisor will simply restart it on a different port (on the same worker node), and if it runs out of ports it will notify Nimbus, which will reschedule the process on a different worker node.
4) Worker node (supervisor + worker processes) - In this scenario Nimbus stops receiving heartbeats from the worker node (due to the timeout) and simply reassigns its work to other worker node(s) in the cluster.
5) Zookeeper (ZK) - From all of the above you may have inferred that all of this state is stored in ZK. What if it goes down, or can it even go down? ZK is not a single-node process either: it runs as its own cluster and the state stored in it is constantly replicated, so even if a single ZK node goes down, a new leader is elected and Apache Storm keeps communicating with the ensemble.
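To make the terminology in point 3 concrete, here is a minimal, hypothetical topology in Java (assuming Storm 1.x package names and using the built-in TestWordSpout purely for illustration). The parallelism hints set the number of executor threads, while setNumWorkers sets how many worker processes Nimbus will try to place into free supervisor slots (supervisor.slots.ports):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class ExampleTopology {

    // Trivial bolt, only for illustration: prints each word it receives.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getString(0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // emits nothing downstream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // 2 executor threads for the spout, 4 for the bolt
        builder.setSpout("word-spout", new TestWordSpout(), 2);
        builder.setBolt("print-bolt", new PrintBolt(), 4)
               .shuffleGrouping("word-spout");

        Config conf = new Config();
        // Request 4 worker processes (JVMs); Nimbus places them into free
        // supervisor slots (supervisor.slots.ports) across the cluster.
        conf.setNumWorkers(4);

        StormSubmitter.submitTopology("example-topology", conf, builder.createTopology());
    }
}
```

With this (hypothetical) configuration the topology's 6 executors (2 spout + 4 bolt) are spread over 4 worker processes, and it is those worker processes that the failure scenarios above restart or reassign.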
Now, going back to the specific question:
No, Storm will not automatically increase the number of executor threads your topology was submitted with. What happens is that Nimbus stops receiving heartbeats from the failed supervisor node and reassigns the executors that were running there to the free slots on the other supervisors; with only 3 free slots for the 4 lost workers, the same executors simply get packed into fewer worker processes. And from a design perspective you should not necessarily have to account for redundant ports, as Nimbus is designed to take care of this by either restarting the processes on the remaining ports or by redistributing the work among the other worker nodes.
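If you later want to change the parallelism yourself (for example, spread the work over more workers again once the failed node is back), that is an explicit operation, normally done with the storm rebalance CLI. As a rough sketch under the assumption of Storm 1.x, a hypothetical topology named "example-topology" and a bolt named "print-bolt", the same thing through the Nimbus Thrift client looks roughly like this:

```java
import java.util.Collections;
import java.util.Map;

import org.apache.storm.generated.Nimbus;
import org.apache.storm.generated.RebalanceOptions;
import org.apache.storm.utils.NimbusClient;
import org.apache.storm.utils.Utils;

public class RebalanceExample {
    public static void main(String[] args) throws Exception {
        // Read the client-side storm.yaml to locate Nimbus.
        Map conf = Utils.readStormConfig();
        Nimbus.Client nimbus = NimbusClient.getConfiguredClient(conf).getClient();

        RebalanceOptions opts = new RebalanceOptions();
        opts.set_wait_secs(30);      // drain time before the reassignment kicks in
        opts.set_num_workers(4);     // go back to 4 worker processes
        // Hypothetical component name from the sketch above.
        opts.set_num_executors(Collections.singletonMap("print-bolt", 6));

        nimbus.rebalance("example-topology", opts);
    }
}
```

The point is simply that changing the number of workers or executors is something you ask for (via rebalance); failure handling on its own only relocates the executors you already have.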
Created 08-30-2016 05:32 AM
If a worker goes down, the supervisor will try to restart it; if it is unable to do so after multiple attempts, Nimbus will reassign the tasks to other worker slots. Likewise, in the event of a worker node failure, Nimbus will reassign the tasks to other nodes.
Created 08-30-2016 12:14 PM
Thanks @Rajkumar Singh
Created 09-26-2016 09:02 AM
What happens if, after continuous worker failures, all tasks ultimately get rescheduled onto one worker?