Created 01-20-2016 04:56 PM
What are best practices for deploying Storm components on a cluster for scalability and growth? We are thinking of having dedicated nodes for Storm on YARN. Also, would anything go on an edge node?
For example, in a cluster the thought is to have three dedicated Storm nodes (S1, S2, S3) with the following allocations:
Storm Nimbus
Storm Supervisors / Workers
Zookeeper Cluster
Storm UI
DRPC Server
So in summary, if we have three dedicated nodes for Storm, the thinking is to allocate as follows:
S1 Node:
S2 Node:
S3 Node:
Edge Node:
Finally, would the DRPC server go on the Nimbus node? Any thoughts on this? Am I on the right track? Would anything go on an edge node?
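For concreteness, here is a minimal storm.yaml sketch of how I would expect that layout to be wired up; the hostnames s1/s2/s3 and the port values are assumptions for illustration, not a recommendation:

```
# Hypothetical storm.yaml for the three-node layout above
# (hostnames s1, s2, s3 are assumed)
storm.zookeeper.servers:
  - "s1"
  - "s2"
  - "s3"
nimbus.host: "s1"            # Nimbus on S1 (nimbus.seeds on Storm 1.0+)
drpc.servers:
  - "s1"                     # DRPC co-located with Nimbus
ui.port: 8080                # Storm UI, typically on the Nimbus node
supervisor.slots.ports:      # one worker slot per port on the supervisor nodes
  - 6700
  - 6701
  - 6702
  - 6703
```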
Created 01-20-2016 05:02 PM
Storm supervisors on their own nodes; Kafka brokers can be co-located with DataNodes. Those are my findings from our recent POC. I can give you more detail on the phone. @Ancil McBarnett
Created 01-20-2016 05:02 PM
@Ancil McBarnett Have you looked at this guide?
Created 01-20-2016 05:02 PM
Yes, but it does not go into deployment from a cluster-topology point of view (except for the discussion of ZooKeeper). @Wes Floyd
Created 01-21-2016 06:14 AM
Hi @Ancil McBarnett my 2 cents:
Created 01-21-2016 07:31 PM
@Predrag Minovic why not put Supervisors on their own nodes rather than on DataNodes?
Created 01-21-2016 07:32 PM
@Ancil McBarnett @tgoetz suggests putting supervisors on their own nodes.
Created 01-21-2016 11:08 PM
@Ancil McBarnett Oh yes, supervisors definitely on dedicated nodes if you have enough nodes. I updated my answer.
Created 02-04-2016 04:43 PM
Keep ZooKeeper on nodes separate from the Kafka brokers. Do not install ZooKeeper on the same node as a Kafka broker if you want optimal Kafka performance: both Kafka and ZooKeeper are disk I/O intensive.
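To make the disk-contention point concrete, here is a sketch of the relevant ZooKeeper settings (the paths are illustrative assumptions). The key idea is keeping the ZK transaction log on a device that no Kafka log directory touches, since ZooKeeper fsyncs it on every write:

```
# zoo.cfg (illustrative paths): keep the transaction log on its own
# device, away from any disk the Kafka brokers write to
dataDir=/var/zookeeper/snapshot
dataLogDir=/zk-txn-disk/zookeeper   # ZK fsyncs here on every write
```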
Created 01-21-2016 10:52 PM
This is not a complete answer, but I would also like to add that, by default, Kafka brokers write to local storage (not HDFS) and therefore benefit from fast local disks (SSDs) and/or multiple spindles to parallelize writes across partitions. I don't know of a formula to calculate this, but try to maximize I/O throughput to disk, and allocate spindles up to the number of available CPUs per node. Many Hadoop architectures don't really specify an allocation for local storage (beyond the OS disk), so it is something to be aware of.
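As a rough illustration of the spindle point (paths and counts are assumptions, not a sizing formula), a broker with six dedicated data disks might be configured like this; Kafka spreads partitions across the listed log directories, so more directories on separate disks means more parallel writes:

```
# server.properties sketch for a broker with six local data disks
log.dirs=/disk1/kafka,/disk2/kafka,/disk3/kafka,/disk4/kafka,/disk5/kafka,/disk6/kafka
num.io.threads=6   # a common starting point is roughly one I/O thread per disk
```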