Created 04-26-2016 01:18 AM
Hi
I am building a data lake with HDP, where Kafka will be used to ingest all the data.
I have two options: one cluster for everything, with Kafka deployed exclusively on some dedicated nodes, or one HDP cluster for storage and processing plus a second cluster running only Kafka.
What's the best approach? What are the pros and cons?
How should I size the Kafka part?
Created 04-26-2016 02:43 AM
Since you plan dedicated Kafka nodes in your "cluster for everything", Kafka performance will be the same as in a stand-alone Kafka cluster. However, it's good to have a dedicated ZooKeeper quorum for Kafka. In the first option, Ambari currently doesn't support two ZK quorums per cluster, so you would need to install the ZK quorum for Kafka manually. That's not so complicated, but if you go for a stand-alone Kafka cluster, you can use Ambari to install and manage its ZK. So my recommendation is to go for a stand-alone Kafka cluster.
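To make the dedicated-quorum point concrete: each Kafka broker finds its ZooKeeper ensemble through the zookeeper.connect setting in server.properties, so pointing Kafka at its own quorum mostly means listing those hosts there. Below is a rough Python sketch that builds that line. The hostnames and the helper name zookeeper_connect are made up for illustration; the host:port list with an optional trailing chroot path is the connection-string format Kafka documents.

```python
def zookeeper_connect(hosts, port=2181, chroot=""):
    """Build the value for Kafka's `zookeeper.connect` broker setting.

    hosts  -- ZooKeeper hostnames of the quorum Kafka should use
    port   -- ZooKeeper client port (2181 is the usual default)
    chroot -- optional path suffix, e.g. "/kafka", that keeps Kafka's
              znodes under their own subtree
    """
    if not hosts:
        raise ValueError("at least one ZooKeeper host is required")
    if chroot and not chroot.startswith("/"):
        raise ValueError("chroot must start with '/'")
    servers = ",".join(f"{host}:{port}" for host in hosts)
    return servers + chroot


# Hypothetical hostnames for a ZooKeeper quorum dedicated to Kafka.
kafka_zk_hosts = [
    "kafka-zk1.example.com",
    "kafka-zk2.example.com",
    "kafka-zk3.example.com",
]

# The resulting line would go into each broker's server.properties.
line = "zookeeper.connect=" + zookeeper_connect(kafka_zk_hosts)
print(line)
```

With a manually installed quorum you would put this line in by hand; in an Ambari-managed stand-alone Kafka cluster, Ambari is expected to fill in the equivalent value for you.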
Created 05-04-2016 08:21 PM
@Predrag Minovic, can you explain why Kafka needs its own ZK quorum? Why can't it use an existing ZK quorum? We are migrating to Kafka in production and I would like to get your take on this.
Created 10-08-2016 11:49 AM
@David Lays Please let me know which Kafka design you finally went with: Kafka on cluster nodes or a separate Kafka cluster. We are facing exactly the same design dilemma for the Kafka installation on our cluster.
Thanks very much in advance.