Created on 01-26-2017 05:14 AM
Kafka's mirroring feature makes it possible to maintain a replica of an existing Kafka cluster. The tool uses a Kafka consumer to consume messages from the source cluster and re-publishes those messages to the target cluster using an embedded Kafka producer.
To set up a mirror, run kafka.tools.MirrorMaker. The following table lists configuration options.
At a minimum, MirrorMaker requires one or more consumer configuration files, a producer configuration file, and either a whitelist or a blacklist of topics. In the consumer configuration file, point the consumer to the source cluster; in the producer configuration file, point the producer to the destination (mirror) cluster.
bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config sourceCluster1Consumer.config --consumer.config sourceCluster2Consumer.config --num.streams 2 --producer.config targetClusterProducer.config --whitelist=".*"
--consumer.config: Specifies a file that contains configuration settings for the source cluster. For more information about this file, see the "Consumer Configuration File" subsection.
--producer.config: Specifies the file that contains configuration settings for the target cluster. For more information about this file, see the "Producer Configuration File" subsection.
--whitelist, --blacklist: (Optional) For a partial mirror, you can specify exactly one comma-separated list of topics to include (--whitelist) or exclude (--blacklist). In general, these options accept Java regex patterns. For caveats, see the note after this table.
--num.streams: Specifies the number of consumer stream threads to create.
--num.producers: Specifies the number of producer instances. Setting this to a value greater than one establishes a producer pool that can increase throughput.
--queue.size: Number of messages buffered between the consumer and producer. Default = 10000.
Table: MirrorMaker command-line options.
Note: With --whitelist=".*", MirrorMaker tries to fetch data from the system-level topic __consumer_offsets and produce that data to the target cluster. Make sure you add exclude.internal.topics=true to the consumer properties.
Workaround: Specify topic names, or to replicate all topics, specify
In the consumer configuration, bootstrap.servers should point to the source cluster.
Here is a sample consumer configuration file:
bootstrap.servers=kafka-source-server1:6667,kafka-source-server2:6667,kafka-source-server3:6667
group.id=dp-MirrorMaker-group
exclude.internal.topics=true
mirror.topics.whitelist=app_log
client.id=mirror_maker_consumer
In the producer configuration, bootstrap.servers should point to the target cluster.
Here is a sample producer configuration file:
bootstrap.servers=kafka-target-server1:6667,kafka-target-server2:6667,kafka-target-server3:6667
acks=1
batch.size=100
client.id=mirror_maker_producer
If you have consumers that will read from the target cluster and they need the same parallelism as on the source cluster, it is important to create the topic in the target cluster with the same number of partitions.
For example, if you have a topic called "click-logs" with 6 partitions in the source cluster, make sure it has the same number of partitions in the target cluster. If you use the target cluster mainly as a backup rather than as an active cluster, the partition counts might not need to match.
If the topic does not exist in the target cluster, the MirrorMaker producer will attempt to create it, and the target cluster broker will create the topic with its configured num.partitions and default.replication.factor. These may not be the partition count and replication factor you want.
We recommend running MirrorMaker on the target cluster.
To avoid data loss, make sure the following settings are in place for the consumer and for MirrorMaker.
For the consumer, set enable.auto.commit=false in consumer.properties (auto.commit.enable=false if you use the old ZooKeeper-based consumer).
For MirrorMaker, set --abort.on.send.failure true
The following actions will be taken by MirrorMaker
As the last point states, if any error occurs, the MirrorMaker process is killed. We therefore recommend using a watchdog process such as supervisord to restart the killed MirrorMaker process.
The num.streams option in MirrorMaker sets the number of consumers to create.
MirrorMaker starts one consumer thread for each stream specified in num.streams.
Keep in mind that a topic-partition is the unit of parallelism in Kafka. If you have a topic called "click-logs" with 6 partitions, the maximum number of consumers that can do useful work is 6. If you run more than 6, the additional consumers will be idle; if you run fewer than 6, all 6 partitions will be distributed among the available consumers. More partitions lead to more throughput.
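The effect of the consumer count on a 6-partition topic can be illustrated with a small sketch. It deals partitions out one at a time, which is a simplification for illustration only and not Kafka's actual assignment strategy; the function name distribute is hypothetical.

```python
def distribute(partitions, num_consumers):
    """Deal partitions out to consumers one at a time.

    Simplified illustration only; Kafka's real assignors differ in detail.
    """
    assignment = {consumer: [] for consumer in range(num_consumers)}
    for index, partition in enumerate(partitions):
        assignment[index % num_consumers].append(partition)
    return assignment


# The "click-logs" topic with 6 partitions, as in the text above.
partitions = [f"click-logs-{p}" for p in range(6)]

for num_consumers in (4, 6, 8):
    assignment = distribute(partitions, num_consumers)
    idle = [c for c, owned in assignment.items() if not owned]
    print(f"{num_consumers} consumers: {len(idle)} idle")
```

With 8 consumers two of them own nothing, while with 4 consumers some own two partitions and the rest own one.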
Before going further into num.streams, we recommend running multiple instances of MirrorMaker across machines with the same group.id in consumer.config. If a MirrorMaker process goes down for any reason, the topic-partitions it owned are rebalanced among the other running MirrorMaker processes. This gives you high availability for MirrorMaker.
Coming back to choosing num.streams: let's say you have 3 topics with 4 partitions each and you are running 3 MirrorMaker processes. You can choose 4 as your num.streams. This way each MirrorMaker instance starts 4 consumers, reads 4 of the 12 topic-partitions in total, and writes them to the target cluster.
If you run just 1 MirrorMaker process with num.streams set to 4, then 4 consumers will be reading from all 12 topic-partitions. This means a lot more traffic into a single machine, and it will be slower. Also, if that MirrorMaker process stops, there is no other MirrorMaker process to take over.
For maximum performance, the total number of streams (num.streams summed across all MirrorMaker processes) should match the total number of topic-partitions that MirrorMaker is copying to the target cluster.
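As a rough sketch of that sizing rule, the following computes a per-process num.streams value so that the streams across all processes cover every topic-partition. The function name and the topic names are hypothetical, and it assumes every process uses the same num.streams value.

```python
import math


def streams_per_process(topic_partitions, num_processes):
    """Return a num.streams value per MirrorMaker process.

    topic_partitions maps topic name -> partition count (hypothetical input).
    Rounds up so the total stream count is never below the partition count.
    """
    total_partitions = sum(topic_partitions.values())
    return math.ceil(total_partitions / num_processes)


# 3 topics with 4 partitions each, as in the example above.
topics = {"topic-a": 4, "topic-b": 4, "topic-c": 4}

print(streams_per_process(topics, 3))  # 3 processes -> 4 streams each
print(streams_per_process(topics, 1))  # 1 process -> 12 streams
```

When the partition count does not divide evenly, rounding up leaves a few streams idle, which is harmless compared with leaving partitions to share a consumer.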
You can co-locate more than one MirrorMaker process on a single machine. Always run more than one MirrorMaker process, and make sure they all use the same group.id in the consumer config.
In general, you should set a high value for the socket buffer size on the mirror-maker's consumer configuration (socket.buffersize) and the source cluster's broker configuration (socket.send.buffer). Also, the mirror-maker consumer's fetch size (fetch.size) should be higher than the consumer's socket buffer size. Note that the socket buffer size configurations are a hint to the underlying platform's networking code.
The consumer offset checker tool is useful to gauge how well your mirror is keeping up with the source cluster. Note that the --zkconnect argument should point to the source cluster's ZooKeeper. Also, if the topic is not specified, then the tool prints information for all topics under the given consumer group.
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group KafkaMirror --zkconnect dc1-zookeeper:2181 --topic test-topic
Minimal lag here indicates a healthy MirrorMaker.
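To make "lag" concrete: for each partition it is the difference between the log size on the source cluster and the offset the MirrorMaker consumer group has committed. The sketch below sums that difference over made-up sample rows; the numbers and the function name total_lag are hypothetical, and parsing the tool's real output is left out.

```python
def total_lag(rows):
    """Sum per-partition lag, where lag = log size - committed offset.

    Each row is (partition, committed_offset, log_size); values are
    illustrative only.
    """
    return sum(max(log_size - offset, 0) for _, offset, log_size in rows)


# Hypothetical per-partition readings for one topic.
rows = [
    (0, 1500, 1520),  # 20 messages behind
    (1, 980, 980),    # fully caught up
    (2, 2000, 2350),  # 350 messages behind
]

print(total_lag(rows))  # 370
```

A total that stays small or shrinks over time suggests the mirror is keeping up; a total that keeps growing suggests it needs more streams or processes.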
We recommend using SSL in MirrorMaker from Kafka 0.10.x onwards. You can learn more details about setting SSL for brokers, producers and consumers here
Users can share the same key and trust stores for both the producer and the consumer in MirrorMaker. Make sure you grant read and write permissions for the certificate/hostname if you are using an authorizer with SSL.
In Kafka 0.9.x, 0.10.0.1, and 0.10.1.0, the consumer and producer in MirrorMaker cannot run with different principals/keytabs because both run inside a single JVM. Users therefore need to configure a single principal for both the consumer and the producer. This means the same principal needs at least read and describe access on the source cluster topics, and write and describe access on the target cluster topics.
In a future version of Kafka, users will be able to configure different principals/keytabs for the consumer and the producer in MirrorMaker.
Created on 01-30-2017 10:54 PM
Thanks @Sriharsha Chintalapani for bringing out this article. A much needed one with growing importance of Kafka in every data-centric organisation. Covers lot of ground from MirrorMaker perspective. Thanks
Created on 01-09-2019 02:34 AM
Helped me big time! Thank you for sharing! Kafka all-the-way!