Kafka's mirroring feature makes it possible to maintain a replica of an existing Kafka cluster. The tool consumes messages from the source cluster using an embedded Kafka consumer and re-publishes them to the target cluster using an embedded Kafka producer.
Running MirrorMaker
To set up a mirror, run kafka.tools.MirrorMaker. The following list describes the available command-line options.
At a minimum, MirrorMaker requires one or more consumer configuration files, a producer configuration file, and either a whitelist or a blacklist of topics. In the consumer configuration file, point the consumer at the source cluster; in the producer configuration file, point the producer at the destination (mirror) cluster.
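For example, a minimal invocation looks like the following sketch (the configuration file names are placeholders for your own files):

  bin/kafka-run-class.sh kafka.tools.MirrorMaker \
      --consumer.config sourceClusterConsumer.config \
      --num.streams 2 \
      --producer.config targetClusterProducer.config \
      --whitelist=".*"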
--consumer.config
  Specifies a file that contains configuration settings for the source cluster. For more information about this file, see the "Consumer Configuration File" subsection.
--producer.config
  Specifies the file that contains configuration settings for the target cluster. For more information about this file, see the "Producer Configuration File" subsection.
--whitelist, --blacklist
  (Optional) For a partial mirror, you can specify exactly one comma-separated list of topics to include (--whitelist) or exclude (--blacklist). In general, these options accept Java regex patterns. For caveats, see the notes following this list.
--num.streams
  Specifies the number of consumer stream threads to create.
--num.producers
  Specifies the number of producer instances. Setting this to a value greater than one establishes a producer pool that can increase throughput.
--queue.size
  Specifies the size, in number of messages, of the queue buffered between the consumer and producer. Default = 10000.
--help
  Lists MirrorMaker command-line options.
Note that a comma (',') in a topic list is interpreted as the regex-choice symbol ('|').
If you specify --whitelist=".*", MirrorMaker tries to fetch data from the system-level topic __consumer_offsets and produce that data to the target cluster. To prevent this, add exclude.internal.topics=true to the consumer properties. As a workaround, you can also specify explicit topic names instead of a catch-all pattern.
Example consumer config file
In the consumer configuration, bootstrap.servers should point to the source cluster.
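A minimal sketch of such a file, assuming hypothetical broker addresses and group id:

  # consumer.properties: points at the SOURCE cluster
  bootstrap.servers=src-broker1:9092,src-broker2:9092
  # every MirrorMaker instance that should share the load must use the same group id
  group.id=mirror-maker-group
  # skip internal topics such as __consumer_offsets (see the whitelist note above)
  exclude.internal.topics=true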
If consumers will read from the target cluster and need the same parallelism they have on the source cluster, create the mirrored topics on the target cluster with the same number of partitions. For example, if a topic named "click-logs" has 6 partitions on the source cluster, create it with 6 partitions on the target cluster as well. If the target cluster serves only as a passive backup, the partition counts do not have to match.
If you do not create the topic on the target cluster, the MirrorMaker producer triggers auto-creation, and the target broker creates the topic with its configured num.partitions and default.replication.factor, which may not be the partition count and replication factor you want.
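To pre-create the topic yourself, a sketch using the kafka-topics tool (the ZooKeeper address and replication factor are placeholders):

  bin/kafka-topics.sh --create --zookeeper target-zk:2181 \
      --topic click-logs --partitions 6 --replication-factor 3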
Where to run MirrorMaker
We recommend running MirrorMaker on the target cluster. That way, if the long-distance network link fails, the consumer simply cannot fetch and the data remains safe on the source cluster, whereas a remote producer that fails after messages have been consumed risks losing them.
No data loss
To avoid data loss, make sure the following settings are in place in the consumer and MirrorMaker configuration:
For the consumer, set auto.commit.enable=false (enable.auto.commit=false for the new consumer) in consumer.properties, so offsets are committed only after messages have been produced successfully.
For MirrorMaker, pass --abort.on.send.failure=true.
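A sketch of the producer-side settings commonly recommended alongside these options (the broker address is a placeholder):

  # producer.properties: points at the TARGET cluster
  bootstrap.servers=dest-broker1:9092
  # wait until all in-sync replicas acknowledge each write
  acks=all
  # retry retriable send failures indefinitely
  retries=2147483647
  # one in-flight request per connection, so only one request is sent
  # to a broker at a time and ordering is preserved
  max.in.flight.requests.per.connection=1
  # on older clients, also set block.on.buffer.full=true (replaced by max.block.ms)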
With these settings, MirrorMaker behaves as follows:
MirrorMaker sends only one in-flight request to a broker at any given time.
If any exception is caught in a MirrorMaker thread, MirrorMaker tries to commit the acknowledged offsets and then exits immediately.
On a RetriableException, the producer retries indefinitely. If the retries do not succeed, MirrorMaker eventually blocks on a full producer buffer.
On a non-retriable exception, if --abort.on.send.failure is specified, MirrorMaker stops. Otherwise, the producer callback records the message that was not successfully sent and lets MirrorMaker continue; in that case, the message is lost on the target cluster.
As the last point implies, any unrecoverable error kills the MirrorMaker process. We therefore recommend running MirrorMaker under a watchdog process such as supervisord, which restarts the process when it dies.
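A minimal supervisord program entry might look like this sketch (paths and names are placeholders):

  [program:mirrormaker]
  command=/opt/kafka/bin/kafka-mirror-maker.sh --consumer.config /etc/kafka/consumer.properties --producer.config /etc/kafka/producer.properties --whitelist="click-logs"
  autorestart=true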
MirrorMaker Sizing
The num.streams option controls how many consumer threads MirrorMaker creates:
MirrorMaker deploys the number of threads specified by num.streams.
Each thread instantiates and uses one consumer, a 1:1 mapping between MirrorMaker threads and consumers.
All threads share the same producer, an N:1 mapping between threads and producers.
Keep in mind that a topic partition is the unit of parallelism in Kafka. If a topic called "click-logs" has 6 partitions, then the maximum number of consumers you can usefully run is 6: run more than 6 and the additional consumers sit idle; run fewer than 6 and all 6 partitions are distributed among the available consumers. More partitions allow more throughput.
Before going further into num.streams, we recommend running multiple MirrorMaker instances across machines with the same group.id in consumer.config. If a MirrorMaker process goes down for any reason, the topic partitions it owned are rebalanced among the other running MirrorMaker processes, which gives the mirror high availability.
Coming back to choosing num.streams: suppose you have 3 topics with 4 partitions each and you run 3 MirrorMaker processes. You can choose num.streams=4; each MirrorMaker instance then starts 4 consumers, each reading from 4 topic partitions and writing to the target cluster.
If instead you run just 1 MirrorMaker process with num.streams=4, its 4 consumers read from all 12 topic partitions. That pushes far more traffic through a single machine, which is slower, and if that process stops, there is no other MirrorMaker process to take over.
For maximum performance, the total number of streams across all MirrorMaker processes should match the total number of topic partitions being mirrored to the target cluster.
You can co-locate more than one MirrorMaker process on a single machine, but always run more than one MirrorMaker process overall, and make sure they all use the same group.id in the consumer config.
Socket buffer sizes
In general, you should set a high value for the socket buffer size in the MirrorMaker consumer configuration (socket.buffersize) and in the source cluster's broker configuration (socket.send.buffer). Also, the MirrorMaker consumer's fetch size (fetch.size) should be higher than its socket buffer size. Note that the socket buffer size configurations are only a hint to the underlying platform's networking code.
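A sketch with the property names used above (the byte values are illustrative; newer clients expose roughly equivalent settings such as receive.buffer.bytes and max.partition.fetch.bytes):

  # MirrorMaker consumer configuration
  socket.buffersize=2097152
  fetch.size=4194304

  # source cluster broker configuration
  socket.send.buffer=2097152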
Check the health of MirrorMaker
The consumer offset checker tool is useful for gauging how well your mirror is keeping up with the source cluster. Note that the --zkconnect argument should point to the source cluster's ZooKeeper. If no topic is specified, the tool prints information for all topics under the given consumer group.
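For example, a sketch of such a check (the group name, ZooKeeper address, and topic are placeholders):

  bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
      --group KafkaMirror --zkconnect source-zk:2181 --topic test-topic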
You can share the same key and trust stores for both the producer and the consumer in MirrorMaker. If you are using an authorizer with SSL, make sure the certificate principal (or hostname) has been granted both read and write permissions.
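A sketch of the shared SSL settings, which can go in both consumer.properties and producer.properties (store locations and passwords are placeholders):

  security.protocol=SSL
  ssl.keystore.location=/var/private/ssl/kafka.client.keystore.jks
  ssl.keystore.password=keystore-secret
  ssl.key.password=key-secret
  ssl.truststore.location=/var/private/ssl/kafka.client.truststore.jks
  ssl.truststore.password=truststore-secret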
In Kafka 0.9.x, 0.10.0.1, and 0.10.1.0, the consumer and producer in MirrorMaker cannot run with different principals/keytabs, because both run inside a single JVM. You therefore need to use a single principal to configure both the consumer and the producer. That principal must have at least Read and Describe access on the source cluster's topics, and Write and Describe access on the topics in the target cluster.
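A sketch of granting those ACLs with the kafka-acls tool, assuming a hypothetical principal User:mirrormaker (ZooKeeper addresses and the topic name are placeholders):

  # on the source cluster: allow reads
  bin/kafka-acls.sh --authorizer-properties zookeeper.connect=source-zk:2181 \
      --add --allow-principal User:mirrormaker \
      --operation Read --operation Describe --topic click-logs

  # on the target cluster: allow writes
  bin/kafka-acls.sh --authorizer-properties zookeeper.connect=target-zk:2181 \
      --add --allow-principal User:mirrormaker \
      --operation Write --operation Describe --topic click-logs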
A future version of Kafka will let users configure different principals/keytabs for the consumer and producer in MirrorMaker.