I am trying to migrate from one Kafka cluster to another using MirrorMaker 2.0. I am using Kafka 2.8.1, which has good support for consumer-group replication and offset sync.
I have encountered a scenario where I want to migrate a topic, along with its producers and consumers, from the source cluster to the target cluster.
Example:
I have topic A, which I want to migrate; it has 3 partitions
source -> topic = A
destination -> topic = source.A
Topic A is replicated as source.A in the target cluster.
I have a consumer group topic-a-reader-group at the source cluster with 3 consumers. It is replicated at the destination cluster with the same name, and I have created 1 consumer in this group at the destination cluster.
Now when I produce messages to topic A at the source cluster, they are consumed by the 3 consumers at the source cluster, and they also get replicated to the target cluster, where they are consumed by the consumer in the consumer group running there. In effect, each message is read by both the source consumers and the target consumer: a duplicate read, ultimately.
I want only one consumer to get each message, not duplicate reads at source and destination. In my application, I can't simply turn off the consumers at the source and move them to the target cluster (it is a time-critical application). So I want to keep consumers running at both source and target for some time, with both active, and turn off the source consumers after some duration.
Is there any idempotence-like property for consumer groups that lets only one consumer group read each message in a MirrorMaker scenario, without duplication across the source and target clusters?
Please suggest any other approaches to move consumers from the source to the target cluster without downtime.
Related
We have two different Kafka clusters with 10 brokers each, and each cluster has its own ZooKeeper ensemble. We have also set up MirrorMaker 2 to sync data between the clusters. With MM2, the offsets are synced along with the data.
We are looking to set up Active/Active for our consumer application as well as our producer application.
Let's say the clusters are DC1 & DC2.
Topic name is test-mm.
With MM2 setup,
In DC1:
test-mm
test-mm-DC2 (mirror of DC2)
In DC2:
test-mm
test-mm-DC1 (mirror of DC1)
Consumer Active/Active
In DC1, I have an application consuming data from test-mm & test-mm-DC2 with the consumer group name group1-test.
In DC2, the same application is consuming data from test-mm & test-mm-DC1 with the consumer group name group1-test.
Application is running as Active/Active on both DCs.
Now a producer in DC1 produces to the topic test-mm in DC1, and it gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is that since the offsets get synced, with the same consumer group name we can run the consumer application in both DCs and only one consumer will get and process each message. Also, when the consumer application in DC1 goes down, the consumer application in DC2 will start processing, and we can achieve real active/active for consumers. Is this correct?
Producer Active/Active
It may not be possible with one producer in DC1 and another in DC2, as ordering may not be maintained across two different producers. Not sure if Active/Active can be achieved with producers.
You will want two producers, one producing to test-mm in DC1 and the other producing to test-mm in DC2. Once messages have been produced to test-mm in DC1, they will be replicated to test-mm-DC1 in DC2, and vice versa. This achieves active/active, as the data will exist in both DCs; your consumers are also consuming from both DCs, and if one DC fails, the other producer and consumer will continue as normal. Please let me know if this has not answered your question.
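As a rough sketch, a bidirectional MM2 setup along these lines could be driven by an mm2.properties like the following (broker addresses are placeholders; note that MM2's default ReplicationPolicy names mirrored topics with a source-alias prefix, e.g. DC2.test-mm, so the suffix naming used above would imply a custom replication policy):

```properties
# Hypothetical mm2.properties for a bidirectional (active/active) setup
clusters = DC1, DC2
DC1.bootstrap.servers = dc1-broker1:9092,dc1-broker2:9092
DC2.bootstrap.servers = dc2-broker1:9092,dc2-broker2:9092

# Replicate the topic in both directions
DC1->DC2.enabled = true
DC2->DC1.enabled = true
DC1->DC2.topics = test-mm
DC2->DC1.topics = test-mm

# Sync consumer-group offsets across clusters (KIP-545, Kafka 2.7+)
DC1->DC2.sync.group.offsets.enabled = true
DC2->DC1.sync.group.offsets.enabled = true
```

This is a configuration sketch only; whether a single consumer group genuinely processes each message exactly once across both DCs is addressed in the answer below about cross-cluster guarantees.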
Hopefully my comment answers your question about exactly-once processing with MM2. The Stack Overflow post I linked takes the following paragraph from the IBM guide: https://ibm-cloud-architecture.github.io/refarch-eda/technology/kafka-mirrormaker/#record-duplication
This Cloudera blog also mentions that exactly-once processing does not apply across multiple clusters: https://blog.cloudera.com/a-look-inside-kafka-mirrormaker-2/
Cross-cluster Exactly Once Guarantee
Kafka provides support for exactly-once processing but that guarantee is provided only within a given Kafka cluster and does not apply across multiple clusters. Cross-cluster replication cannot directly take advantage of the exactly-once support within a Kafka cluster. This means MM2 can only provide at-least-once semantics when replicating data across the source and target clusters, which implies there could be duplicate records downstream.
Now with regards to the below question:
Now a producer in DC1 produces to the topic test-mm in DC1, and it gets mirrored to the topic test-mm-DC1 in DC2. My assumption here is that since the offsets get synced, with the same consumer group name we can run the consumer application in both DCs and only one consumer will get and process each message. Also, when the consumer application in DC1 goes down, the consumer application in DC2 will start processing, and we can achieve real active/active for consumers. Is this correct?
See this post, where a similar question is asked: How are consumers setup in Active - Active Kafka setup
I've not configured MM2 in an active/active architecture before so can't confirm whether you would have two active consumers for each DC or one. Hopefully another member will be able to answer this question for you.
I am ingesting data into Druid from a Kafka topic. Now I want to migrate my Kafka topic to a new Kafka cluster. What are the possible ways to do this without duplicating data and without downtime?
I have considered below possible ways to migrate Topic to the new Kafka Cluster.
Manual Migration:
Create a topic with the same configuration in the new Kafka cluster.
Stop pushing data to the old Kafka cluster.
Start pushing data to the new cluster.
Stop consuming from the old cluster.
Start consuming from the new cluster.
Produce data in both Kafka clusters:
Create a topic with the same configuration in the new Kafka cluster.
Start producing messages to both Kafka clusters.
Change the Kafka topic configuration in Druid.
Reset Kafka topic offset in Druid.
Start consuming from the new cluster.
After successful migration, stop producing in the old Kafka cluster.
Use Mirror Maker 2:
MM2 creates the Kafka topic in the new cluster.
Start replicating data to the new cluster.
Move producers and consumers to the new Kafka cluster.
The problem with this approach:
Druid manages the Kafka topic's offsets in its metadata.
MM2 creates the replicated topic under a prefixed name in the new cluster, so the new cluster ends up with two differently named topics for the same data.
Does Druid support topic names with regex?
Note: Druid manages Kafka topic offsets in its metadata.
Druid Version: 0.22.1
Old Kafka Cluster Version: 2.0
Maybe a slight modification of your option 1:
Start publishing to the new cluster.
Wait for the current supervisor to catch up on all the data in the old topic.
Suspend the supervisor. This forces all the tasks to write and publish their segments. Wait for all the tasks for this supervisor to succeed. This is where the "downtime" starts. All of the currently ingested data is still queryable while we switch to the new cluster. New data accumulates in the new cluster but is not yet being ingested into Druid.
All the offset information for the current datasource is stored in the metadata storage. Delete those records using:
DELETE FROM druid_dataSource WHERE dataSource = {name};
Terminate the current supervisor.
Submit the new spec with the new topic and new server information.
You can follow these steps:
1- On your new cluster, create your new topic (the same name or a new name, it doesn't matter)
2- Change your app config to send messages to the new Kafka cluster
3- Wait until Druid has consumed all messages from the old Kafka cluster; you can verify this by checking the supervisor's lag and offset info
4- Suspend the supervisor, and wait for the tasks to publish their segments and exit successfully
5- Edit the Druid datasource spec: make sure useEarliestOffset is set to true, and change the connection info to consume from the new Kafka cluster (and the new topic name if it isn't the same)
6- Save the schema and resume the task. Druid will fail to find the stored offsets in the new Kafka cluster, and will then start from the beginning
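For illustration, the relevant part of a Kafka ingestion supervisor spec after step 5 might look roughly like this (broker addresses and the topic name are placeholders; useEarliestOffset is the flag mentioned above):

```json
{
  "ioConfig": {
    "type": "kafka",
    "topic": "my-topic",
    "useEarliestOffset": true,
    "consumerProperties": {
      "bootstrap.servers": "new-broker1:9092,new-broker2:9092"
    }
  }
}
```

This is only a fragment of the full supervisor spec; the rest of the spec (dataSchema, tuningConfig) stays as it was.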
Options 1 and 2 will have downtime, and you will lose all data in the existing topic.
Option 2 cannot guarantee you won't lose data or generate duplicates, as you are trying to send messages to multiple clusters together.
There will be no way to migrate the Druid/Kafka offset data to the new cluster without at least trying MM2. You say you can reset the offsets in option 2, so why not do the same with option 3? I haven't used Druid, but it should be able to support consuming from multiple topics, whether by pattern or not. With option 3, you don't need to modify any producer code until you are satisfied with the migration process.
I am using MirrorMaker2 for DR.
Kafka 2.7 should support automated consumer offset sync.
Here is the YAML file I am using (I use Strimzi to create it).
All source cluster topics are replicated in the destination cluster.
Also, a ...checkpoints.internal topic is created in the destination cluster that contains all the synced source cluster offsets, BUT I don't see these offsets being translated into the destination cluster's __consumer_offsets topic, which means that when I start a consumer (same consumer group) in the destination cluster, it will start reading messages from the beginning.
My expectation is that after enabling automated consumer offset sync, all consumer offsets from the source cluster would be translated and stored in the __consumer_offsets topic in the destination cluster.
Can someone please clarify whether my expectation is correct, and if not, how it should work?
The sync.group.offsets.enabled setting is for MirrorCheckpointConnector.
I'm not entirely sure how Strimzi runs MirrorMaker 2, but I think you need to set it like:
checkpointConnector:
  config:
    checkpoints.topic.replication.factor: 1
    sync.group.offsets.enabled: "true"
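For context, in a Strimzi KafkaMirrorMaker2 custom resource that config sits under the mirror definition, roughly like the following partial sketch (resource name, aliases, and bootstrap addresses are placeholders):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mirror-maker-2
spec:
  connectCluster: "target"
  clusters:
    - alias: "source"
      bootstrapServers: source-broker:9092
    - alias: "target"
      bootstrapServers: target-broker:9092
  mirrors:
    - sourceCluster: "source"
      targetCluster: "target"
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 1
          sync.group.offsets.enabled: "true"
```

This is not a complete resource (sourceConnector config, version, replicas, etc. are omitted); it only shows where checkpointConnector.config belongs.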
We're using MirrorMaker2 to replicate some topics from one kerberized kafka cluster to another kafka cluster (strictly unidirectional). We don't control the source kafka cluster and we're given only access to describe and read specific topics that are to be consumed.
MirrorMaker2 creates and maintains a topic (mm2-offset-syncs) in the source cluster to encode cluster-to-cluster offset mappings for each topic-partition being replicated and also creates an AdminClient in the source cluster to handle ACL/Config propagation. Because MM2 needs authorization to create and write to these topics in the source cluster, or to perform operations through AdminClient, I'm trying to understand why/if we need these mechanisms in our scenario.
My question is:
In a strictly unidirectional scenario, what is the usefulness of this source-cluster offset-sync topic to Mirrormaker?
If indeed it's superfluous, is it possible to disable it or operate mm2 without access to create/produce to this topic?
If ACL and Config propagation is disabled, is it safe to assume that the AdminClient is not used for anything else?
In the MirrorMaker code, the offset-syncs topic is created by MirrorSourceConnector when it starts and is then maintained by MirrorSourceTask. The same happens with the AdminClient in MirrorSourceConnector.
I have found no way to toggle off these features, but honestly I might be missing something.
There is an option introduced in Kafka 3.0 that makes MM2 not create the mm2-offset-syncs topic in the source cluster and instead operate on it in the target cluster.
This is thanks to KIP-716: https://cwiki.apache.org/confluence/display/KAFKA/KIP-716%3A+Allow+configuring+the+location+of+the+offset-syncs+topic+with+MirrorMaker2
Jira: https://issues.apache.org/jira/browse/KAFKA-12379
Pull request: https://github.com/apache/kafka/pull/10221
Tim Berglund noted this KIP-716 in Kafka 3.0 release: https://www.youtube.com/watch?v=7SDwWFYnhGA&t=462s
So, to make MM2 operate on the mm2-offset-syncs topic in the target cluster, you should:
set the option src->dst.offset-syncs.topic.location = target
manually create the mm2-offset-syncs.dst.internal topic in the target cluster
start MM2
src and dst are example cluster aliases; replace them with your own.
Keep in mind: if the mm2-offset-syncs.dst.internal topic is not created manually in the target cluster, MM2 will still try to create this topic in the source cluster.
In a one-directional replication setup this topic stays empty the whole time, so it is effectively useless, but MM2 requires it anyway.
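As a sketch, the relevant mm2.properties lines for this (using the example aliases src and dst; broker addresses are placeholders) would look like:

```properties
clusters = src, dst
src.bootstrap.servers = src-broker:9092
dst.bootstrap.servers = dst-broker:9092
src->dst.enabled = true

# KIP-716 (Kafka 3.0+): keep the offset-syncs topic in the target cluster
src->dst.offset-syncs.topic.location = target
```

With this in place, remember to create mm2-offset-syncs.dst.internal in the target cluster yourself before starting MM2, as described above.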
I am working on Kafka mirroring and I am using MM2. I am able to start the mirror process, and all topics and data are replicated from the source to the target cluster.
I need to start the consumers in the target cluster from where they left off in the source cluster. I came across RemoteClusterUtils.translateOffsets to translate the consumer offsets from the remote cluster to the local one.
On checking the code, I can see that it just consumes the checkpoint topic and returns the offsets for the consumer group we provide; it does not commit the offsets.
Do we need to explicitly commit the offsets using an OffsetCommitRequest and then start the consumers in the target cluster, or is there some other way to do this?
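One common pattern (a sketch, not an official recipe; the broker address, group id, and source alias are assumptions) is to translate the offsets with RemoteClusterUtils and then commit them on the target cluster with a plain consumer, before starting the real consumers there:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class OffsetMigrator {
    public static void main(String[] args) throws Exception {
        // Properties pointing at the *target* cluster, where the
        // <source-alias>.checkpoints.internal topic lives.
        Map<String, Object> mm2Props = new HashMap<>();
        mm2Props.put("bootstrap.servers", "target-broker:9092");

        // Translate the source-cluster offsets of this group into
        // target-cluster offsets via the checkpoint topic.
        Map<TopicPartition, OffsetAndMetadata> translated =
            RemoteClusterUtils.translateOffsets(
                mm2Props, "source", "my-consumer-group", Duration.ofSeconds(30));

        // Commit the translated offsets on the target cluster so that
        // consumers in the same group resume from the right position.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "target-broker:9092");
        consumerProps.put("group.id", "my-consumer-group");
        consumerProps.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer =
                new KafkaConsumer<>(consumerProps)) {
            consumer.commitSync(translated);
        }
    }
}
```

This requires a running target cluster to execute. Note that since Kafka 2.7, enabling sync.group.offsets.enabled on the checkpoint connector lets MM2 perform this commit for you automatically, as long as the group is not active on the target cluster.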