Migrate Kafka Topic to new Cluster (and impact on Druid)

I am ingesting data into Druid from a Kafka topic. Now I want to migrate my Kafka topic to a new Kafka cluster. What are the possible ways to do this without duplication of data and without downtime?
I have considered the below possible ways to migrate the topic to the new Kafka cluster.
Manual Migration:
Create a topic with the same configuration in the new Kafka cluster.
Stop pushing data to the old Kafka cluster.
Start pushing data to the new cluster.
Stop consuming from the old cluster.
Start consuming from the new cluster.
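For the topic-creation step, a minimal sketch (the broker address, topic name, and settings below are placeholders, not values from my setup); note that --bootstrap-server requires Kafka 2.2+ tooling, while older tools use --zookeeper:

bin/kafka-topics.sh --bootstrap-server new-kafka:9092 \
  --create --topic my-topic \
  --partitions 3 --replication-factor 3 \
  --config retention.ms=604800000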
Produce data in both Kafka clusters:
Create a topic with the same configuration in the new Kafka cluster.
Start producing messages to both Kafka clusters.
Change the Kafka topic configuration in Druid.
Reset Kafka topic offset in Druid.
Start consuming from the new cluster.
After successful migration, stop producing in the old Kafka cluster.
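For the offset-reset step, Druid exposes a supervisor reset endpoint; a sketch, assuming the Overlord listens on overlord:8090 and the supervisor id is my-datasource (both placeholders):

# Clears the offsets stored in Druid's metadata for this supervisor, so the
# spec's useEarliestOffset/useLatestOffset setting decides where to resume.
curl -X POST http://overlord:8090/druid/indexer/v1/supervisor/my-datasource/reset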
Use Mirror Maker 2:
MM2 creates the Kafka topic in the new cluster.
Start replicating data between the clusters.
Move producers and consumers to the new Kafka cluster.
The problem with this approach:
Druid manages the Kafka topic's offsets in its metadata.
MM2 will create two topics with the same name (one with a prefix) in the new cluster.
Does Druid support topic names with a regex?
Druid Version: 0.22.1
Old Kafka Cluster Version: 2.0

Maybe a slight modification of your number 1:
Start publishing to the new cluster.
Wait for the current supervisor to catch up with all the data in the old topic.
Suspend the supervisor. This forces all of its tasks to write and publish their segments; wait for all the tasks for this supervisor to succeed. This is where the "downtime" starts, though all previously ingested data remains queryable while we switch to the new cluster. New data accumulates in the new cluster but is not yet being ingested into Druid.
All the offset information for the current datasource is stored in the Metadata Storage. Delete those records using:
DELETE FROM druid_dataSource WHERE dataSource = '{name}';
Terminate the current supervisor.
Submit the new spec with the new topic and new server information.
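A sketch of that sequence against the Overlord API (host and supervisor id are placeholders; the metadata delete is the SQL statement above):

# 1. Suspend: running tasks write and publish their segments, then stop.
curl -X POST http://overlord:8090/druid/indexer/v1/supervisor/my-datasource/suspend

# 2. After deleting the druid_dataSource rows, terminate the old supervisor.
curl -X POST http://overlord:8090/druid/indexer/v1/supervisor/my-datasource/terminate

# 3. Submit the new spec pointing at the new cluster/topic.
curl -X POST -H 'Content-Type: application/json' \
  -d @new-supervisor-spec.json \
  http://overlord:8090/druid/indexer/v1/supervisor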

You can follow these steps:
1- On your new cluster, create the new topic (same name or a new name, it doesn't matter)
2- Change your app config to send messages to the new Kafka cluster
3- Wait until Druid has consumed all messages from the old Kafka; you can verify this by checking the supervisor's lag and offset info
4- Suspend the supervisor, and wait for its tasks to publish their segments and exit successfully
5- Edit the Druid datasource's supervisor spec: make sure useEarliestOffset is set to true, and change the connection info to consume from the new Kafka cluster (and the new topic name if it isn't the same)
6- Save the spec and resume the supervisor. Druid will hit a wall when checking the stored offsets, because it cannot find them in the new Kafka, and will then start from the beginning
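A sketch of the relevant ioConfig fields after step 5 (the bootstrap servers and topic name are placeholders):

"ioConfig": {
  "topic": "my-new-topic",
  "consumerProperties": {
    "bootstrap.servers": "new-kafka-1:9092,new-kafka-2:9092"
  },
  "useEarliestOffset": true
}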

Options 1 and 2 will have downtime, and you will lose all the data in the existing topic.
Option 2 cannot guarantee you won't lose data or generate duplicates while you are sending messages to both clusters at once.
There is no way to migrate the Druid/Kafka offset data to the new cluster without at least trying MM2. You say you can reset the offset in Option 2, so why not do the same with Option 3? I haven't used Druid, but it should be able to consume from multiple topics, whether by pattern or not. With Option 3, you don't need to modify any producer code until you are satisfied with the migration process.
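For Option 3, a minimal MirrorMaker 2 sketch (cluster aliases, hosts, and the topic name are placeholders):

# mm2.properties -- run with: bin/connect-mirror-maker.sh mm2.properties
clusters = old, new
old.bootstrap.servers = old-kafka:9092
new.bootstrap.servers = new-kafka:9092

# Replicate only the topic being migrated, from old to new.
old->new.enabled = true
old->new.topics = my-topic

Note that the default replication policy names the replica old.my-topic on the target, which is the prefix you mention; Kafka 3.0+ ships org.apache.kafka.connect.mirror.IdentityReplicationPolicy to keep the original name.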

Related

Make consumers in the consumer group idempotent while using mirror maker

I am trying to perform a Kafka cluster migration from one cluster to another using MirrorMaker 2.0. I am using Kafka 2.8.1, which has good support for consumer-group replication and offset sync.
I have encountered a scenario where I want to migrate a topic along with its producers and consumers from the source cluster to the target cluster.
Example:
I have topic A, which I want to migrate; it has 3 partitions:
source -> topic = A
destination -> topic = source.A
topic A is replicated as source.A in target cluster
I have a consumer group topic-a-reader-group created at the source cluster with 3 consumers. It is replicated at the destination cluster with the same name, and I have created 1 consumer in this group at the destination cluster.
Now when I produce messages to topic A at the source cluster, they are consumed by the 3 consumers at the source cluster, and they are also replicated to the target cluster, where they are consumed by the consumer in the group running there. Each message is ultimately read by both the source consumer and the target consumer: a duplicate read.
I want only one consumer to get each message, not duplicates at source and destination. In my application, I can't just turn off the consumers at the source and move them to the target cluster (it is a time-critical application), so I want to keep the consumers at both source and target running for some time, and turn off the source consumers after a period where both are active.
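For reference, the group state on each side can be inspected while both sets of consumers run (the host names are placeholders):

# Offsets of the group at the source, reading A...
bin/kafka-consumer-groups.sh --bootstrap-server source-kafka:9092 \
  --group topic-a-reader-group --describe
# ...and the replicated group at the target, reading source.A
bin/kafka-consumer-groups.sh --bootstrap-server target-kafka:9092 \
  --group topic-a-reader-group --describe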
Is there any idempotence property available for consumer groups that lets only one consumer group read each message in the MirrorMaker scenario, so it is not duplicated across the source and target clusters?
Please suggest if there is any other approach to move consumers from the source to the target cluster without downtime.

Same consumer group (s3 sink connector) across two different kafka connect cluster

I'm migrating Kafka connectors from an ECS cluster to a new cluster running on Kubernetes. I successfully migrated the Postgres source connectors by deleting them and recreating them on the exact same replication slots. They keep writing to the same topics in the same Kafka cluster, and the S3 connector in the old cluster continues to read from those and write records into S3. Everything works as usual.
But now, to move the AWS S3 sink connectors, I first created a non-critical S3 connector in the new cluster with the same name as the one in the old cluster. I was going to wait a few minutes before deleting the old one to avoid missing data. To my surprise, it looks like (based on the UI provided by akhq.io) the worker for that new S3 connector joins the existing consumer group of the same name. I was fully expecting to get duplicated data. Based on the Confluent doc,
All Workers in the cluster use the same three internal topics to share connector configurations, offset data, and status updates. For this reason all distributed worker configurations in the same Connect cluster must have matching config.storage.topic, offset.storage.topic, and status.storage.topic properties.
So from this "same Connect cluster" wording, I thought having the same consumer group id only worked within the same Connect cluster. But from my observation, it seems you can have consumers in different Connect clusters belonging to the same consumer group?
Based on this article, __consumer_offsets is used by consumers, and unlike the other hidden "offset"-related topics, it doesn't have any cluster name designation.
Does that mean I could simply create the S3 sink connectors in the new Kubernetes cluster and then delete the ones in the ECS cluster, without duplicating or missing data (as long as they have the same name, and hence the same consumer group)? I'm not sure if this is the pattern people usually use.
I'm not familiar with using a Kafka Connect cluster, but I understand that it is a cluster of connectors that is independent of the Kafka cluster.
In that case, since the connectors are using the same Kafka cluster and you are just moving them from ECS to k8s, it should work as you describe. The consumer offsets and the internal Kafka Connect offsets are stored in the Kafka cluster, so it doesn't really matter where the connectors run as long as they connect to the same Kafka cluster. They will restart from the same position, or behave as additional replicas of the same connector, regardless of where they are running.
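To make that concrete, a sketch of a distributed worker config (all values are placeholders): the three internal topics and the worker group.id scope the Connect cluster, while a sink connector's consumer group lives in the Kafka cluster itself, named connect-<connector name> by default.

# connect-distributed.properties (sketch)
bootstrap.servers=shared-kafka:9092
group.id=connect-cluster-k8s            # scopes which workers form this Connect cluster
config.storage.topic=connect-configs    # must match across workers of this Connect cluster
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
# (plus converter settings, etc.)

So a sink connector named s3-sink commits its offsets under consumer group connect-s3-sink in the shared Kafka cluster, which is why a same-named connector in either Connect cluster joins the same group.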

Kafka MirrorMaker2 automated consumer offset sync

I am using MirrorMaker2 for DR.
Kafka 2.7 should support automated consumer offset sync.
Here is the yaml file I am using (I use strimzi for creating it)
All source cluster topics are replicated in the destination cluster.
Also, a ...checkpoints.internal topic is created in the destination cluster that contains all the synced source cluster offsets, BUT I don't see these offsets being translated into the destination cluster's __consumer_offsets topic, which means that when I start a consumer (same consumer group) in the destination cluster, it will start reading messages from the beginning.
My expectation is that after enabling automated consumer offset sync, all consumer offsets from the source cluster are translated and stored in the __consumer_offsets topic in the destination cluster.
Can someone please clarify if my expectation is correct, and if not, how it should work?
The sync.group.offsets.enabled setting is for the MirrorCheckpointConnector.
I'm not entirely sure how Strimzi runs MirrorMaker 2, but I think you need to set it like:
checkpointConnector:
  config:
    checkpoints.topic.replication.factor: 1
    sync.group.offsets.enabled: "true"
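For context, that block nests under a mirror definition in the KafkaMirrorMaker2 resource; a sketch with placeholder names (the clusters/connectCluster sections are omitted):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mm2
spec:
  mirrors:
    - sourceCluster: source
      targetCluster: target
      topicsPattern: ".*"
      groupsPattern: ".*"   # offsets only sync for groups matching this pattern
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 1
          sync.group.offsets.enabled: "true"
          sync.group.offsets.interval.seconds: 60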

Migration Cloudera Kafka (CDK) to Apache Kafka

I am looking to migrate a small 4-node Kafka cluster with about 300 GB of data on each broker to a new cluster. The problem is we are currently running Cloudera's flavor of Kafka (CDK) and we would like to run Apache Kafka. For the most part CDK is very similar to Apache Kafka, but I am trying to figure out the best way to migrate. I originally looked at using MirrorMaker, but to my understanding it will re-process messages once we cut the consumers over to the new cluster, so I think that is out. I was wondering if we could spin up a new Apache Kafka cluster and add it to the CDK cluster (not sure how this will work yet, if at all), then decommission the CDK brokers one at a time. Otherwise I am out of ideas other than spinning up a new Apache Kafka cluster and making code changes to every producer/consumer to point to the new cluster, which I am not really a fan of as it will cause downtime.
Currently running CDK 3.1.0, which is equivalent to Apache Kafka 1.0.1.
MirrorMaker would copy the data, but not the consumer offsets, so consumers would be left to their configured auto.offset.reset policies.
I was wondering if we could spin up a new Apache Kafka cluster and add it to the CDK cluster
If possible, that would be the most effective way to migrate the cluster. For each new broker, give it a unique broker ID and the same ZooKeeper connection string as the others; then it will be part of the same cluster.
Then you'll need to manually run the partition reassignment tool to move all existing topic partitions off of the old brokers and onto the new ones, as data will not automatically be replicated.
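A sketch of that tool for this Kafka version (the ZooKeeper host, topic, and broker IDs are placeholders):

# List the topics to move, then generate a candidate assignment
# onto the new broker IDs.
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "my-topic"}]}
EOF
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics.json \
  --broker-list "5,6,7,8" --generate

# Save the proposed assignment as reassign.json, then execute and verify.
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassign.json --execute
bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file reassign.json --verify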
Alternatively, you could try shutting down the CDK cluster, backing up the data directories onto the new brokers, then starting the same version of Kafka from your CDK on those new machines (the stored log format is important).
Also make sure that you back up a copy of the server.properties files for the new brokers.

Kafka cluster migration over clouds, how to ensure consumers consume from right offsets when offsets are managed by us?

For a migration of Kafka clusters from AWS to Azure, the challenge is that we are using custom offset management for our consumers. If I replicate the ZK nodes with the offsets, MirrorMaker will change those offsets. Is there any way to ensure the offsets stay the same so that the migration can be smooth?
I think the problem might be your custom offset management. Without more details on this, it's hard to give suggestions.
The problem I see with trying to copy offsets at all: you consume from cluster A, topic T, offset 1000. You copy this to a brand-new cluster B; you now have topic T, offset 0. Having consumers start at offset 1000 will simply fail in this scenario, or, if at least 1000 messages were mirrored, you're effectively skipping that data.
With newer versions of Kafka (post 0.10), MirrorMaker uses the __consumer_offsets topic, not ZooKeeper, since it's built on the newer Java clients.
As for replication tools, uber/uReplicator uses ZooKeeper for offsets.
There are other tools that manage offsets differently, such as Comcast/MirrorTool or salesforce/mirus via the Kafka Connect Framework.
And the enterprise supported tool would be Confluent Replicator, which has unique ways of handling cluster failover and migrations.