Consolidate and migrate multiple Kafka clusters to one cluster

I have 2 on-prem Kafka clusters in 2 environments, dev and test (they have the same topic names). Now I want to consolidate them into a single cluster (AWS MSK). I would like my new Kafka cluster to have the topics from both environments, differentiated by a prefix in their names, for example dev_topicA and test_topicA. Is that possible?

It is possible with MirrorMaker 2:
run one MirrorSourceConnector with source.cluster.alias=dev and the source cluster bootstrap servers pointing at the dev cluster
run another MirrorSourceConnector with source.cluster.alias=test and the source cluster bootstrap servers pointing at the test cluster
in both connectors, use your MSK cluster as the target cluster bootstrap
The default replication policy is the one that prefixes replicated topic names with the source cluster alias.
The default separator is a dot (.). If you would like _ as in your example, override the default separator: replication.policy.separator=_
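As a concrete sketch, assuming the connectors run on a Connect cluster backed by MSK, the REST payload for the first one could look like the following (the connector name, bootstrap addresses, and topics filter are placeholders):

    {
      "name": "mm2-dev-to-msk",
      "config": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": "dev",
        "source.cluster.bootstrap.servers": "dev-broker1:9092,dev-broker2:9092",
        "target.cluster.bootstrap.servers": "msk-broker1:9092,msk-broker2:9092",
        "topics": ".*",
        "replication.policy.separator": "_"
      }
    }

The second connector would be identical except for source.cluster.alias=test and the test cluster's bootstrap servers; with the _ separator, topicA from each source arrives as dev_topicA and test_topicA respectively.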

Related

confluent_kafka, how to specify cluster for producer config

I have two environments in a dev Confluent Cloud account, each with a single cluster. I can see that both clusters have the same bootstrap server, and this is documented:
In the Confluent Cloud Console, you may see the same bootstrap server for different clusters. This is working as designed; it occurs because Confluent Cloud clusters are multi-tenant.
My problem is that when attempting to produce to a topic, it appears that the producer is connected to the wrong cluster; I'll get:
cimpl.KafkaException: KafkaError{code=_UNKNOWN_TOPIC,val=-188,str="Unable to produce message: Local: Unknown topic"}
And producer.list_topics() shows the topics from the other cluster, not the one I'm working on.
So how do I specify the exact cluster, the one with the right topics? I was expecting to be able to provide cluster.id in my configuration. But that returns
KafkaError{code=_INVALID_ARG,val=-186,str="No such configuration property: "cluster.id""}
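For reference, cluster.id is not part of librdkafka's client configuration, which is why the client rejects it. My understanding (an assumption worth verifying against the Confluent Cloud docs) is that on multi-tenant clusters the logical cluster is selected by the bootstrap endpoint together with the cluster-scoped API key used to authenticate, so a producer configuration along these lines (all values are placeholders) should land on the intended cluster:

    # librdkafka / confluent_kafka producer configuration (placeholders)
    bootstrap.servers=pkc-xxxxx.us-east-1.aws.confluent.cloud:9092
    security.protocol=SASL_SSL
    sasl.mechanisms=PLAIN
    # API key/secret created for the specific cluster you want to reach
    sasl.username=<cluster-scoped API key>
    sasl.password=<matching API secret>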

Same consumer group (S3 sink connector) across two different Kafka Connect clusters

I'm migrating Kafka connectors from an ECS cluster to a new cluster running on Kubernetes. I successfully migrated the Postgres source connectors by deleting them and recreating them against the exact same replication slots. They keep writing to the same topics in the same Kafka cluster, and the S3 connector in the old cluster continues to read from those and write records into S3. Everything works as usual.
But now, to move the AWS S3 sink connectors, I first created a non-critical S3 connector in the new cluster with the same name as the one in the old cluster. I was going to wait a few minutes before deleting the old one to avoid missing data. To my surprise, it looks like (based on the UI provided by akhq.io) the worker for that new S3 connector joins the existing consumer group. I was fully expecting to see duplicated data. Based on the Confluent doc,
All Workers in the cluster use the same three internal topics to share connector configurations, offset data, and status updates. For this reason all distributed worker configurations in the same Connect cluster must have matching config.storage.topic, offset.storage.topic, and status.storage.topic properties.
So from this "same Connect cluster" wording, I thought having the same consumer group id only worked within the same Connect cluster. But from my observation, it seems like you can have multiple consumers in different Connect clusters belonging to the same consumer group?
Based on this article, __consumer_offsets is used by consumers and, unlike the other hidden "offset"-related topics, it doesn't carry any cluster name designation.
Does that mean I could simply create the S3 sink connectors in the new Kubernetes cluster and then delete the ones in the ECS cluster without duplicating or missing data (as long as they have the same name -> same consumer group)? I'm not sure if this is the right pattern people usually use.
I'm not familiar with running a Kafka Connect cluster, but I understand that it is a cluster of workers running connectors, independent of the Kafka cluster itself.
In that case, since the connectors are using the same Kafka cluster and you are just moving them from ECS to k8s, it should work as you describe. The consumer offsets and the internal Kafka Connect offsets are stored in the Kafka cluster, so it doesn't really matter where the connectors run as long as they connect to the same Kafka cluster. They should restart from the same position, or behave as additional replicas of the same connector, regardless of where they are running.
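For illustration: a sink connector's consumer group defaults to connect-<connector name>, so two connectors with the same name in different Connect clusters share one group on the Kafka cluster. A hypothetical S3 sink config (the name, topic, and bucket are placeholders) that would behave this way if created identically in both clusters:

    # Same "name" in the ECS and k8s Connect clusters
    # -> both consume as group "connect-s3-sink-orders"
    name=s3-sink-orders
    connector.class=io.confluent.connect.s3.S3SinkConnector
    topics=orders
    s3.bucket.name=my-bucket
    s3.region=us-east-1
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.json.JsonFormat
    flush.size=1000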

How can I configure my connector to run in a specific worker group in a multi-cluster distributed Kafka Connect environment?

As per the documentation, the worker service is started before connectors are added. Suppose I am running worker-a with group.id "cluster-a" and worker-b with group.id "cluster-b" on three distributed VMs. What is the configuration that makes connectors choose their worker group?
Suppose I need the Debezium MySQL connector's tasks to run on cluster-a and all of the JDBC connector's tasks on cluster-b. How should I do it?
Thanks in advance.
The workers in each Connect group (specified by group.id) communicate with each other over HTTP via their rest.advertised.listener (similar to the brokers). Each Connect cluster also requires its own unique config, offset, and status topics.
You'd HTTP POST to one of the group's rest.port endpoints, and the tasks will be distributed within that group, as sketched below.
However, if you only have 3 machines, there's really no need to set up two separate Connect clusters (the JDBC and Debezium tasks can run in the same cluster).
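As a sketch, pinning the Debezium tasks to cluster-a means POSTing the connector config to a worker that was started with group.id=cluster-a (hostnames, ports, and connection details are placeholders, and Debezium property names vary somewhat between versions):

    POST http://worker-a:8083/connectors
    {
      "name": "mysql-source",
      "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory"
      }
    }

The JDBC connector would be POSTed the same way to a worker-b host, i.e. to the cluster-b group's REST endpoint, and its tasks would then run only in that group.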

Confluent Schema Registry Master

For a cross-network Confluent Platform, we have one Kafka cluster on-premise and another on AWS, with data replicated from on-prem to AWS using MirrorMaker. Both clusters are independent, with their own schema registry, REST proxy, and Connect. Both clusters have different sets of producers and consumers, and selected topics are being mirrored between clusters.
What is the best practice for deploying the schema registry? Should we have one master (say on-premise) and the other registries as non-eligible masters on on-prem and AWS?
We suspect the schema registry can have issues with schema ids when topics are replicated between clusters and we have 2 masters (AWS and on-prem).
Thanks!
If you use two different master registries, I find that would be difficult to manage. (See mistake #2 for self-managed registries.) The purpose of master.eligibility=false on a second instance/cluster is that all ID registration events have a single source of truth. As the docs say, "The Schema Registry nodes in both datacenters link to the primary Kafka cluster in DC A", so you would need to establish a valid network link between AWS and on-prem anyway.
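A minimal sketch of the secondary (AWS) registry's properties under that single-master layout; the kafkastore address and listener are placeholders, and note that newer Confluent Platform releases rename the property to leader.eligibility:

    # Secondary Schema Registry instance in AWS, follows the on-prem master
    kafkastore.bootstrap.servers=PLAINTEXT://onprem-kafka-broker:9092
    master.eligibility=false
    listeners=http://0.0.0.0:8081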
Otherwise, with multiple masters, you will need to mirror the schemas topic if you want exactly the same subjects and schema ids between environments. However, that is primarily meant to be used as a backup, and you would eventually run into conflicting schema IDs for any producer in the destination region pushing schemas to the other master. Hence the first diagram shows only consumers in the remote datacenter.
If you do not do this, then say you mirrored a topic from cluster A to cluster B and a consumer used registry B in its settings: it would attempt to look up the ID embedded in the message (which was registered in registry A), and that ID either would not exist in registry B or would point to the wrong schema for the topic being read.
I wrote a Kafka Connect plugin to work around that issue by registering a new ID in a remote master registry: https://github.com/cricket007/schema-registry-transfer-smt . Though, since you said you're using MirrorMaker, you would need to take the logic there and apply it to the MessageHandler interface in MirrorMaker.
I've really only worked with one master, on-prem; in AWS, the registry settings have the Zookeeper connection pointing at the on-prem cluster.
And we don't mirror everything as the docs suggest, only specific topics. The advantage of using Replicator rather than MirrorMaker is that consumer failover is better supported; rather than simply getting data "over the wire", your clients are also less dependent upon where they are running.

2 clusters of Zookeeper servers in a Hadoop + Kafka cluster - is it possible?

We have a Kafka cluster with the following details:
3 Kafka machines
3 Zookeeper servers
We also have a Hadoop cluster that includes datanode machines, and all applications are using the Zookeeper servers, including the Kafka machines.
Now we want to make the following change: add 3 additional Zookeeper servers that will form a separate cluster, and only the Kafka machines will use these additional Zookeeper servers.
Is that possible?
Yes, this can be achieved: edit ha.zookeeper.quorum in the Hadoop configuration to be separate from zookeeper.connect in the Kafka configuration, so that you have two individual Zookeeper clusters.
However, I don't think Ambari or Cloudera Manager, for example, allow you to view or configure more than one Zookeeper cluster at a time.
Yes, that's possible. Kafka uses Zookeeper to perform various distributed coordination tasks, such as deciding which Kafka broker is responsible for allocating partition leaders and storing metadata about the topics in the broker.
After shutting Kafka down, the original Zookeeper cluster's data can be copied to the new cluster using a tool such as zkcopy, a Zookeeper data transfer utility.
But if your Kafka cluster cannot stop working, you should think carefully about how to transfer the Zookeeper data to the additional Zookeeper servers.
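A minimal sketch of the resulting split, assuming hypothetical hostnames zk-kafka-1..3 for the new Kafka-only ensemble and zk-old-1..3 for the original one:

    # Kafka server.properties (points at the new, Kafka-only ensemble)
    zookeeper.connect=zk-kafka-1:2181,zk-kafka-2:2181,zk-kafka-3:2181

    <!-- Hadoop core-site.xml (keeps the original ensemble) -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk-old-1:2181,zk-old-2:2181,zk-old-3:2181</value>
    </property>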