Kafka Connect multiple workers with same connect topics - apache-kafka

I'm running Kafka Connect in distributed mode with 3 workers. All of them have the same configuration, i.e. the same group.id and the same internal topic names (connect offset, status, and config topics).
My use case is running Debezium connectors (6 of them) to extract data from 6 different MySQL servers.
Is it good practice to keep all of them on the same internal Kafka topics (the offsets and so on)?

Related

Kafka Connect best practices for topic compaction

I am using Debezium, which makes use of Kafka Connect.
Kafka Connect relies on a couple of topics that need to be created:
OFFSET_STORAGE_TOPIC
This environment variable is required when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector offsets. The topic should have many partitions, be highly replicated (e.g., 3x or more) and should be configured for compaction.
STATUS_STORAGE_TOPIC
This environment variable should be provided when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector status. The topic can have multiple partitions, should be highly replicated (e.g., 3x or more) and should be configured for compaction.
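For context, these variables are passed when the Debezium Connect container is started; a minimal sketch (the topic names here are placeholders of my choosing, not prescribed values):
# sketch: running the Debezium Connect image with the storage-topic variables set
docker run -d --name connect -p 8083:8083 \
  -e BOOTSTRAP_SERVERS=kafka:9092 \
  -e GROUP_ID=1 \
  -e CONFIG_STORAGE_TOPIC=connect-configs \
  -e OFFSET_STORAGE_TOPIC=connect-offsets \
  -e STATUS_STORAGE_TOPIC=connect-status \
  debezium/connect:latest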
Does anyone have any specific recommended compaction configs for these topics?
e.g.
is it enough to set just:
cleanup.policy: compact
unclean.leader.election.enable: true
or also:
min.compaction.lag.ms: 60000
segment.ms: 1800000
min.cleanable.dirty.ratio: 0.01
delete.retention.ms: 100
The defaults should be fine, and Connect will create and configure those topics on its own, unless you preconfigure the topics yourself with your own settings.
These are the only cases I can think of where you would adjust the compaction settings:
a connect group's data lingering in the topics longer than you want it to; for example, a source connector that doesn't start immediately after a long downtime because it's still reading through the offsets topic
your Connect cluster not accurately reporting its state, or tasks not rebalancing appropriately (because the status topic is in a bad state)
The __consumer_offsets topic (also compacted) is what sink connectors use, and it would be configured separately for all consumers, not only Connect.
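If you do decide to tune compaction on an existing internal topic rather than relying on the defaults, a minimal sketch with kafka-configs (assuming a local broker and the default connect-offsets topic name; older Kafka versions take --zookeeper instead of --bootstrap-server):
# tighten compaction on the offsets topic (values taken from the question above)
kafka-configs --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name connect-offsets \
  --add-config min.compaction.lag.ms=60000,segment.ms=1800000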

Using Kafka Connect in distributed mode, where are the internal topics supposed to exist?

As a follow-up to my previous question here, Attempting to run Kafka Connect in distributed mode locally, problem with internal topics, I have started to figure out what might really be going on (I'm learning Kafka as I go).
Kafka Connect, one way or another, requires three internal topics: config, offset, and status. Are these topics supposed to exist in the Kafka cluster I am consuming data from? For context: someone else has a Kafka cluster set up with topics for me to consume. I spin up a Kafka Connect cluster on my local machine (to test), and this local instance (we'll call it that going forward) connects to the remote Kafka cluster (we'll call it the remote cluster) via the bootstrap servers, some callback handler classes, and a connect.jaas file.
Do these three topics need to already exist on the remote cluster? So far I have been trying to create them on my own broker in my local instance, but through continued research I'm seeing that maybe these three internal topics need to be on the remote cluster (where I'm getting my data from). Does the owner of the remote Kafka cluster need to create these three topics for me? Where exactly would they create them? What if their cluster is not a Kafka Connect cluster specifically?
The topics need to be created on the cluster defined by bootstrap.servers in the Connect worker properties. This can be local or remote, depending on where you actually want the connector tasks to send/receive data. Individual Connect tasks cannot override which brokers are used (it's not possible, for example, to use a source connector to write to multiple Kafka clusters).
Recent versions of Kafka Connect will automatically create those internal topics if the worker is authorized to do so. Otherwise, yes, they'll need to be created using kafka-topics --create with appropriate partition counts and replication factors.
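As a sketch of that manual creation, assuming the default topic names and a 3-broker cluster (note the config topic must have exactly one partition; 25 and 5 are the usual defaults for the offsets and status topics):
# create the three internal topics ahead of time, all compacted
kafka-topics --bootstrap-server broker:9092 --create --topic connect-configs \
  --partitions 1 --replication-factor 3 --config cleanup.policy=compact
kafka-topics --bootstrap-server broker:9092 --create --topic connect-offsets \
  --partitions 25 --replication-factor 3 --config cleanup.policy=compact
kafka-topics --bootstrap-server broker:9092 --create --topic connect-status \
  --partitions 5 --replication-factor 3 --config cleanup.policy=compact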
If your data exists in a remote Kafka cluster, the only reason to run a local instance is if you want to use MirrorMaker, for example.
What if their cluster is not a Kafka Connect cluster specifically?
Unclear what this means. Kafka Connect is a client, just like a Kafka Streams app or a normal producer or consumer. It doesn't store topics itself.

In Kafka Connect, how do you connect to multiple Kafka clusters?

I set up my Kafka Connect cluster in distributed mode and I want to connect to multiple Kafka CLUSTERS, not just multiple brokers.
Target brokers can be set with bootstrap.servers in connect-distributed.properties.
First, I set broker1 from kafka-cluster-A, like below:
bootstrap.servers=broker1:9092
This worked well.
And then, I added broker2 from kafka-cluster-B like below:
bootstrap.servers=broker1:9092,broker2:9092
These two brokers are in different clusters, and this didn't work at all.
There was no error; requests such as creating a connector through the REST API simply hung with no response.
How can I connect with multiple kafka clusters?
As far as I know, you can only connect a Kafka Connect worker to one Kafka cluster.
If you have data on different clusters that you want to handle with Kafka Connect then run multiple Kafka Connect worker processes.
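As a minimal sketch of that setup (file names, group IDs, and topic names below are placeholders), one worker properties file per cluster:
# worker-a.properties — one Connect worker process per Kafka cluster
bootstrap.servers=broker1:9092
group.id=connect-cluster-a
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets-a
config.storage.topic=connect-configs-a
status.storage.topic=connect-status-a
listeners=http://:8083

# worker-b.properties — a second, independent worker process for kafka-cluster-B
# (different REST listener port so the two can coexist on one host)
bootstrap.servers=broker2:9092
group.id=connect-cluster-b
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets-b
config.storage.topic=connect-configs-b
status.storage.topic=connect-status-b
listeners=http://:8084

Each file is then started as its own process (connect-distributed.sh worker-a.properties, connect-distributed.sh worker-b.properties), giving you one self-contained Connect cluster per Kafka cluster.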

Kafka machines in the cluster and Kafka communications

We have a Kafka cluster with 3 Kafka broker nodes and 3 ZooKeeper servers.
Kafka version - 0.10.1 (Hortonworks).
From my understanding, all the metadata is located on the ZooKeeper servers, and the Kafka brokers use this data (Kafka talks to the ZooKeeper servers via port 2181).
I'm just wondering: does each Kafka machine talk to the other Kafka brokers in the cluster, or do the brokers only get/put data from/on the ZooKeeper servers?
So does the Kafka service need to communicate with the other Kafka brokers in the cluster, or do the Kafka machines get everything they need from the ZooKeeper servers alone?
Kafka brokers certainly need to communicate with each other, most importantly to replicate data. Data produced to Kafka is replicated across brokers for fault tolerance and data durability. Partition followers send FetchRequests to partition leaders in order to replicate the data.
Additionally, the Controller broker sends a LeaderAndIsr request to brokers whenever a partition leader/follower is changed - that's how it informs brokers to start leading a partition or replicating it.
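This leader/follower relationship is visible with the kafka-topics tool; an illustrative sketch (on a 0.10.x cluster the tool takes --zookeeper, newer versions take --bootstrap-server; topic name is a placeholder):
kafka-topics.sh --zookeeper zk1:2181 --describe --topic my-topic
# illustrative output: broker 1 leads partition 0; brokers 2 and 3 fetch from it to stay in sync
Topic: my-topic  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3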
I would recommend these two introductory articles of mine to help you get more context:
https://hackernoon.com/thorough-introduction-to-apache-kafka-6fbf2989bbc1
https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302

Is scaling Kafka Connect the same as scaling Kafka consumers?

We need to pull data from Kafka and write it into AWS S3. The Kafka cluster is managed by a separate department, and we have access to only one specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message-processing logic.
Normally, when we run a Kafka consumer, we can run multiple JVMs with the same consumer group for scalability. The consumer JVMs can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have (for example):
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id, which identifies them as members of the same cluster (and thus eligible for sharing the workload of tasks across them). More config details for distributed mode here.
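As a minimal sketch of that shared config (all values here are placeholders; the key point is that every node uses the same group.id and the same three internal topic names):
# connect-distributed.properties — identical on every worker node
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
group.id=connect-cluster-1
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status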
Even if you're running Kafka Connect on a single node, I would personally recommend running it in distributed mode, as it makes scale-out simpler (you just add additional nodes; the execution & config remain the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks and connectors as required.
My understanding is that if you only have a single machine, you should launch only one Kafka Connect instance and set the tasks.max property to the degree of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
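For example, a sketch of submitting a connector with tasks.max set to 20 through the Connect REST API, assuming the Confluent S3 sink connector and a worker on localhost:8083 (connector, topic, and bucket names are placeholders):
# each of the 20 tasks becomes a consumer in the same group, up to one per partition
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "s3-sink",
    "config": {
      "connector.class": "io.confluent.connect.s3.S3SinkConnector",
      "topics": "my-topic",
      "tasks.max": "20",
      "s3.bucket.name": "my-bucket",
      "s3.region": "us-east-1",
      "storage.class": "io.confluent.connect.s3.storage.S3Storage",
      "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
      "flush.size": "1000"
    }
  }'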
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics, but if you want the instances to consume data from the same topic, I don't think it would benefit you: separate threads within the same process via tasks.max will give you the same, if not better, performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, you can run it in distributed mode.