Batch creation of Kafka Connector - apache-kafka

I'm trying to create 1000 Connectors, each with one task, its own consumer group, and a unique topic, in my Kafka Kubernetes cluster
(My end goal, after creating the connectors, is to send a lot of requests to the connectors' topics and measure the performance of our sink connector implementation).
Every creation triggers a rebalance across the cluster, which "blocks" the Connect REST API (it returns 409 for everything) and shuts down tasks.
Therefore I have three questions:
Is the rebalancing a sort of downtime for the connector (as I said, tasks are shut down and restarted while rebalancing, and there is one task per connector)?
Can I configure the rebalancing schedule?
Is there a way of creating Connectors in batches so that it is fast (say, creating 100 connectors in less than one second) and won't cause downtime (if the answer to the first question is yes)?
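For context, here is a minimal sketch of the kind of creation loop involved, using the Connect REST API; the connector class, naming scheme, and worker URL are hypothetical, and the retry on 409 reflects the rebalance behaviour described above:

```python
import time
import requests

CONNECT_URL = "http://connect:8083"  # hypothetical Connect worker REST endpoint

def create_connector(i: int) -> None:
    # One task and one unique topic per connector; the class name is a
    # placeholder for our own sink connector implementation.
    config = {
        "name": f"perf-sink-{i}",
        "config": {
            "connector.class": "com.example.MySinkConnector",
            "tasks.max": "1",
            "topics": f"perf-topic-{i}",
        },
    }
    while True:
        resp = requests.post(f"{CONNECT_URL}/connectors", json=config)
        if resp.status_code != 409:   # 409 = rebalance in progress
            resp.raise_for_status()
            return
        time.sleep(1)                 # back off and retry until the rebalance ends

for i in range(1000):
    create_connector(i)
```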

One way around the problem would be to start 1000 Connect Clusters (say, via Docker orchestration), each with one or a few Connectors.
There isn't a way around the rebalancing. You're adding consumers to the same consumer group, so that'll always rebalance.
Rather than running one task per connector, I would suggest grouping together multiple topics/tasks that share similar configurations; that way you limit how much rebalancing is done.
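As an illustration of that grouping, one connector can cover many topics with several tasks, so the same 1000 topics could be served by a much smaller number of connectors; the names, counts, and connector class below are only examples:

```python
import requests

CONNECT_URL = "http://connect:8083"  # hypothetical Connect worker REST endpoint

# One connector covering 100 topics with 10 tasks, instead of 100 connectors
# with one task each; ten of these would cover all 1000 topics.
topics = ",".join(f"perf-topic-{i}" for i in range(100))
config = {
    "name": "perf-sink-group-0",
    "config": {
        "connector.class": "com.example.MySinkConnector",  # placeholder class
        "tasks.max": "10",
        "topics": topics,
    },
}
requests.post(f"{CONNECT_URL}/connectors", json=config).raise_for_status()
```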

Related

What is the relationship between connectors and tasks in Kafka Connect?

We've been using Kafka Connect for a while on a project, currently entirely using only the Confluent Kafka Connect JDBC connector. I'm struggling to understand the role of 'tasks' in Kafka Connect, and specifically with this connector. I understand 'connectors'; they encompass a bunch of configuration about a particular source/sink and the topics they connect from/to. I understand that there's a 1:Many relationship between connectors and tasks, and the general principle that tasks are used to parallelize work. However, how can we understand when a connector will/might create multiple tasks?
In the source connector case, we are using the JDBC connector to pick up source data by timestamp and/or a primary key, and so this seems in its very nature sequential. Indeed, all of our source connectors only ever seem to have one task. What would ever trigger Kafka Connect to create more than one task? Currently we are running Kafka Connect in distributed mode, but only with one worker; if we had multiple workers, might we get multiple tasks per connector, or are the two not related?
In the sink connector case, we are explicitly configuring each of our sink connectors with tasks.max=1, and so unsurprisingly we only ever see one task for each connector there too. If we removed that configuration, presumably we could/would get more than one task. Would this mean the messages on our input topic might be consumed out of sequence? In which case, how is data consistency for changes assured?
Also, from time to time, we have seen situations where a single connector and task will both enter the FAILED state (because of input connectivity issues). Restarting the task will remove it from this state, and restart the flow of data, but the connector remains in FAILED state. How can this be - isn't the connector's state just the aggregate of all its child tasks?
A task is a thread that performs the actual sourcing or sinking of data.
The number of tasks per connector is determined by the implementation of the connector. Take a Debezium source connector for MySQL as an example: since one MySQL instance writes to exactly one binlog file at a time, and that file has to be read sequentially, one connector generates exactly one task.
For sink connectors, on the other hand, the useful number of tasks is bounded by the number of partitions of the topic(s) being consumed (and by tasks.max).
The task distribution among workers is determined by task rebalance which is a very similar process to Kafka consumer group rebalance.
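On the FAILED-state question, the connector and its tasks carry separate states that can be inspected and restarted individually through the Connect REST API, which is why a task can recover while the connector object stays FAILED; a rough sketch (the worker URL and connector name are assumptions):

```python
import requests

CONNECT_URL = "http://connect:8083"   # assumed Connect worker REST endpoint
NAME = "jdbc-source-example"          # hypothetical connector name

# The status endpoint reports the connector state and each task's state separately.
status = requests.get(f"{CONNECT_URL}/connectors/{NAME}/status").json()
print("connector:", status["connector"]["state"])        # e.g. RUNNING or FAILED
for task in status["tasks"]:
    print("task", task["id"], task["state"], task["worker_id"])
    if task["state"] == "FAILED":
        # Restarting a task does not change the state of the connector object itself.
        requests.post(f"{CONNECT_URL}/connectors/{NAME}/tasks/{task['id']}/restart")
```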

Running a single kafka s3 sink connector in standalone vs distributed mode

I have a kafka topic "mytopic" with 10 partitions and want to use S3 sink connector to sink records to an S3 bucket. For scaling purposes it should be running on multiple nodes to write partitions data in parallel to the same S3 bucket.
In Kafka connect user guide and actually many other blogs/tutorials it's recommended to run workers in distributed mode instead of standalone to achieve better scalability and fault tolerance:
... distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime.
I want to figure out which mode to choose for my use case: having one logical connector running on multiple nodes in parallel. My understanding is following:
If I run in distributed mode, I will end up having only 1 worker processing all the partitions, since it's considered one connector task.
Instead I should run in standalone mode in multiple nodes. In that case I will have a consumer group and achieve parallel processing of partitions.
In above described standalone scenario I will actually have fault tolerance: if one instance dies, the consumer group will rebalance and other standalone workers will handle the freed partitions.
Is my understanding correct or am I missing something?
Unfortunately I couldn't find much information on this topic other than this google groups discussion, where the author came to the same conclusion as I did.
In theory, that might work, but you'll end up SSH-ing to multiple machines, maintaining basically the same config files everywhere, and simply using the connect-standalone command instead of connect-distributed.
You're missing the part about Connect server task rebalancing, though, which communicates over the Connect server REST ports.
The underlying task code is all the same, only the entrypoint and offset storage are different. So, why not just use distributed if you have multiple machines?
You don't need to run multiple instances of standalone processes; in distributed mode the Kafka Connect workers take care of distributing the tasks, rebalancing, and offset management. You just need to specify the same group id ...
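To make that concrete for the 10-partition topic above, a distributed Connect cluster would typically get a single S3 sink connector with tasks.max matching the partition count, submitted to any worker's REST endpoint; the bucket, region, flush size, and worker URL below are placeholders, and the exact config keys can vary by connector version:

```python
import requests

CONNECT_URL = "http://connect:8083"  # any worker in the distributed cluster (assumption)

config = {
    "name": "s3-sink-mytopic",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "10",                     # up to one task per partition of mytopic
        "topics": "mytopic",
        "s3.bucket.name": "my-bucket",         # placeholder bucket
        "s3.region": "us-east-1",              # placeholder region
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",                  # example value
    },
}
requests.post(f"{CONNECT_URL}/connectors", json=config).raise_for_status()
```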

What happens to consumer groups in Kafka if the entire cluster goes down?

We have a consumer service that is always trying to read data from a topic using a consumer group. Due to redeployments, our Kafka cluster is periodically brought down and recreated.
Whenever the cluster comes back up, we observe that although the previous topics are picked up (probably from ZooKeeper), the previous consumer groups are not recreated. Because of this, our running consumer process, which was created with a previous consumer group, gets stuck and never recovers.
Is this how the behavior of the consumer groups should be or is there a configuration we need to enable somewhere?
Any help is greatly appreciated.
Kafka brokers keep track of consumer group membership and committed offsets; if the entire cluster is destroyed and recreated, it no longer has knowledge of those consumers and groups, including their offsets. The consumers will have to reconnect and re-establish the group, and their offsets will be reset (typically to the beginning of the topic, depending on auto.offset.reset).
Operationally it makes more sense to keep the Kafka cluster running long-term, and do version upgrades in a rolling fashion so you don't interrupt the service.
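Since the committed offsets can disappear with the cluster, the consumer's auto.offset.reset setting decides where it resumes when no offsets exist for the group; a minimal sketch with the kafka-python client, where the topic, brokers, and group id are assumptions:

```python
from kafka import KafkaConsumer

# With no committed offsets for this group (e.g. after the cluster was recreated),
# auto_offset_reset decides where consumption starts: "earliest" re-reads the topic
# from the beginning, "latest" only picks up new records.
consumer = KafkaConsumer(
    "my-topic",                          # placeholder topic
    bootstrap_servers=["kafka:9092"],    # placeholder broker address
    group_id="my-consumer-group",        # placeholder consumer group
    auto_offset_reset="earliest",
)

for record in consumer:
    print(record.topic, record.partition, record.offset)
```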

Kafka Producer, multi DC failover support

I have two distinct Kafka clusters located in different data centers, DC1 and DC2. How do I organize Kafka producer failover between the two DCs? If the primary Kafka cluster (DC1) becomes unavailable, I want the producer to switch to the failover Kafka cluster (DC2) and continue publishing to it. The producer should also be able to switch back to the primary cluster once it is available. Any good patterns, existing libs, approaches, code examples?
Each partition of the Kafka topic your producer is publishing to has a separate leader, often spread across multiple brokers in the cluster, so the producer is connected to many “primary” brokers simultaneously. Should any one of them fail, another in-sync replica (ISR) will be elected as leader and automatically take over. You do not need to do anything in your client app for it to reconnect to the new leader(s), retry any failed requests, and continue.
If this is for Multi-Data Center (MDC) failover then things get much more complicated, depending on whether the client apps die as well or whether they keep running and just need their cluster connections to fail over. Offsets are not preserved across multiple Kafka clusters, so while producers are simpler, consumers need to call GetOffsetsForTimes() upon failover.
For a great write-up of the MDC failover modes and best practices, see the MDC whitepaper here: https://www.confluent.io/white-paper/disaster-recovery-for-multi-datacenter-apache-kafka-deployments/
Since you asked only about producers, your app can detect if the primary cluster is down (say, after a certain number of retries) and then, instead of attempting to reconnect, connect to another broker list from the secondary cluster. Alternatively, you can redirect the DNS names of the broker-list hosts to point to the secondary cluster.
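A rough sketch of that application-level failover with the kafka-python client; the broker lists, topic, and retry threshold are all assumptions, and this is only an outline rather than a complete disaster-recovery solution:

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

PRIMARY = ["dc1-broker1:9092", "dc1-broker2:9092"]    # placeholder DC1 brokers
SECONDARY = ["dc2-broker1:9092", "dc2-broker2:9092"]  # placeholder DC2 brokers
MAX_FAILURES = 5                                      # failover threshold (assumption)

def make_producer(brokers):
    return KafkaProducer(bootstrap_servers=brokers, acks="all", retries=3)

producer = make_producer(PRIMARY)
failures = 0

def send(topic, value):
    global producer, failures
    try:
        producer.send(topic, value).get(timeout=10)   # block so errors surface here
        failures = 0
    except KafkaError:
        failures += 1
        if failures >= MAX_FAILURES:
            # Primary looks unavailable: close and reconnect to the secondary cluster.
            producer.close()
            producer = make_producer(SECONDARY)
            failures = 0

send("events", b"hello")  # placeholder topic and payload
```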

Is scaling Kafka Connect the same as scaling a Kafka Consumer?

We need to pull data from Kafka and write it into AWS S3. The Kafka cluster is managed by a separate department, and we only have access to a specific topic.
Based on the Kafka documentation it looks like Kafka Connect is an easy solution for me because I don't have any custom message processing logic.
Normally when we run a Kafka Consumer we can run multiple JVMs with the same consumer group for scalability. The consumer JVMs can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have, for example:
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id which identifies them as members of the same cluster (and thus eligible for sharing workload of tasks out across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out simpler (you just add additional nodes, but the execution & config remain the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks and connectors as required.
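For reference, a hedged sketch of the worker-level settings that every node's connect-distributed.properties would share so the workers join the same cluster; the broker address, group.id, and internal topic names are only examples:

```python
# Settings that typically go in connect-distributed.properties on every node;
# all nodes share the same group.id and internal topics so they form one Connect cluster.
# The values below are examples, not a complete configuration.
worker_config = {
    "bootstrap.servers": "kafka:9092",             # placeholder broker address
    "group.id": "connect-cluster-1",               # identical on every worker
    "config.storage.topic": "connect-configs",
    "offset.storage.topic": "connect-offsets",
    "status.storage.topic": "connect-status",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
}

with open("connect-distributed.properties", "w") as f:
    for key, value in worker_config.items():
        f.write(f"{key}={value}\n")
```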
My understanding is that if you only have a single machine, you should only launch one kafka connect instance, and configure the tasks.max property to the amount of parallelism you'd like to achieve (in your example 20 might be good). This should allow kafka connect to read from your partitions in parallel, see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want kafka connect to run on multiple machines and read data from the same topic it is possible to run in distributed mode.
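To see how the tasks actually end up spread across workers (and therefore partitions), the Connect status endpoint reports a worker_id per task; a short sketch where the worker URL and connector name are assumed:

```python
import requests

CONNECT_URL = "http://connect:8083"  # any worker in the Connect cluster (assumption)
NAME = "s3-sink"                     # hypothetical connector name

status = requests.get(f"{CONNECT_URL}/connectors/{NAME}/status").json()
for task in status["tasks"]:
    # Each task reports the worker it is currently assigned to.
    print(f"task {task['id']} -> {task['worker_id']} ({task['state']})")
```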