How do Kafka Connect workers allocate and manage resource limits (memory/cores) to distribute tasks? - apache-kafka

In Kubernetes, you explicitly specify the resource limits for a container. When launching a Kafka connector, you request a maximum number of tasks, but how does the Connect worker cluster know how to distribute the load? Does it treat all tasks as equal? Does it use internal metrics?
The Apache Kafka docs and the Confluent docs do not say explicitly, except that Confluent advises the following, which would suggest that Connect workers do not do resource management:
The resource limit depends heavily on the types of connectors being run by the workers, but in most cases users should be aware of CPU and memory bounds when running workers concurrently on a single machine.
https://docs.confluent.io/3.1.2/connect/userguide.html#connect-standalone-v-distributed
Also the cluster deployment appears to require an external resource manager to handle failover of workers.
Kafka Connect workers can be deployed in a number of ways, each with their own benefits. Workers lend themselves well to being run in containers in managed environments such as YARN, Mesos, or Docker Swarm as all state is stored in Kafka, making the local processes themselves stateless. We provide Docker images and documentation for getting started with those images is here. By design, Kafka Connect does not automatically handle restarting or scaling workers which means your existing clustering solutions can continue to be used transparently.

how does the connect worker cluster know how to distribute the load
Each connector can opt to partition its work into tasks (for example, ingesting multiple tables from one database could be done in parallel, with each table handled by its own task), up to the configured tasks.max limit.
Kafka Connect balances these tasks across the available workers such that they are evenly distributed (based on the number of tasks).
The rebalancing protocol changed in release 2.3 of Apache Kafka as part of KIP-415; there are details in the KIP and here. In a nutshell, with incremental cooperative rebalancing Kafka Connect spreads the tasks evenly, starting with the least loaded workers and gradually including more workers as the load evens out.
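To make this concrete, here is a minimal sketch (Python standard library only) of registering a connector with a tasks.max bound through a worker's REST API. The worker URL, connector name, and database details are placeholders, and the Confluent JDBC source connector is assumed to be installed:

import json
import urllib.request

# Hypothetical JDBC source ingesting several tables; Connect may split the
# work into up to four tasks (for example, one per table) and will spread
# those tasks across the workers in the cluster.
connector = {
    "name": "demo-jdbc-source",  # placeholder name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "4",
        "connection.url": "jdbc:postgresql://db:5432/mydb",  # placeholder
        "mode": "bulk",
        "table.whitelist": "orders,customers,payments,invoices",
        "topic.prefix": "db-",
    },
}

request = urllib.request.Request(
    "http://localhost:8083/connectors",  # any worker in the Connect cluster
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(request)

Note that tasks.max is only an upper bound; the connector itself decides how many tasks it actually creates, and Kafka Connect decides where they run.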
Also the cluster deployment appears to require an external resource manager to handle failover of workers.
To be clear - the failover of tasks is done automatically by Kafka Connect, and as you say, the failover of workers would be managed externally.

Related

Building a Kafka Cluster using two servers only

I'm planning to build a Kafka Cluster using two servers, and host Zookeeper on these two servers as well.
The question is: since Kafka requires ZooKeeper to run, what is the best ZooKeeper cluster layout for implementing a Kafka cluster on two servers?
For example, I'm currently running two ZooKeepers across the two servers and one Kafka broker on each server, and in the Kafka configuration the brokers point to both ZooKeepers.
Is there a better way to do this?
First of all, you don't have to set up ZooKeeper and Kafka on the same servers. One of the roles of ZooKeeper is electing the controller (the broker responsible for maintaining the leader/follower relationship for all the partitions). For the election, a majority of ZooKeeper nodes must be alive. In your case, if even one ZooKeeper instance is down, you cannot elect a controller, so there is no difference between having one ZooKeeper or two. That's why it is recommended to have at least 3 nodes in a ZooKeeper cluster: that way you can handle the failure of one ZooKeeper node.
In addition to this, it is highly recommended to have at least three brokers in your Kafka cluster to maintain both consistency and high availability. (link1, link2)
UPDATE:
As long as you are limited to only two servers, you can consider sacrificing some availability by configuring your brokers with min.insync.replicas=2 and creating topics with replication.factor=2. If high availability is more important than avoiding data loss, then you can use the min.insync.replicas=1 (default) broker config, again with topic replication.factor=2. In this circumstance, those are your options IMHO. (Having one or two ZooKeepers does not matter, as I mentioned above.)
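As a rough sketch of the durability-first variant, using the kafka-python client (the broker addresses and topic name are placeholders):

from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

# Placeholder broker addresses for the two-server setup.
admin = KafkaAdminClient(bootstrap_servers="server1:9092,server2:9092")

# replication.factor=2 plus min.insync.replicas=2: acks=all writes fail as
# soon as one of the two brokers is down (availability traded for durability).
admin.create_topics([
    NewTopic(
        name="orders",
        num_partitions=6,
        replication_factor=2,
        topic_configs={"min.insync.replicas": "2"},
    )
])

# acks=all is what actually enforces the min.insync.replicas guarantee.
producer = KafkaProducer(bootstrap_servers="server1:9092,server2:9092", acks="all")
producer.send("orders", b"example record")
producer.flush()

For the availability-first variant you would keep replication.factor=2 but leave min.insync.replicas at its default of 1.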
I am often faced with the same problem as you, #frisky5, where I would like to achieve a "suboptimal" HA system using only 2 nodes, and thus workarounds are always needed with cloud-native frameworks that rely on the assumption that clusters will have lots of nodes available.
That isn't always the case in real life, is it? ;)
That being said, I see you as essentially having 2 options:
Externalize the ZooKeeper configuration and data onto a replicated storage system spanning the 2 nodes (e.g. DRBD).
Replicate the Kafka data volumes entirely onto the second node and use 2 one-node Kafka clusters that you switch on and off depending on which node is the current master.
I would go for the first option. In that case you would have 2 Kafka servers and one ZooKeeper server whose IP needs to be static (a virtual IP). When the ZooKeeper node goes down, it is restarted on the second node with the same VIP, but it needs access to the synchronized data folder.
I am not too familiar with ZooKeeper's internals and I can't tell you whether it will run into conflicts when starting up on a data store that "wasn't its own", but I would suggest testing it with a simple rsync setup.
Another way to achieve consensus, if you are using a k3s-based Kubernetes cluster, would be to rely on the internal k8s distributed consensus mechanics to "tell Kafka" which node is the leader. This works for the postgres operator by CrunchyData because Patroni is cool ( https://patroni.readthedocs.io/en/latest/kubernetes.html ) 😎, but I am not sure whether Kafka/ZooKeeper are that flexible and can communicate with a REST API to set their locks ...
Once you have achieved this intermediate step, you can use a PostgreSQL db as an external source of truth for k3s, and then it is as simple as syncing the Postgres data folder between the machines (easily done with rsync). The beauty of this approach is that it is much more generic and could be used for other systems too.
Let me know what you think about these two approaches and whether you manage to set up a test environment. If you do it on GitHub, I can help you out with the implementation.

Running a single kafka s3 sink connector in standalone vs distributed mode

I have a Kafka topic "mytopic" with 10 partitions and want to use the S3 sink connector to sink records to an S3 bucket. For scaling purposes, it should run on multiple nodes to write partition data in parallel to the same S3 bucket.
In Kafka connect user guide and actually many other blogs/tutorials it's recommended to run workers in distributed mode instead of standalone to achieve better scalability and fault tolerance:
... distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime.
I want to figure out which mode to choose for my use case: having one logical connector running on multiple nodes in parallel. My understanding is the following:
If I run in distributed mode, I will end up having only 1 worker processing all the partitions, since it's considered one connector task.
Instead I should run in standalone mode in multiple nodes. In that case I will have a consumer group and achieve parallel processing of partitions.
In the standalone scenario described above I will actually have fault tolerance: if one instance dies, the consumer group will rebalance and the other standalone workers will handle the freed partitions.
Is my understanding correct or am I missing something?
Unfortunately I couldn't find much information on this topic other than this google groups discussion, where the author came to the same conclusion as I did.
In theory, that might work, but you'll end up ssh-ing to multiple machines, having basically the same config files, and just using connect-standalone instead of the connect-distributed command.
You're missing the part about Connect server task rebalancing, though, which communicates over the Connect server REST ports
The underlying task code is all the same, only the entrypoint and offset storage are different. So, why not just use distributed if you have multiple machines?
You don't need to run multiple instances of standalone processes; in distributed mode the Kafka Connect workers take care of distributing the tasks, rebalancing, and offset management. You just need to specify the same group id ...
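For the S3 case specifically, a sketch of submitting one logical connector to a distributed cluster might look like this (Python standard library only; the worker URL, bucket, and region are placeholders, and the Confluent S3 sink plugin is assumed to be installed on every worker):

import json
import urllib.request

# One logical connector with tasks.max=10: with 10 topic partitions, Connect
# can run up to 10 tasks and will spread them across all workers that share
# the same group.id, regardless of which worker receives this request.
connector = {
    "name": "s3-sink-mytopic",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "10",
        "topics": "mytopic",
        "s3.bucket.name": "my-bucket",    # placeholder bucket
        "s3.region": "us-east-1",         # placeholder region
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}

request = urllib.request.Request(
    "http://connect-worker-1:8083/connectors",  # any worker in the cluster
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(request)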

Is a Kafka Connect worker a machine/server or just a cpu core?

In the docs, Kafka Connect workers are described as processes, so in my understanding CPU cores.
But in the same docs they are meant to provide automatic fault tolerance (in their distributed mode), so in my understanding different machines, since fault tolerance at the process level is meaningless IMO.
Could somebody enlighten me, please?
A Kafka Connect worker is a JVM process.
You can run multiple Kafka Connect workers in distributed mode configured as a cluster, and if one worker dies the work (tasks) are distributed amongst the remaining workers.
Typically you would deploy one Kafka Connect worker per machine. Running multiple Kafka Connect workers in distributed mode on one machine is not something that would generally make sense IMO.
I have not tested it but I don't believe that a Kafka Connect worker is tied to one CPU.
For more explanation see here: https://youtu.be/oNK3lB8Z-ZA?t=1337 (slides: https://rmoff.dev/bbuzz19-kafka-connect)
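If you want to see this for yourself, you can ask any worker in the cluster which host:port owns each task of a connector. Here is a small sketch against the Connect REST API (the connector name and worker URL are placeholders):

import json
import urllib.request

# Placeholder worker URL and connector name.
with urllib.request.urlopen("http://localhost:8083/connectors/my-connector/status") as response:
    status = json.load(response)

# worker_id is the host:port of the worker JVM currently running each task.
for task in status["tasks"]:
    print(task["id"], task["state"], task["worker_id"])

If you stop one worker and re-run this after the rebalance, the surviving workers' host:port values show up as the new owners of its tasks.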

multiple connectors in kafka to different topics are going to same node

I have created two Kafka connectors in Kafka Connect which use the same connector class but listen to different topics.
When I launch the process on my node, both connectors end up creating tasks on this process. However, I would like one node to handle only one connector/topic. How can I limit a topic/connector to a single node? I don't see any configuration in connect-distributed.properties where a process could specify which connector to use.
Thanks
Kafka Connect in distributed mode can run as a cluster of one or more workers. Each worker can run multiple tasks. Depending on how many connectors and workers you are running, you will have tasks running on the same worker. This is deliberate - the idea is that Kafka Connect will manage your tasks and workload for you, across the available workers.
If you want to isolate your processing you can run Kafka Connect as separate Connect clusters, either on the same machine (make sure to use different REST ports), or separate machines.
For more info, see architecture and config for steps to configure separate clusters. Note that a cluster can actually be a single worker, but then you don't have any redundancy in the event of failure.
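As an illustration of what has to differ between two isolated Connect clusters, here is a small sketch that writes out two worker property files; the file names, ports, and topic names are placeholders, and common settings such as bootstrap.servers and the converters are omitted:

# Each cluster needs its own group.id, REST port, and internal storage topics;
# each connector is then registered only with the cluster that should run it.
clusters = {
    "worker-a.properties": {
        "group.id": "connect-cluster-a",
        "rest.port": "8083",
        "offset.storage.topic": "connect-offsets-a",
        "config.storage.topic": "connect-configs-a",
        "status.storage.topic": "connect-status-a",
    },
    "worker-b.properties": {
        "group.id": "connect-cluster-b",
        "rest.port": "8084",
        "offset.storage.topic": "connect-offsets-b",
        "config.storage.topic": "connect-configs-b",
        "status.storage.topic": "connect-status-b",
    },
}

for filename, props in clusters.items():
    with open(filename, "w") as f:
        f.writelines(f"{key}={value}\n" for key, value in props.items())

Each file is then passed to its own connect-distributed process, giving you two independent clusters.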

How to set kafka-connect connector and task's JVM heap size?

Does Kafka Connect start a new connector and its tasks within the Kafka Connect process, or will a new JVM process be forked?
If it starts the plugin within the Kafka Connect process, then I need to set the Kafka Connect JVM heap size via KAFKA_CONNECT_JVM_HEAP_OPT (using the Confluent Docker image). The problem then is that if I start many tasks or many connectors, they will share the JVM heap, so it is hard to decide on the heap size for Kafka Connect.
If instead Kafka Connect starts each connector in a new JVM process, how can I set the heap size for those processes?
Kafka Connect has basic support for multi-tenancy. Specifically, you are able to bundle several connector instances within the same Connect worker.
Each Connect worker always maps to a single JVM instance. A request to start a new connector does not result in spawning a new JVM instance. Connect workers with the same group.id form a cluster of Connect workers, and connector tasks are distributed among the workers in that Connect cluster.
A Connect worker's heap size can be easily set using:
export KAFKA_HEAP_OPTS="-Xms256M -Xmx2G" (this example uses the default values)
or, when a docker image is used, by setting:
-e CONNECT_KAFKA_HEAP_OPTS="-Xms256M -Xmx2G" (again this example uses the default values)
Connect workers can be scaled horizontally. Adding more workers to a Connect cluster adds memory and computing resources to your deployment. If you need to apply a more specific and tight memory budget to your Connect deployment, you might choose to group specific connectors per Connect cluster, or even, in some cases, deploy one connector instance per Connect cluster.
All tasks share the memory space of the worker they run in on the host OS; whether that's a container doesn't really matter (other than the fact that, without JVM flags on the process inside the container, it's limited even further).
You "add memory" to your Connect cluster by adding more workers. You prevent OOM errors by increasing topic partitions and connector tasks, reducing poll/batch amounts, and reducing the overall amount of data each worker needs to read.
The environment variable for Connect's heap settings is KAFKA_HEAP_OPTS, and you can add further JVM flags via KAFKA_OPTS.
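As an example of "reducing poll/batch amounts", here is a sketch that shrinks a sink connector's per-poll batch via connector-level consumer overrides (Python standard library only). It assumes the workers run Kafka 2.3+ with connector.client.config.override.policy=All, and the connector name and worker URL are placeholders:

import json
import urllib.request

# Placeholder worker URL and connector name.
base = "http://localhost:8083/connectors/my-sink"

# Fetch the connector's current config, merge in the overrides, and PUT it back.
with urllib.request.urlopen(f"{base}/config") as response:
    config = json.load(response)

config["consumer.override.max.poll.records"] = "200"      # fewer records per poll
config["consumer.override.fetch.max.bytes"] = "10485760"  # cap bytes per fetch request

request = urllib.request.Request(
    f"{base}/config",
    data=json.dumps(config).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(request)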