Running a single kafka s3 sink connector in standalone vs distributed mode - apache-kafka

I have a kafka topic "mytopic" with 10 partitions and want to use S3 sink connector to sink records to an S3 bucket. For scaling purposes it should be running on multiple nodes to write partitions data in parallel to the same S3 bucket.
In Kafka connect user guide and actually many other blogs/tutorials it's recommended to run workers in distributed mode instead of standalone to achieve better scalability and fault tolerance:
... distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime.
I want to figure out which mode to choose for my use case: having one logical connector running on multiple nodes in parallel. My understanding is following:
If I run in distributed mode, I will end up having only 1 worker processing all the partitions, since it's considered one connector task.
Instead I should run in standalone mode in multiple nodes. In that case I will have a consumer group and achieve parallel processing of partitions.
In above described standalone scenario I will actually have fault tolerance: if one instance dies, the consumer group will rebalance and other standalone workers will handle the freed partitions.
Is my understaning correct or am I missing something?
Unfortunately I couldn't find much information on this topic other than this google groups discussion, where the author came to the same conclusion as I did.

In theory, that might work, but you'll end up ssh-ing to multiple machines, having basically the same config files, and just not using the connect-distributed command instead of connect-standalone.
You're missing the part about Connect server task rebalancing, though, which communicates over the Connect server REST ports
The underlying task code is all the same, only the entrypoint and offset storage are different. So, why not just use distributed if you have multiple machines?

You don't need to run, multiple instances of standalone processes, the Kafka workers are taking care of distributing the tasks, rebalancing, offset management under the distributed mode, you need to specify the same group id ...

Related

How does distribution mechanism works when Kafka runs locally?

How does distribution mechanism works when Kafka runs locally? Please tell the disadvantages too.
If you only run one broker locally, you have a single point of failure and no processing is truly distributed
If you have multiple brokers on the same machine, and you mount different volumes for each broker process logs, you'd end up with distributed storage + fault tolerance, but still no distributed processing
In either case, you can create as many topics as you want with many partitions, but you can only set the replication factor of the topics to be the number of active brokers
Multiple consumer processes are also able to run fine on a single machine, but you'd get more throughput by separating brokers and consumers across several physical machines (more cpu available, and different network interfaces)

How do Kafka Connect workers allocate manage resource limits (memory/cores) to distribute tasks?

In Kubernetes, you explicitly specify the resource limits for a container. In launching a Kafka connector, you request max tasks but how does the connect worker cluster know how to distribute the load? Does it consider the tasks as equal? Does it use internal metrics?
The Apache Kafka docs and the confluent docs do not explicitly say except Confluent advises the following which would indicate connect workers do not have resource management:
The resource limit depends heavily on the types of connectors being run by the workers, but in most cases users should be aware of CPU and memory bounds when running workers concurrently on a single machine.
https://docs.confluent.io/3.1.2/connect/userguide.html#connect-standalone-v-distributed
Also the cluster deployment appears to require an external resource manager to handle failover of workers.
Kafka Connect workers can be deployed in a number of ways, each with their own benefits. Workers lend themselves well to being run in containers in managed environments such as YARN, Mesos, or Docker Swarm as all state is stored in Kafka, making the local processes themselves stateless. We provide Docker images and documentation for getting started with those images is here. By design, Kafka Connect does not automatically handle restarting or scaling workers which means your existing clustering solutions can continue to be used transparently.
how does the connect worker cluster know how to distribute the load
Each connector can opt to partition its work into tasks (for example, ingesting multiple tables from one database could be done in parallel and so one table would be done by one task), up to the tasks.max limit configured.
Kafka Connect balances these tasks across the available workers such that they are evenly distributed (based on the number of tasks).
The rebalancing protocol changed in release 2.3 of Apache Kafka as part of KIP-415, there are details in the KIP and here. In a nutshell, with incremental cooperative rebalancing Kafka Connect spreads the tasks equally starting from the least loaded workers, eventually including more workers while the load evens out.
Also the cluster deployment appears to require an external resource manager to handle failover of workers.
To be clear - the failover of tasks is done automatically by Kafka Connect, and as you say, the failover of workers would be managed externally.

multiple connectors in kafka to different topics are going to same node

I have created two kafka connectors in kafka-connect which use the same Connector class but have different topics they listen to.
When I launch the process on my node, both the connectors end up creating tasks on this process. However, I would like one node to only handle one connector/topic. How can I limit a topic/connector to a single node? I don't see any configuration in connect-distributed.properties where a process could specify which connector to use.
Thanks
Kafka Connect in distributed mode can run as a cluster of one or more workers. Each worker can run multiple tasks. Depending on how many connectors and workers you are running, you will have tasks running on the same worker. This is deliberate - the idea is that Kafka Connect will manage your tasks and workload for you, across the available workers.
If you want to isolate your processing you can run Kafka Connect as separate Connect clusters, either on the same machine (make sure to use different REST ports), or separate machines.
For more info, see architecture and config for steps to configure separate clusters. Note that a cluster can actually be a single worker, but then you don't have any redundancy in the event of failure.

Does scaling Kafka Connect is same as scaling Kafka Consumer?

We need to pull data from Kafka and write into AWS s3. The Kafka is managed by separate department and we have access to only specific topic.
Based on Kafka documentation it looks like Kafka Connect is easy solution for me because I don't have any custom message processing logic.
Normally when we run Kafka Consumer we can run multiple JVM with same consumer group for scalability. The consumer JVM of specific consumer can run in same physical server or different. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 nodes, you could have : (for example)
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id which identifies them as members of the same cluster (and thus eligible for sharing workload of tasks out across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out more simple (you just add additional nodes, but the execution & config remains the same).
I'm don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks, and connectors, as required.
My understanding is that if you only have a single machine, you should only launch one kafka connect instance, and configure the tasks.max property to the amount of parallelism you'd like to achieve (in your example 20 might be good). This should allow kafka connect to read from your partitions in parallel, see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want kafka connect to run on multiple machines and read data from the same topic it is possible to run in distributed mode.

Scaling Kafka stream application across multiple users

I have a setup where I'm pushing events to kafka and then running a Kafka Streams application on the same cluster. Is it fair to say that the only way to scale the Kafka Streams application is to scale the kafka cluster itself by adding nodes or increasing Partitions?
In that case, how do I ensure that my consumers will not bring down the cluster and ensure that the critical pipelines are always "on". Is there any concept of Topology Priority which can avoid a possible downtime? I want to be able to expose the streams for anyone to build applications on without compromising the core pipelines. If the solution is to setup another kafka cluster, does it make more sense to use Apache storm instead, for all the adhoc queries? (I understand that a lot of consumers could still cause issues with the kafka cluster, but at least the topology processing is isolated now)
It is not recommended to run your Streams application on the same servers as your brokers (even if this is technically possible). Kafka's Streams API offers an application-based approach -- not a cluster-based approach -- because it's a library and not a framework.
It is not required to scale your Kafka cluster to scale your Streams application. In general, the parallelism of a Streams application is limited by the number of partitions of your app's input topics. It is recommended to over-partition your topic (the overhead for this is rather small) to guard against scaling limitations.
Thus, it is even simpler to "offer anyone to build applications" as everyone owns their application. There is no need to submit apps to a cluster. They can be executed anywhere you like (thus, each team can deploy their Streams application the same way by which they deploy any other application they have). Thus, you have many deployment options from a WAR file, over YARN/Mesos, to containers (like Kubernetes). Whatever works best for you.
Even if frameworks like Flink, Storm, or Samza offer cluster management, you can only use such tools that are integrated with those frameworks (for example, Samza requires YARN -- no other options available). Let's say you have already a Mesos setup, you can reuse it for your Kafka Streams applications -- no need for a dedicated "Kafka Streams cluster" (because there is no such thing).
An application’s processor topology is scaled by breaking it into
multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based
on the input stream partitions for the application, with each task
assigned a list of partitions from the input streams (i.e., Kafka
topics).
The assignment of partitions to tasks never changes so that each task
is a fixed unit of parallelism of the application. Tasks can then
instantiate their own processor topology based on the assigned
partitions; they also maintain a buffer for each of its assigned
partitions and process messages one-at-a-time from these record
buffers.
As a result stream tasks can be processed independently and in
parallel without manual intervention.
It is important to understand that Kafka Streams is not a resource
manager, but a library that “runs” anywhere its stream processing
application runs. Multiple instances of the application are executed
either on the same machine, or spread across multiple machines and
tasks can be distributed automatically by the library to those running
application instances.
The assignment of partitions to tasks never changes; if an application
instance fails, all its assigned tasks will be restarted on other
instances and continue to consume from the same stream partitions.
The processing of the stream happens in the machines where the application is running.
I recommend you to have a look to this guide, it can help you to better understand the way Kafka Streams work.