How does the distribution mechanism work when Kafka runs locally? - apache-kafka

How does the distribution mechanism work when Kafka runs locally? Please tell me the disadvantages too.

If you only run one broker locally, you have a single point of failure, and no processing is truly distributed.
If you run multiple brokers on the same machine and mount a different volume for each broker's logs, you get distributed storage plus fault tolerance at the disk level, but still no distributed processing.
In either case, you can create as many topics as you want, each with many partitions, but the replication factor of a topic can be at most the number of active brokers.
Multiple consumer processes also run fine on a single machine, but you'd get more throughput by spreading brokers and consumers across several physical machines (more CPU available, and separate network interfaces).
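To make the replication-factor limit concrete, here is a minimal sketch using the Java AdminClient: it counts the live brokers and caps the topic's replication factor accordingly. The bootstrap address, topic name, and partition count are illustrative assumptions, and it presumes the kafka-clients library is on the classpath.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class LocalTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a broker listening on localhost:9092
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Count the live brokers; a topic's replication factor cannot exceed this
            int brokerCount = admin.describeCluster().nodes().get().size();
            short replicationFactor = (short) Math.min(3, brokerCount);

            // 6 partitions is an arbitrary choice; replication is capped by broker count
            NewTopic topic = new NewTopic("mytopic", 6, replicationFactor);
            admin.createTopics(List.of(topic)).all().get();
            System.out.printf("Created mytopic with replication factor %d across %d broker(s)%n",
                    replicationFactor, brokerCount);
        }
    }
}
```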

Related

Running a single Kafka S3 sink connector in standalone vs distributed mode

I have a Kafka topic "mytopic" with 10 partitions and want to use the S3 sink connector to sink records to an S3 bucket. For scaling purposes it should run on multiple nodes to write partition data in parallel to the same S3 bucket.
In the Kafka Connect user guide, and in many other blogs/tutorials, it's recommended to run workers in distributed mode instead of standalone to achieve better scalability and fault tolerance:
... distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime.
I want to figure out which mode to choose for my use case: having one logical connector running on multiple nodes in parallel. My understanding is the following:
If I run in distributed mode, I will end up with only 1 worker processing all the partitions, since it's considered one connector task.
Instead, I should run in standalone mode on multiple nodes. In that case I will have a consumer group and achieve parallel processing of the partitions.
In the standalone scenario described above I will actually have fault tolerance: if one instance dies, the consumer group will rebalance and the other standalone workers will handle the freed partitions.
Is my understanding correct, or am I missing something?
Unfortunately, I couldn't find much information on this topic other than this Google Groups discussion, where the author came to the same conclusion as I did.
In theory that might work, but you'd end up SSH-ing to multiple machines with basically the same config files, just using the connect-standalone command instead of connect-distributed.
You're missing the part about Connect task rebalancing, though, which the workers coordinate over their REST ports.
The underlying task code is all the same; only the entrypoint and offset storage differ. So why not just use distributed mode if you have multiple machines?
You don't need to run multiple standalone instances; in distributed mode the Kafka Connect workers take care of distributing the tasks, rebalancing, and offset management. You just need to specify the same group.id ...
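As a rough illustration of how little running distributed asks of you: once the workers are up with a shared group.id, a single REST call to any one of them starts the connector, and the cluster spreads the tasks out on its own. A sketch using Java's built-in HttpClient; the worker address, connector name, bucket, and region are placeholder assumptions, while the config keys are the standard ones for the Confluent S3 sink.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitS3Sink {
    public static void main(String[] args) throws Exception {
        // Submitted once to any worker; tasks are rebalanced across all workers
        // that share the same group.id in their worker config
        String body = """
            {
              "name": "s3-sink",
              "config": {
                "connector.class": "io.confluent.connect.s3.S3SinkConnector",
                "topics": "mytopic",
                "tasks.max": "10",
                "s3.bucket.name": "my-bucket",
                "s3.region": "us-east-1",
                "storage.class": "io.confluent.connect.s3.storage.S3Storage",
                "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
                "flush.size": "1000"
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // any worker's REST port
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

With tasks.max set to 10 for a 10-partition topic, each task gets one partition, and the tasks land wherever the group's rebalance puts them, regardless of which machine you submitted the config from.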

Is a Kafka Connect worker a machine/server or just a CPU core?

In the docs, Kafka Connect workers are described as processes, so to my understanding, CPU cores.
But in the same docs they are meant to provide automatic fault tolerance (in their distributed mode), so to my understanding, different machines, since fault tolerance at the process level is meaningless IMO.
Could somebody enlighten me, please?
A Kafka Connect worker is a JVM process.
You can run multiple Kafka Connect workers in distributed mode configured as a cluster, and if one worker dies, its work (tasks) is redistributed amongst the remaining workers.
Typically you would deploy one Kafka Connect worker per machine. Running multiple Kafka Connect workers in distributed mode on one machine is not something that would generally make sense IMO.
I have not tested it, but I don't believe that a Kafka Connect worker is tied to one CPU.
For more explanation see here: https://youtu.be/oNK3lB8Z-ZA?t=1337 (slides: https://rmoff.dev/bbuzz19-kafka-connect)
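If you want to see the worker/task split for yourself, the Connect REST API exposes it: each task in a connector's status report carries the worker_id (host:port) of the JVM process currently running it. A minimal sketch, assuming a worker reachable on localhost:8083 and a connector named s3-sink (both are assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorStatus {
    public static void main(String[] args) throws Exception {
        // The JSON response lists each task with the worker_id of the JVM
        // process running it, making the worker-vs-task distinction visible
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors/s3-sink/status"))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```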

Kafka cluster with single broker

I'm looking to start using Kafka for a system and I'm trying to cover all use cases.
Normally it would run as a cluster of brokers on virtual servers (replication factor 3-5), but some customers don't care about resilience; a broker failure that requires a manual restart of the whole system is fine with them, as they just care about hardware costs.
So my question is, are there any issues with using Kafka as a single broker system for small installations with low throughput?
Cheers
It's absolutely OK to use a single Kafka broker. Note, however, that with a single broker you won't have a highly available service, meaning that when the broker fails you will have downtime.
Your replication factor will be limited to 1, and therefore all of the partitions of a topic will be stored on the same node (see the sketch after the list below).
For a proof of concept or non-critical dev work, a single-node cluster works just fine. However, having a cluster has multiple benefits. It's okay to go with a single node if the following are not important/relevant for you:
scalability [spreads load across multiple brokers to maintain a certain throughput]
fail-over [guards against data loss in case one or more nodes go down]
availability [the system remains reachable and functioning even if one or more nodes go down]
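To see the replication-factor ceiling in practice: asking a one-broker cluster for three replicas is rejected outright (with an InvalidReplicationFactorException as the cause), while replication factor 1 goes through. A sketch using the Java AdminClient; the broker address and topic name are assumptions.

```java
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class SingleBrokerLimit {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 replicas on a 1-broker cluster is rejected by the broker
            NewTopic tooManyReplicas = new NewTopic("orders", 3, (short) 3);
            try {
                admin.createTopics(List.of(tooManyReplicas)).all().get();
            } catch (ExecutionException e) {
                // Expected cause: InvalidReplicationFactorException
                System.out.println("Rejected: " + e.getCause().getMessage());
            }

            // Replication factor 1 is the only option on a single broker
            admin.createTopics(List.of(new NewTopic("orders", 3, (short) 1))).all().get();
        }
    }
}
```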

Is scaling Kafka Connect the same as scaling a Kafka consumer?

We need to pull data from Kafka and write it into AWS S3. The Kafka cluster is managed by a separate department, and we have access to only a specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message-processing logic.
Normally when we run a Kafka consumer, we can run multiple JVMs with the same consumer group for scalability, and the JVMs of a given consumer group can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have (for example):
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id, which identifies them as members of the same cluster (and thus eligible for sharing the workload of tasks across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in distributed mode, as it makes scale-out simpler (you just add additional nodes, but the execution & config remain the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple connectors and tasks, as required.
My understanding is that if you only have a single machine, you should launch just one Kafka Connect instance and configure the tasks.max property to the amount of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics, but if you want the instances to consume data from the same topic, I don't think doing this would benefit you: separate threads within the same process via tasks.max will give you the same if not better performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, you can run it in distributed mode.
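For comparison with the plain-consumer side of the question: in a hand-rolled consumer group, parallelism comes from starting more JVMs with the same group.id, which is essentially the role tasks.max plays inside a Connect worker. A minimal sketch of the plain-consumer version; the bootstrap address, group id, and topic name are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Every JVM started with this group.id joins the same consumer group;
        // the topic's 20 partitions rebalance across however many are running
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "s3-writers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("mytopic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
                }
            }
        }
    }
}
```

The difference with Connect is operational: here you scale by launching more copies of this program yourself, whereas Connect workers scale the same logic by spawning up to tasks.max tasks and rebalancing them for you.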

Kafka Resiliency (Clustering) setup on single and multiple machines

I have set up 1 ZooKeeper and 3 Kafka brokers (for redundancy) on a single machine.
I want to know what is the best practice for Kafka Setup on a single machine and multiple machines in a network.
For example, if I set up on a single machine, how many ZooKeeper nodes, brokers, and partitions should I set up?
Or
If I set up on multiple machines (N machines), how many ZooKeeper nodes, brokers, and partitions should I set up with respect to N?
As the machine is the main cause of failures, you gain nothing by running three copies on the same machine.
They will all fail at the same time.