I am familiar with the basics of Kafka. I want to span a single Kafka instance across 2 VMs so that some partitions live on one VM and the rest on the other. How do I configure this kind of system?
What do you mean by "to span a Kafka instance across 2 VMs"? What you can do is run two Kafka brokers, one on each VM, each with its own unique broker.id. Configure both to connect to the same Zookeeper cluster. When you create a new topic with a specific number of partitions, Kafka will spread those partitions over the 2 VMs.
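As a rough illustration of that setup, the sketch below renders a minimal server.properties for each of the two brokers. The hostnames (vm1, vm2, zk-host) and the data directory are assumptions made for the example, not values from the question; the point is only that broker.id differs while zookeeper.connect is identical.

```python
# Sketch: render minimal server.properties text for two brokers that
# join the same cluster. Hostnames and paths are hypothetical.

ZOOKEEPER_CONNECT = "zk-host:2181"  # assumed address; identical on both VMs


def broker_properties(broker_id, host):
    """Return server.properties content for one broker."""
    settings = {
        "broker.id": str(broker_id),             # must be unique per broker
        "listeners": f"PLAINTEXT://{host}:9092",
        "log.dirs": "/var/lib/kafka/data",       # assumed data directory
        "zookeeper.connect": ZOOKEEPER_CONNECT,  # shared, so both join one cluster
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# One config per VM: broker 1 on vm1, broker 2 on vm2.
vm_configs = {
    host: broker_properties(broker_id, host)
    for broker_id, host in enumerate(["vm1", "vm2"], start=1)
}

for host, text in vm_configs.items():
    print(f"# server.properties on {host}")
    print(text)
    print()
```

With both brokers running, a topic created with several partitions has those partitions assigned across broker 1 and broker 2.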
Are there any tools or services that let users generate Kafka events based on a specific schema? I want to stress test my Kafka topic.
I found one tool, kafka-connect-datagen, but it doesn't seem to let me configure the bootstrap server, and my topics are spread across multiple clusters. Looking for some recommendations.
"my topics are spread across multiple clusters"
A Kafka topic cannot be spread across more than one cluster. Its partitions are spread across the brokers of a single cluster, and clients only need one bootstrap address for that cluster.
The DataGen connector works fine against a single cluster. To cover several clusters, run it once per cluster, each with its own config file.
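One way to picture "different config files for unique clusters" is sketched below, assuming Kafka Connect is run in standalone mode with one worker per cluster. The cluster names, bootstrap addresses, topic name and offsets path are hypothetical. The connector settings stay the same everywhere; only the worker's bootstrap.servers decides which cluster receives the generated events.

```python
# Hypothetical cluster names and bootstrap addresses.
CLUSTERS = {
    "cluster-a": "kafka-a:9092",
    "cluster-b": "kafka-b:9092",
}

# Connector settings shared by every run of kafka-connect-datagen.
DATAGEN_CONNECTOR = {
    "name": "datagen-stress-test",
    "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
    "kafka.topic": "stress-test-topic",  # assumed topic name
    "quickstart": "users",               # one of the bundled sample schemas
    "max.interval": "10",                # max ms between generated events
    "iterations": "1000000",             # number of events to generate
    "tasks.max": "1",
}


def to_properties(settings):
    """Render a dict as Java .properties text."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


def standalone_worker(bootstrap, cluster):
    """Worker settings for one cluster; only the bootstrap address and offsets file vary."""
    return {
        "bootstrap.servers": bootstrap,
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "offset.storage.file.filename": f"/tmp/connect-{cluster}.offsets",
    }


worker_files = {
    name: to_properties(standalone_worker(address, name))
    for name, address in CLUSTERS.items()
}
connector_file = to_properties(DATAGEN_CONNECTOR)

for name, text in worker_files.items():
    print(f"# worker-{name}.properties")
    print(text)
    print()
print("# datagen.properties")
print(connector_file)
```

Each worker file would then be passed to connect-standalone together with the shared connector file, one process per cluster.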
I am setting up a Kafka cluster across multiple data centers (DCs). The intention is to ensure that if one DC goes down, both producers and consumers can continue operating without any issues. I came across two options, but I'm not sure what the difference is or how each one works.
Option 1: Setting up multiple Zookeeper clusters (one cluster per DC)
Setting up multiple Zookeeper clusters, where each Zookeeper cluster has its own set of brokers in one DC. In this scenario, will I really get both Active-Active and Disaster Recovery? If 1 DC goes down, what will happen to the consumers?
Option 2: Setting up MirrorMaker with a source and a target
I understand it replicates one cluster to another. But how do I point to both clusters from a consumer or producer perspective? Will it be handled automatically, or is it something I have to do manually?
Any explanation of these options is appreciated.
We need to pull data from Kafka and write it to AWS S3. Kafka is managed by a separate department and we have access to only a specific topic.
Based on the Kafka documentation, Kafka Connect looks like an easy solution for me because I don't have any custom message processing logic.
Normally, when we run a Kafka consumer, we can run multiple JVMs with the same consumer group for scalability. The JVMs of a given consumer group can run on the same physical server or on different ones. How does this work when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have (for example):
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
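The arithmetic behind those three layouts can be sketched as follows. This assumes one task per worker and uses plain round-robin as a stand-in; it is an illustration of the split, not the assignment logic Kafka Connect or the consumer group protocol actually runs.

```python
from typing import Dict, List


def spread_partitions(num_partitions: int, num_tasks: int) -> Dict[int, List[int]]:
    """Round-robin partitions over tasks (illustration only)."""
    assignment: Dict[int, List[int]] = {task: [] for task in range(num_tasks)}
    for partition in range(num_partitions):
        assignment[partition % num_tasks].append(partition)
    return assignment


for tasks in (1, 5, 20):
    counts = [len(parts) for parts in spread_partitions(20, tasks).values()]
    print(f"{tasks} worker(s): {counts[0]} partition(s) each")
```

Running more than 20 tasks against 20 partitions would leave the extra tasks with nothing to read.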
To run Kafka Connect in Distributed mode across multiple nodes, follow the distributed-mode instructions in the Kafka Connect documentation and make sure you give all the workers the same group.id, which identifies them as members of the same cluster (and thus eligible to share the workload of tasks among them). The same documentation covers the remaining distributed-mode config details.
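A minimal sketch of such a worker config is shown below. The bootstrap address, the group.id value and the three storage topic names are placeholders chosen for the example; what matters is that every node is started with the same group.id and the same storage topics.

```python
# Settings shared by every worker in the Connect cluster (values are assumed).
SHARED_WORKER_SETTINGS = {
    "bootstrap.servers": "kafka-host:9092",
    "group.id": "s3-connect-cluster",           # identical on every worker
    "config.storage.topic": "connect-configs",  # connector configs
    "offset.storage.topic": "connect-offsets",  # source offsets
    "status.storage.topic": "connect-status",   # connector and task status
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
}


def render_properties(settings):
    """Render a dict as Java .properties text."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# The same file content is deployed to every node.
worker_properties_text = render_properties(SHARED_WORKER_SETTINGS)
print(worker_properties_text)
```

Each node would then start a worker with connect-distributed and this file; workers that share the group.id find each other through Kafka and rebalance tasks among themselves.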
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out simpler (you just add additional nodes, but the execution & config remain the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks, and connectors, as required.
My understanding is that if you only have a single machine, you should launch only one Kafka Connect instance and set the tasks.max property to the amount of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the Kafka Connect docs on tasks.max.
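For instance, a connector definition with tasks.max set to 20 might look like the sketch below. It assumes the Confluent S3 sink connector, which the question does not actually name, and the connector name, topic, bucket and region are placeholders.

```python
import json

# Hypothetical connector definition; names, bucket and region are placeholders.
connector = {
    "name": "s3-sink-example",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "20",  # upper bound on parallel tasks, one per partition here
        "topics": "my-topic",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",  # records per S3 object before it is committed
    },
}

# JSON body to POST to the worker's /connectors REST endpoint.
payload = json.dumps(connector, indent=2)
print(payload)
```

Note that tasks.max is a ceiling: the connector may start fewer tasks, and tasks beyond the partition count have nothing to consume.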
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, you can run it in distributed mode.
I have set up 1 Zookeeper and 3 Kafka brokers (for redundancy) on a single machine.
I want to know the best practice for a Kafka setup on a single machine and on multiple machines in a network.
For example, if I set up on a single machine, how many Zookeeper instances, brokers and partitions should I set up?
Or
If I set up on multiple machines (N machines), how many Zookeeper instances, brokers and partitions should I set up with respect to N?
As the machine is the main cause of failures, you gain nothing by running three copies on the same machine.
They will all fail at the same time.
I have 3 nodes where Kafka is installed. Each of these 3 nodes has its own Zookeeper instance. Are 3 Zookeeper instances required, or is 1 Zookeeper instance sufficient? Should we have multiple Zookeeper instances for fault tolerance, and in that scenario would one of the instances act as the primary and the others as replicas?
I'm not sure what you mean by "All these 3 nodes have their own zookeeper instances". Basically, you should have a single cluster of one, three or five Zookeeper instances, and all Kafka brokers should use that same cluster. You don't need more than one Zookeeper instance, but I'd highly recommend using three or five instances for availability, since Zookeeper stays available only while a majority of its instances are up. We use three instances of Zookeeper to run our Kafka cluster.
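The reason for choosing one, three or five comes down to majority arithmetic, which the short sketch below works through. The function name is made up for the example.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Instances that can fail while a majority (quorum) is still up."""
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum


for size in (1, 2, 3, 4, 5):
    quorum = size // 2 + 1
    print(f"{size} instance(s): quorum {quorum}, tolerates {tolerated_failures(size)} failure(s)")
```

An even-sized cluster tolerates no more failures than the odd size just below it, which is why four instances buy nothing over three.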