Kafka Resiliency (Clustering) setup on single and multiple machines - apache-kafka

I have setup 1 Zookeeper and 3 Kafka Broker (for Redundancy) on a single machine.
I want to know what is the best practice for Kafka Setup on a single machine and multiple machines in a network.
for e.g. if I set up on a single machine how many zookeeper, brokers and partition I should set up.
Or
If I set on multiple machines (N number of machines) then how many zookeeper, brokers and partition I should set up in respect to N.

As the machine is the main cause of failures, you gain zero by running thee copies on the same machine.
They will all fail at the same time.

Related

How does distribution mechanism works when Kafka runs locally?

How does distribution mechanism works when Kafka runs locally? Please tell the disadvantages too.
If you only run one broker locally, you have a single point of failure and no processing is truly distributed
If you have multiple brokers on the same machine, and you mount different volumes for each broker process logs, you'd end up with distributed storage + fault tolerance, but still no distributed processing
In either case, you can create as many topics as you want with many partitions, but you can only set the replication factor of the topics to be the number of active brokers
Multiple consumer processes are also able to run fine on a single machine, but you'd get more throughput by separating brokers and consumers across several physical machines (more cpu available, and different network interfaces)

Setting up multi datacenter Kafka cluster

I am working on setting up the Kafka cluster with a multi DC cluster. The intention is to ensure if one DC goes down, both producers and consumers can still able to continue operations without any issues. I came across two options, but not sure what's the difference and how it works.
Option 1: Setting up multiple zookeeper cluster (one cluster per DC)
Setting up multiple zookeepers and each zookeeper will have a set of brokers in a DC. In this scenario will I really get both Active-Active and Disaster Recovery? If 1 DC goes down what will happen to consumers.
Option 2: Setting up Mirror maker with source and target
I understand it's a replication of one cluster to another. But how do I point to both clusters from a consumer or producer perspective? Will it be handled automatically or something I should do it manually?
Any explanation of these options are appreciated.

Does scaling Kafka Connect is same as scaling Kafka Consumer?

We need to pull data from Kafka and write into AWS s3. The Kafka is managed by separate department and we have access to only specific topic.
Based on Kafka documentation it looks like Kafka Connect is easy solution for me because I don't have any custom message processing logic.
Normally when we run Kafka Consumer we can run multiple JVM with same consumer group for scalability. The consumer JVM of specific consumer can run in same physical server or different. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 nodes, you could have : (for example)
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id which identifies them as members of the same cluster (and thus eligible for sharing workload of tasks out across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out more simple (you just add additional nodes, but the execution & config remains the same).
I'm don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks, and connectors, as required.
My understanding is that if you only have a single machine, you should only launch one kafka connect instance, and configure the tasks.max property to the amount of parallelism you'd like to achieve (in your example 20 might be good). This should allow kafka connect to read from your partitions in parallel, see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want kafka connect to run on multiple machines and read data from the same topic it is possible to run in distributed mode.

Separate zookeeper install or not using kafka 10.2?

I would like to use the embedded Zookeeper 3.4.9 that come with Kafka 10.2, and not install Zookeeper separately. Each Kafka broker will always have a 1:1 Zookeeper on localhost.
So if I have 5 brokers on hosts A, B, C, D and E, each with a single Kafka and Zookeeper instance running on them, is it sufficient to just run the Zookeeper provided with Kafka?
What downsides or configuration limitations, if any, does the embedded 3.4.9 Zookeeper have compared to the standalone version?
These are a few reason not to run zookeeper on the same box as Kafka brokers.
They scale differently
5 zk and 5 Kafka works but 6:6 or 11:11 do not. You don't need more than 5 zookeeper nodes even for a quite large Kafka cluster. Unlike Kafka, Zookeeper replicates data to all nodes so it gets slower as you add more nodes.
They compete for disk I/O
Zookeeper is very disk I/O latency sensitive. You need to have it on a separate physical disk from the Kafka commit log or you run the risk that a lot of publishing to Kafka will slow zookeeper down and cause it to drop out of the ensemble causing potential problems.
They compete for page cache memory
Kafka uses Linux OS page cache to reduce disk I/O. When other apps run on the same box as Kafka you reduce or "pollute" the page cache with other data that takes away from cache for Kafka.
Server failures take down more infrastructure
If the box reboots you lose both a zookeeper and a broker at the same time.
Even though ZooKeeper comes with each Kafka release it does not mean they should run on the same server. Actually, it is advised that in a production environment they run on separate servers.
In the Kafka broker configuration you can specify the ZooKeeper address, and it can be local or remote. This is from broker config (config/server.properties):
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181
You can replace localhost with any other accessible server name or IP address.
We've been running a setup as you described, with 3 to 5 nodes, each running a kafka broker and the zookeeper that comes with kafka distribution on the same nodes. No issues with that setup so far, but our data throughput isn't high.
If we were to scale above 5 nodes we'd separate them, so that we only scale kafka brokers but keep the zookeeper ensemble small. If zookeeper and kafka start competing for I/O too much, then we'd move their data directories to separate drives. If they start competing for CPU, then we'd move them to separate boxes.
All in all, it depends on your expected throughput and how easily you can upgrade your setup if it starts causing contention. You can start small and easy, with kafka and zookeeper co-located as long as you have the flexibility to upgrade your setup with more nodes and introduce separation later on. If you think this will be hard to add later, better start running them separate from the start. We've been running them co-located for 18+ months and haven't encountered resource contention so far.

How to span the kafka partitions across multiple VM's?

I am familiar with basic kafka system. I want to span a single kafka instance across 2 VM's such that some partitions are in one VM and some more in another VM. Please tell me how to configure this kind of system.
What do you mean by "to span kafka instance across 2 VMs" ? What you can due is having two different Kafka instances running on the 2 VMs. They should be configured in order to connect to the same Zookeeper cluster. When you create a new topic with a specific number of partitions, Kafka will span such partitions over the 2 VMs.