How Kafka distributes the topic partitions among the brokers - apache-kafka

I have 3 Kafka brokers in 3 different VMs, one of which also runs ZooKeeper. I now create a topic with 8 partitions. The producer pushes messages to this group of brokers on the created topic.
How does Kafka distribute a topic and its partitions among the brokers?
Does Kafka redistribute the topic when a new Kafka broker joins the cluster?
Can the number of partitions be increased after the topic was created?

When you create a new topic, Kafka places the partitions and replicas so that the brokers with the fewest existing partitions are used first, and the replicas of the same partition are on different brokers.
When you add a new broker, it is used for new partitions (since it has the lowest number of existing partitions), but existing partitions are not automatically rebalanced onto it. You can use the replica-reassignment tool to move partitions and replicas to the new broker.
Yes, you can add partitions to an existing topic.
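
To make the placement rule above concrete, here is a toy Python sketch. It is not Kafka's actual assignment code: `place_partitions` and the `load` dictionary are hypothetical names, and the sketch only models "least-loaded broker first, replicas of one partition on different brokers". The replication factor of 2 is an assumption, since the question does not state one.

```python
def place_partitions(load, num_partitions, replication_factor):
    """Toy placement: put each replica on the least-loaded broker that
    does not already hold a replica of the same partition.

    `load` maps broker id -> number of replicas it already hosts and is
    updated in place. Returns {partition: [broker ids]}.
    """
    if replication_factor > len(load):
        raise ValueError("replication factor exceeds number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        replicas = []
        for _ in range(replication_factor):
            candidates = [b for b in load if b not in replicas]
            # Ties are broken by broker id so the result is deterministic.
            chosen = min(candidates, key=lambda b: (load[b], b))
            replicas.append(chosen)
            load[chosen] += 1
        assignment[partition] = replicas
    return assignment


# Three empty brokers, one topic with 8 partitions (replication factor assumed).
load = {0: 0, 1: 0, 2: 0}
topic_a = place_partitions(load, num_partitions=8, replication_factor=2)
load_after_topic_a = dict(load)

# A fourth broker joins. Nothing moves; it is only picked for new partitions.
load[3] = 0
topic_b = place_partitions(load, num_partitions=3, replication_factor=1)

print("topic_a:", topic_a)
print("load after topic_a:", load_after_topic_a)
print("topic_b:", topic_b)
```

In this model the replica counts of the three original brokers end up within one of each other, and the new broker receives only the partitions of the topic created after it joined. The replicas of `topic_a` stay where they were until someone runs a reassignment.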

Related

Kafka Producer, Consumer, Broker in same host?

Are there any downsides to running the same producer and consumer code on all nodes in the cluster? If there are 8 nodes in the cluster (8 consumers, 8 Kafka brokers, and 8 producers), would 8 producers be running at the same time in the cluster? Is there a way to modify the cluster so that only one producer runs at a time?
A Kafka cluster is nothing but a set of Kafka brokers running under distributed consensus. The cluster is agnostic about the number of producers and consumers running around it. Producers and consumers are clients of the Kafka cluster: producers stream data to Kafka and consumers read the data from Kafka. Within the cluster, data is organized into topics, and topics are sharded into partitions. If multiple consumers belong to the same consumer group, the group is self-healing: when one consumer fails, its partitions are reassigned to the remaining consumers.
Is there a way to modify cluster so that only one producer runs at a time?
If you intend to run a single producer at any given point in time, you don't need to change anything within the cluster.
Are there any downsides to running the same producer and consumer code for all nodes in the cluster?
The primary downsides here would be scalability and memory usage.
Producers and consumers are not required to run on the brokers. Producers should be deployed where the data is being generated (or run on separate hosts, like Kafka Connect workers).
Consumers should be scaled out independently, based on the throughput and ordering guarantees that you need in your downstream systems.
There is nothing that says 8 brokers require 8 producers and 8 consumers; the number of partitions matters more.
If you have N partitions in a topic, you can only scale to N active consumers in one consumer group anyway, but to any number of producers.
8 brokers can hold lots of partitions for any given topic.
Running only a single producer is something your own code has to enforce. The broker cannot force it.
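
The "N partitions, at most N active consumers" point can be shown with a small sketch. This is a simplified round-robin model, not one of Kafka's real assignors, and the consumer names are made up for the example.

```python
def assign_round_robin(partitions, consumers):
    """Hand out partitions to the consumers of one group, one at a time.

    Each partition goes to exactly one consumer, so any consumer beyond
    the number of partitions ends up with an empty list and stays idle.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]                      # a topic with 4 partitions
consumers = [f"consumer-{i}" for i in range(6)]  # hypothetical group of 6

assignment = assign_round_robin(partitions, consumers)
active = [c for c, owned in assignment.items() if owned]
idle = [c for c, owned in assignment.items() if not owned]

print("active:", active)
print("idle:", idle)
```

With 4 partitions and 6 consumers, two consumers get nothing to read. Adding brokers does not change that; only adding partitions does.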

Kafka does not replicate a topic to those brokers which were not assigned to the topic when it was created?

I have a topic "reptop" with replication factor 3. My cluster consists of 4 brokers (IDs 0, 1, 2, 3). When the topic was created, brokers 0, 2 and 3 were assigned to it, with broker 2 as leader. Now, when one of those brokers (leader or follower) goes down, Kafka does not replicate the topic to broker 1, even though it is healthy and the ISR is smaller than the replication factor. When the broker that went down, and that was initially assigned to the topic, comes back up, Kafka replicates the topic to that node again. So the question is: why does Kafka not replicate the topic to brokers that were not assigned to it when it was created, even though there are healthy brokers in the cluster and "ISR
This is by design. If you want to reassign the partitions, you must do so with the reassignment tool. Another option is to bring up a new broker instance with the missing ID. Kafka does not "self-heal" like, say, HDFS, and there are many cases where you wouldn't want it to. If you want it to, there are tools out there, like the Confluent Rebalancer, that can be used.
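
As a rough illustration of the manual route, the sketch below builds the JSON plan that `kafka-reassign-partitions.sh` reads through its `--reassignment-json-file` option. Several things here are assumptions, not facts from the question: that "reptop" has a single partition 0, that its replica list is `[2, 0, 3]`, and that broker 3 is the one that failed. `replace_broker` is a hypothetical helper.

```python
import json


def replace_broker(replicas, failed, substitute):
    """Return a new replica list with `failed` swapped for `substitute`."""
    if substitute in replicas:
        raise ValueError("substitute broker already holds a replica")
    return [substitute if broker == failed else broker for broker in replicas]


# Assumed current layout: partition -> replica list (leader 2 listed first).
current = {0: [2, 0, 3]}

plan = {
    "version": 1,
    "partitions": [
        {
            "topic": "reptop",
            "partition": partition,
            # Assumption: broker 3 is down and healthy broker 1 takes over.
            "replicas": replace_broker(replicas, failed=3, substitute=1),
        }
        for partition, replicas in current.items()
    ],
}

plan_json = json.dumps(plan, indent=2)
print(plan_json)
```

Saving that output to a file and passing it to the reassignment tool is what moves the replica onto broker 1. Kafka will not do that step on its own.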

Kafka partition in relation to a broker

Is a partition always on the same physical machine as a broker or can a partition reside on a machine which is not also a broker?
I am pretty sure one or more partitions can reside on a broker node, but I am not sure whether a partition can reside on a non-broker node.
A partition is just a structure/object that resides inside a broker. Without a running Kafka broker, there is no partition. Partitions cannot exist outside of a broker.
Kafka brokers run in clusters. A Kafka cluster can consist of just one broker, but it can also have thousands of brokers. When you create a topic with a defined number of partitions, they will be distributed across the brokers in the cluster (either automatically, or you can specify the distribution yourself). So if you want to use multiple machines for your topics and partitions, you need to run a Kafka broker on each of these machines and connect them into a cluster.
No, a single partition cannot be spread over multiple machines in Kafka. A partition cannot be split between multiple brokers, and not even between multiple disks on the same broker. In other words, the size of a partition is limited by the space available on a single disk mount.

Kafka Topic Distribution among brokers

When creating topics, can we determine which broker will be the leader for the topic? Are topics balanced across brokers in Kafka? (Considering the topics have just one partition)
Kafka manages this internally, and in general you don't need to worry about it: http://kafka.apache.org/documentation/#basic_ops_leader_balancing
If you create a new topic, Kafka will select a broker based on load. If a topic has only one partition, it will be hosted on a single broker (plus followers if you have multiple replicas), because a partition cannot be split over multiple brokers in Kafka.
Nevertheless, you can find out which broker hosts which topic, and you can also "move" topics and partitions: http://kafka.apache.org/documentation/#basic_ops_cluster_expansion

One Kafka broker connects to multiple zookeepers

I'm new to Kafka, zookeeper and Storm.
In our environment we have one Kafka broker connecting to multiple ZooKeeper nodes. Is there an advantage to having the producer send messages for a specific topic and partition to one broker backed by multiple ZooKeeper nodes, versus multiple brokers backed by multiple ZooKeeper nodes?
Yes there is. Kafka allows you to scale by adding brokers. When you use a Kafka cluster with a single broker, as you have, all partitions reside on that single broker. But when you have multiple brokers, Kafka will split the partitions between them. So, broker A may be elected leader for partitions 1 and 2 of your topic, and broker B leader for partition 3. So, when you publish messages to the topic, the client will split the messages between the various partitions on the two brokers.
Note that I also mentioned leader election. Adding brokers to your Kafka cluster gives you replication. Kafka uses ZooKeeper to elect a leader for each partition as I mentioned in my example. Once a leader is elected, the client splits messages among partitions and sends each message to the leader for the appropriate partition. Depending on the topic configuration, the leader may synchronously replicate messages to a backup. So, in my example, if the replication factor for the topic is 2 then broker A will synchronously replicate messages for partitions 1 and 2 to broker B and broker B will synchronously replicate messages for partition 3 to broker A.
So, that's all to say that adding brokers gives you both scalability and fault-tolerance.
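
The two-broker example above can be modelled in a few lines of Python. This is only a sketch: the `LEADERS` table mirrors the example (broker A leads partitions 1 and 2, broker B leads partition 3), and `zlib.crc32` is a stand-in for the hash a real Kafka client uses, so the partition chosen for a given key here will not match a real producer.

```python
import zlib

BROKERS = ["A", "B"]
LEADERS = {1: "A", 2: "A", 3: "B"}  # partition -> leader, as in the example


def followers(partition, replication_factor=2):
    """Brokers that hold a backup copy of `partition` (every non-leader
    broker, up to replication_factor - 1 of them)."""
    leader = LEADERS[partition]
    others = [broker for broker in BROKERS if broker != leader]
    return others[: replication_factor - 1]


def route(key: bytes):
    """Pick a partition from the key hash and return (partition, leader).

    The client always sends the message to the leader of that partition.
    """
    partitions = sorted(LEADERS)
    partition = partitions[zlib.crc32(key) % len(partitions)]
    return partition, LEADERS[partition]


for key in (b"user-1", b"user-2", b"user-3"):  # hypothetical message keys
    partition, leader = route(key)
    print(key, "-> partition", partition, "leader", leader,
          "followers", followers(partition))
```

With replication factor 2, partitions 1 and 2 have their backup on B and partition 3 has its backup on A, which is where the fault tolerance in the example comes from. With replication factor 1 there are no followers at all.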