I came across this phrase at https://niqdev.github.io/devops/kafka/
and https://livebook.manning.com/book/kafka-streams-in-action/chapter-2/109 (Kafka Streams in Action):
The controller broker is responsible for setting up leader/follower relationships for all partitions of a topic. If a Kafka node dies or is unresponsive (to ZooKeeper heartbeats), all of its assigned partitions (both leader and follower) are reassigned by the controller broker.
I think the part about reassignment of follower partitions to other brokers is not correct, as those partitions won't heal themselves unless the broker comes back. I know this ONLY happens for the leader replica: if the broker holding the leader replica goes down, one of the brokers holding a follower becomes the new leader. But I don't think "reassignment" of followers happens automatically unless a reassignment is initiated manually. Please add your inputs.
The terminology might indeed be a little off, but the idea still applies. Followers are not necessarily assigned to other brokers, but they do need to change the endpoint they send their fetch requests to. A follower's job is to stay in sync with the leader, and if leadership has been assigned to a new broker because the old one failed, then the followers need to send their fetch requests to the newly elected leader. I think that is what reassignment means in the context you shared.
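To see that in practice, here is a minimal sketch with the Java AdminClient (assuming a reasonably recent kafka-clients; the bootstrap address and topic name are placeholders) that prints the leader, replicas and ISR of each partition. Running it before and after stopping a broker shows the leader moving while the replica assignment itself stays the same:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class LeaderIsrInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address for this sketch.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singleton("my-topic")) // placeholder topic
                    .allTopicNames().get()
                    .get("my-topic");

            // For each partition: which broker leads it, which brokers hold replicas,
            // and which of those replicas are currently in sync.
            for (TopicPartitionInfo p : description.partitions()) {
                System.out.printf("partition %d leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```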
Related
I understand Kafka has the ISR mechanism to manage leader-follower data replication, but I'm just wondering who exactly updates the ISR?
For each partition, the current leader tracks and manages the current in-sync replicas. Followers send Fetch requests to the leader to retrieve records, which lets the leader keep track of where each follower is and determine which ones are in sync.
Before Kafka 2.7, each leader would update its partition state znode in ZooKeeper. Now (with KIP-497) when the in-sync replicas change, leaders send an AlterISR request to the controller and it's the controller that is responsible for updating ZooKeeper.
I am preparing for the CCDAK certification and I came across this question:
"A topic has three replicas and you set min.insync.replicas to 2. If two out of three replicas are not available, what happens when a consume request is sent to broker?"
The given answer is:
Data will be returned by the remaining in-sync replica
I had a doubt about a corner case: what happens when both the leader and the in-sync replica go down at the same time (the third replica still not in sync) and the consumer wants to read a message not present in the third replica (it may not have been copied yet, since that replica is not in sync)?
Would a practical example be something like this: let's say the consumer has committed offset 10 and wants to read from 11, but the third replica only has messages up to offset 9. The leader and the follower are down; what would happen? Will the consumer just wait for the other brokers to come back, or will Kafka throw an error that I have to deal with?
Your consumer will not be rebalanced, because Kafka cannot elect a leader for that partition: the only remaining replica is not in sync, so Kafka waits for one of the other replicas to come back before electing a leader. In this scenario your consumer will simply hang.
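"Hang" here means the poll loop simply keeps coming back empty rather than throwing. A minimal sketch of the consumer side (bootstrap address, group id and topic name are placeholders) showing what that looks like while the partition has no leader:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class WaitingConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("my-topic"));            // placeholder
            while (true) {
                // While the partition has no elected leader, poll() keeps returning
                // empty batches; no exception is surfaced to the application.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    System.out.println("nothing fetched (partition may currently be offline)");
                    continue;
                }
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```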
Suppose default.replication.factor is set to one. For the sake of simplicity, let's say we have a topic with only one partition. We have a Kafka setup with three brokers, and the topic we are interested in lives on the broker that just went down. Obviously we won't have access to messages on this topic until the broker is brought back, but my question is: what will happen to messages for this topic that come from producers while the broker is down? Will they be rejected?
The producer will not be able to locate the leader replica of the partition, because with no in-sync replicas available at failover time there is no leader at all. There will be an error, but it won't necessarily surface on the send itself, especially if you're batching sends.
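A minimal sketch of where that error does show up (broker address and topic name are placeholders; the exact exception depends on the client version and configuration, commonly a TimeoutException once delivery.timeout.ms expires): send() usually just queues the record into a batch, and the failure arrives later via the callback or the returned Future.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerFailureSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Bounds how long the producer keeps retrying while no leader is available.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 30000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns immediately; the record sits in a batch until the leader
            // is reachable or the delivery timeout expires.
            producer.send(new ProducerRecord<>("single-replica-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // With the only replica down, this is where the error lands.
                            System.err.println("delivery failed: " + exception);
                        } else {
                            System.out.println("delivered at offset " + metadata.offset());
                        }
                    });
            producer.flush();
        }
    }
}
```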
I have two questions.
I wonder how the leader partition and the follower partitions are synchronized.
If the leader partition receives a message, does the leader broadcast it to the follower partitions over some background communication channel? The Kafka config file does not seem to include any settings for this (a synchronization port, etc.).
Assume the following architecture:
Two brokers - Two partition - Two replicas
Broker#1 - leader partition#1, follower partition#2
Broker#2 - leader partition#2, follower partition#1
Messages will be sent round-robin to these two brokers...
If message#1 goes to Broker#1 (partition#1) and Broker#1 is shut down,
does Broker#2 then promote its follower of partition#1, so that Broker#2 has two active leader partitions (and can deliver message#1)?
This is already handled by Kafka. You only need to define the replication factor for a topic. According to Kafka docs,
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
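As an illustration of "you only need to define the replication factor", a minimal sketch (topic name and bootstrap address are placeholders) that creates a topic with two partitions and replication factor 2, leaving leader placement and failover entirely to Kafka:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Two partitions, replication factor 2: with two brokers, each typically
            // leads one partition and follows the other, as in the layout above.
            NewTopic topic = new NewTopic("my-topic", 2, (short) 2);              // placeholder name
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```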
Your question is not entirely clear, but I believe my answer to this question should shed some light with regard to Kafka partitions, distribution of messages, and fault tolerance.
Kafka has the concept of an in-sync replica set, which is the set of nodes that aren't too far behind the leader.
What happens if the network cleanly partitions so that a minority containing the leader is on one side, and a majority containing the other in-sync nodes on the other side?
The minority/leader-side presumably thinks that it lost a bunch of nodes, reduces the ISR size accordingly, and happily carries on.
The other side probably thinks that it lost the leader, so it elects a new one and happily carries on.
Now we have two leaders in the same cluster, accepting writes independently. In a system that requires a majority of nodes to proceed after a partition, the old leader would step down and stop accepting writes.
What happens in this situation in Kafka? Does it require majority vote to change the ISR set? If so, is there a brief data loss until the leader side detects the outages?
I haven't tested this, but I think the accepted answer is wrong and Lars Francke is correct about the possibility of split-brain.
Zookeeper quorum requires a majority, so if ZK ensemble partitions, at most one side will have a quorum.
Being a controller requires having an active session with ZK (ephemeral znode registration). If the current controller is partitioned away from ZK quorum, it should voluntarily stop considering itself a controller. This should take at most zookeeper.session.timeout.ms = 6000. Brokers still connected to ZK quorum should elect a new controller among themselves. (based on this: https://stackoverflow.com/a/52426734)
Being a topic-partition leader also requires an active session with ZK. Leader that lost a connection to ZK quorum should voluntarily stop being one. Elected controller will detect that some ex-leaders are missing and will assign new leaders from the ones in ISR and still connected to ZK quorum.
Now, what happens to producer requests received by the partitioned ex-leader during ZK timeout window? There are some possibilities.
If producer's acks = all and topic's min.insync.replicas = replication.factor, then all ISR should have exactly the same data. The ex-leader will eventually reject in-progress writes and producers will retry them. The newly elected leader will not have lost any data. On the other hand it won't be able to serve any write requests until the partition heals. It will be up to producers to decide to reject client requests or keep retrying in the background for a while.
Otherwise, it is very probable that the new leader will be missing up to zookeeper.session.timeout.ms + replica.lag.time.max.ms = 16000 worth of records and they will be truncated from the ex-leader after the partition heals.
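For reference, a minimal sketch of the producer side of the stricter branch above (these are standard producer settings; the topic or broker still has to set min.insync.replicas separately, e.g. equal to the replication factor for the no-data-loss case):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class StrictDurabilityProducer {
    public static KafkaProducer<String, String> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // acks=all: the leader only acknowledges once all in-sync replicas have the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotence + retries so writes rejected by the ex-leader can be retried
        // against the newly elected leader without creating duplicates.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        return new KafkaProducer<>(props);
    }
}
```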
Let's say you expect network partitions longer than you are comfortable being read-only for.
Something like this can work:
you have 3 availability zones and expect that at most 1 zone will be partitioned from the other 2
in each zone you have a Zookeeper node (or a few), so that 2 zones combined can always form a majority
in each zone you have a bunch of Kafka brokers
each topic has replication.factor = 3, one replica in each availability zone, min.insync.replicas = 2
producers' acks = all
This way there should be two Kafka ISRs on the ZK-quorum side of the network partition, at least one of them fully up to date with the ex-leader. So there is no data loss on the brokers, and the partition stays available for writes from any producers that are still able to connect to the winning side.
In a Kafka cluster, one of the brokers is elected to serve as the controller.
Among other things, the controller is responsible for electing new leaders. The Replica Management section covers this briefly: http://kafka.apache.org/documentation/#design_replicamanagment
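If you want to check which broker currently holds the controller role, a minimal sketch with the Java AdminClient (bootstrap address is a placeholder):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

import java.util.Properties;

public class ShowController {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // The broker the cluster currently reports as its controller.
            Node controller = admin.describeCluster().controller().get();
            System.out.println("controller: " + controller);
        }
    }
}
```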
Kafka uses Zookeeper to try to ensure there's only one controller at a time. However, the situation you described could still happen, splitting both the Zookeeper ensemble (assuming both sides can still have quorum) and the Kafka cluster in two, resulting in two controllers.
In that case, Kafka has a number of configurations to limit the impact:
unclean.leader.election.enable: false by default, this prevents replicas that were not in sync from ever becoming leaders. If no in-sync replicas are available, Kafka marks the partition as offline, preventing data loss
replication.factor and min.insync.replicas: for example, if you set them to 3 and 2 respectively, then in case of a "split-brain" you can prevent producers from sending records to the minority side, as long as they use acks=all (see the sketch below)
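As an illustration of applying those guards at the topic level, a minimal sketch with the Java AdminClient (topic name and bootstrap address are placeholders; assumes kafka-clients 2.3+ for incrementalAlterConfigs):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Properties;

public class HardenTopicConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder

            // Require 2 in-sync replicas per write and keep unclean leader election off,
            // matching the replication.factor=3 / min.insync.replicas=2 example above.
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, Arrays.asList(
                    new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"),
                            AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("unclean.leader.election.enable", "false"),
                            AlterConfigOp.OpType.SET)));

            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```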
See also KIP-101 for the details about handling logs that have diverged once the cluster is back together.