There are two common strategies for keeping replicas in sync: primary-backup replication and quorum-based replication, as stated here.
In primary-backup replication, the leader waits until the write
completes on every replica in the group before acknowledging the
client. If one of the replicas is down, the leader drops it from the
current group and continues to write to the remaining replicas. A
failed replica is allowed to rejoin the group if it comes back and
catches up with the leader. With f replicas, primary-backup
replication can tolerate f-1 failures.
In the quorum-based approach, the leader waits until a write completes
on a majority of the replicas. The size of the replica group doesn’t
change even when some replicas are down. If there are 2f+1 replicas,
quorum-based replication can tolerate f replica failures. If the
leader fails, it needs at least f+1 replicas to elect a new leader.
I have a question about the statement "If the leader fails, it needs at least f+1 replicas to elect a new leader" in the quorum-based approach. My question is: why is a quorum (majority) of at least f+1 replicas required to elect a new leader? Why not simply select any replica out of the f+1 in-sync replicas (ISR)? Why do
we need an election instead of just a simple selection?
For the election, how does ZooKeeper elect the final leader out of the remaining replicas? Does it compare which replica has the latest updates? Also, why do I need an odd number (say 3) of ZooKeeper nodes to elect a leader instead of an even number (say 2)?
In a quorum-based system like ZooKeeper, a leader election requires a simple majority of an "ensemble", i.e. the nodes which form the ZooKeeper cluster. So for a 3-node ensemble, one node failure can be tolerated, because the remaining two still form a majority and keep the ensemble operational. In a 4-node ensemble you need at least 3 nodes alive to form a majority, so it too can tolerate only 1 node failure. A 5-node ensemble, on the other hand, can tolerate 2 node failures.
Now you see that a 3-node and a 4-node cluster each effectively tolerate only 1 node failure, so it makes sense to have an odd number of nodes to maximise the number of nodes that can be down for a given cluster size.
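To make the arithmetic concrete, here is a tiny self-contained calculation (plain arithmetic, not any ZooKeeper API) of the majority size and the number of tolerated failures for a few ensemble sizes:

public class QuorumMath {
    public static void main(String[] args) {
        for (int ensembleSize : new int[]{2, 3, 4, 5}) {
            int majority = ensembleSize / 2 + 1;       // smallest strict majority
            int tolerated = ensembleSize - majority;   // failures the ensemble survives
            System.out.printf("ensemble=%d majority=%d tolerated failures=%d%n",
                    ensembleSize, majority, tolerated);
        }
    }
}

It prints that both 3-node and 4-node ensembles tolerate only one failure, which is why the extra fourth node buys you nothing.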
ZooKeeper leader election relies on a Paxos-like protocol called ZAB. Every write goes through the leader; the leader generates a transaction id (zxid) and assigns it to each write request. The id represents the order in which the writes are applied on all replicas. A write is considered successful if the leader receives acks from the majority. An explanation of ZAB.
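To give a feel for how the "most up-to-date" candidate wins, here is a simplified, illustrative sketch of a ZAB-style comparison: candidates are compared by epoch, then by transaction counter, then by server id as a tie-breaker. The Vote class and pickLeader method are made up for illustration; they are not ZooKeeper's actual internals.

import java.util.Comparator;
import java.util.List;

// Illustrative only: a ZAB-style vote carries the candidate's last seen epoch,
// transaction counter and server id; the most up-to-date candidate wins.
record Vote(long epoch, long counter, long serverId) {}

public class LeaderPick {
    static Vote pickLeader(List<Vote> votes) {
        return votes.stream()
                .max(Comparator.comparingLong(Vote::epoch)
                        .thenComparingLong(Vote::counter)
                        .thenComparingLong(Vote::serverId))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Server 3 has the highest (epoch, counter), so it becomes the leader.
        System.out.println(pickLeader(List.of(
                new Vote(5, 100, 1), new Vote(5, 120, 3), new Vote(4, 300, 2))));
    }
}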
My question is why a quorum (majority) of at least f+1 replicas is required to
elect a new leader? Why not simply select any replica out of the f+1
in-sync replicas (ISR)? Why do we need an election instead of
just a simple selection?
As for why election instead of selection - in general, in a distributed system with eventual consistency, you need to have an election because there is no easy way to know which of the remaining nodes has the latest data and is thus qualified to become a leader.
In the case of Kafka, with a setting of multiple replicas and ISRs, there could potentially be multiple nodes whose data is as up to date as the leader's.
Kafka uses ZooKeeper only as an enabler for leader election. If a Kafka partition leader is down, the Kafka cluster controller gets informed of this fact via ZooKeeper, and the controller chooses one of the ISRs to be the new leader. So you can see that this "election" is different from a new leader election in a quorum-based system like ZooKeeper.
Which broker among the ISR is "selected" is a bit more complicated (see) -
Each replica stores messages in a local log and maintains a few important offset positions in the log. The log end offset (LEO) represents the tail of the log. The high watermark (HW) is the offset of the last committed message. Each log is periodically synced to disks. Data before the flushed offset is guaranteed to be persisted on disks.
So when a leader fails, a new leader is elected as follows (a rough code sketch of the follower-side step appears after the list):
Each surviving replica in ISR registers itself in Zookeeper.
The replica that registers first becomes the new leader. The new leader chooses its Log End Offset(LEO) as the new High Watermark (HW).
Each replica registers a listener in Zookeeper so that it will be informed of any leader change. Every time a replica is notified about a new leader:
If the replica is not the new leader (it must be a follower), it truncates its log to its HW and then starts to catch up from the new leader.
The leader waits until all surviving replicas in ISR have caught up or a configured time has passed. The leader writes the current ISR to Zookeeper and opens itself up for both reads and writes.
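Here is a rough, illustrative sketch of the follower-side step described above: on a leader change, the follower truncates its log to its high watermark and then catches up from the new leader starting at its log end offset. The ReplicaLog class is made up for this sketch; it is not Kafka's internal replica code.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- not Kafka's internal replica code.
class ReplicaLog {
    private final List<String> records = new ArrayList<>();
    long highWatermark = 0;                         // HW: last committed offset
    long logEndOffset() { return records.size(); }  // LEO: tail of the log

    void append(String record) { records.add(record); }

    // On a leader change, a follower throws away anything past its HW,
    // because those records were never confirmed as committed.
    void truncateToHighWatermark() {
        while (records.size() > highWatermark) {
            records.remove(records.size() - 1);
        }
    }

    // Then it catches up by fetching from the new leader, starting at its own LEO.
    void catchUpFrom(ReplicaLog newLeader) {
        for (long offset = logEndOffset(); offset < newLeader.logEndOffset(); offset++) {
            append(newLeader.records.get((int) offset));
        }
        highWatermark = newLeader.highWatermark;    // adopt the leader's HW (simplified)
    }
}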
Now you can probably appreciate the benefit of a primary-backup model compared to a quorum model: using the above strategy, a Kafka 3-node cluster with 2 ISRs can tolerate 2 node failures -- including a leader failure -- at the same time and still get a new leader elected (though that new leader would have to reject new writes for a while until one of the failed nodes comes back and catches up with the leader).
The price to pay is of course higher write latency: in a 3-node Kafka cluster with 2 ISRs, the leader has to wait for an acknowledgement from both followers in order to acknowledge the write to the client (producer). Whereas in a quorum model, a write could be acknowledged as soon as one of the followers acknowledges.
So depending upon the use case, Kafka offers the possibility to trade latency for durability. Running with 2 ISRs means sometimes higher write latency, but higher durability. If you run with only one ISR, then in case you lose the leader and an ISR node, you either have no availability or you can choose an unclean leader election, in which case you have lower durability.
Update - Leader election and preferred replicas:
All nodes which make up the cluster are already registered in ZooKeeper. When one of the nodes fails, ZooKeeper notifies the controller node (which is itself elected via ZooKeeper). When that happens, one of the live ISRs is selected as the new leader. But Kafka has the concept of a "preferred replica" to balance leadership distribution across cluster nodes. This is enabled using auto.leader.rebalance.enable=true, in which case the controller will try to hand leadership back to that preferred replica. The preferred replica is the first broker in the partition's assigned replica list (not the ISR list). This is all a bit complicated, but only Kafka admins need to know about it.
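If automatic rebalancing is disabled, an operator can also trigger a preferred-replica election by hand. Below is a minimal sketch using the Java AdminClient (available since Kafka 2.4); the bootstrap address, topic name and partition number are placeholders.

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class PreferredLeaderElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Ask the controller to move leadership back to the preferred (first-listed)
            // replica for partition 0 of "my-topic" -- both are placeholders.
            admin.electLeaders(ElectionType.PREFERRED,
                    Set.of(new TopicPartition("my-topic", 0)))
                 .all().get();
        }
    }
}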
Related
I am reading an article on Kafka basics. It says that if one Kafka broker (brokerX) dies in a cluster, then brokerX's data copies will move to other live brokers in the cluster.
If that is the case, will ZooKeeper / the Kafka controller copy brokerX's data folder and move it to the live brokers, like a copy-paste from one machine's hard disk to another (a physical copy)?
Or do the live brokers share a common location, so that ZooKeeper / the controller will just link / point to brokerX's locations (a logical copy)?
I am having a little difficulty understanding this. Could someone help me?
If a broker dies, it's dead. There's no background process that will copy data off of it.
The replication of topics only happens while the broker is running.
Plus, that image is wrong. The partitions = 2 means exactly that. A third partition doesn't just appear when a broker dies.
This all depends on whether the topic has a replication factor > 1. In that case, brokers holding a follower replica are constantly sending fetch requests to the leader replica (a specific broker), with the goal of staying head-to-head with the leader (both the follower replica and the leader replica having the same records stored on disk).
So when a broker goes down, all it takes is for the controller to select and promote an in-sync replica (by default, though it can be configured to select non-in-sync replicas) to take over as the leader of the partition. No copy/paste is required: all brokers holding a replica of that partition (as a follower replica or leader replica) were storing the same information prior to the shutdown.
If a broker dies, the behaviour depends on the role of the dead broker. If it was not the leader for its partition, it's no problem: when the broker returns online it will copy all missing data from the leader replica. If the dead broker was the leader for the partition, a new leader will be elected according to some rules. If the newly elected leader was in sync before the old leader died, there will be no message loss, and the follower brokers will sync their replicas from the new leader, as will the broken leader when it is up again. If the newly elected leader was not in sync, you might have some message loss. In any case, you can drive the behaviour of your Kafka cluster by setting various parameters to balance speed, data integrity and reliability.
I am trying to understand how failover and replication factors work in kafka.
Let's say my cluster has 3 brokers and the replication factor is also 3. In this case each broker will have one copy of the partition, and one of the brokers is the leader. If the leader broker fails, then one of the follower brokers will become leader, but now the replication factor is down to 2. At this point, if I add a new broker to the cluster, will Kafka make sure that the replication factor is 3 and copy the required data onto the new broker?
How will the above scenario work if my cluster already has an additional broker?
In your setup (3 brokers, 3 replicas), when 1 broker fails Kafka will automatically elect new leaders (on the remaining brokers) for all the partitions whose leaders were on the failed broker.
The replication factor does not change. The replication factor is a topic configuration that can only be changed by the user.
Similarly the Replica list does not change. This lists the brokers that should host each partition.
However, the In Sync Replicas (ISR) list will change and only contain the 2 remaining brokers.
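You can observe this with the Java AdminClient (3.1+): describe the topic and compare the replica list with the ISR. After a broker failure the replica list stays the same while the ISR shrinks. The bootstrap address and topic name below are placeholders.

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(Set.of("my-topic"))    // placeholder topic
                    .allTopicNames().get().get("my-topic");
            // Replicas lists the assigned brokers; isr lists only those currently in sync.
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}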
If you add another broker to the cluster, what happens depends on its broker.id:
if the broker.id is the same as the broker that failed, this new broker will start replicating data and eventually join the ISR for all the existing partitions.
if it uses a different broker.id, nothing will happen. You will be able to create new topics with 3 replicas (which is not possible while there are only 2 brokers), but Kafka will not automatically replicate existing partitions. You can manually trigger a reassignment if needed, see the docs (and the sketch below).
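Such a reassignment can be triggered with the kafka-reassign-partitions.sh tool or, since Kafka 2.4, through the Java AdminClient. A minimal sketch follows; the bootstrap address, topic, partition and target broker ids are placeholders.

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class Reassign {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Move partition 0 of "my-topic" onto brokers 1, 2 and the newly added broker 4
            // (all ids are placeholders for this sketch).
            admin.alterPartitionReassignments(Map.of(
                    new TopicPartition("my-topic", 0),
                    Optional.of(new NewPartitionReassignment(List.of(1, 2, 4)))))
                 .all().get();
        }
    }
}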
Leaving partitions aside (which are another Kafka concept):
The replication factor does not say how many times a topic is replicated, but rather how many times it should be replicated. It is not affected by brokers shutting down.
Once a leader broker shuts down, the "leader" status goes over to another broker which is in sync, that means a broker that has the current state replicated and is not behind. Electing "leader" status to a broker that is not in sync would obviously lead to data loss, so this will never happen (when using the right settings).
The replicas eligible for taking over "leader" status are called in-sync replicas (ISR). This matters because of the configuration min.insync.replicas, which specifies how many in-sync replicas (the leader included) must have a message before a write with acks=all is acknowledged. Its minimum and default value is 1: a message is acknowledged as "successful" as soon as it enters the "leader" broker, so if this broker dies before replication happens, all data that was not replicated yet is lost. If min.insync.replicas is set to 2, every message waits with the acknowledgement until at least one follower replica has it in addition to the leader, so if the leader dies now, there is a replica covering this data. If there are not enough brokers to cover the minimum number of in-sync replicas, writes to the partition will fail.
So to answer your question: if you had 2 running brokers, min.insync.replicas=1 (default) and replication factor of 3, your cluster runs fine and will add a replica as soon as you start up another broker. If another of the 2 brokers dies before you launch the third one, you will run into problems.
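For reference, a topic with this kind of durability setup (replication factor 3, min.insync.replicas=2) could be created programmatically like this; a sketch using the Java AdminClient, with placeholder broker address and topic name.

import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 3, and at least 2 in-sync replicas per write.
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3)              // placeholder name
                    .configs(Map.of(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}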
The controller elects a new leader from the ISR for a partition when the current one dies. My understanding is that this data is persisted in Zookeeper. What happens when a Zookeeper node dies during this write? Can this mean that some brokers might still see a different leader for the partition whose leader was just re-elected?
I tried digging around the docs but could not find anything satisfactory.
My understanding is that this data is persisted in Zookeeper
Yes. This ISR set is persisted to ZooKeeper whenever it changes. (Reference)
What happens when a Zookeeper node dies during this write?
Zookeeper works on a quorum, which means a majority of the servers in the cluster. (See this SO answer; snippet below)
With a 3 node cluster, the majority is 2 nodes. So you can tolerate
only 1 node not being in sync at the same time.
With a 5 node cluster, the majority is 3 nodes. So you can tolerate
only 2 nodes not being in sync at the same time.
So long as there is a majority, the decision is made, and so the leader election will continue.
When we run a topic describe, we can see in the fourth column the list of replicas in order of preference, and I always see that the first replica in the list is the same as the leader. I wonder: can the preferred replica and the leader be different brokers?
They can be different
In an ideal scenario, the leader for a given partition should be the "preferred replica". This guarantees that the leadership load across the brokers in a cluster are evenly balanced. However, over time the leadership load could get imbalanced due to broker shutdowns (caused by controlled shutdown, crashes, machine failures etc).
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-1.PreferredReplicaLeaderElectionTool
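To check this on a running cluster, you can compare each partition's current leader with the first entry of its replica list (the preferred replica). A sketch using the Java AdminClient (3.1+); the bootstrap address and topic name are placeholders.

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class PreferredVsLeader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            admin.describeTopics(Set.of("my-topic"))                             // placeholder topic
                 .allTopicNames().get().get("my-topic")
                 .partitions()
                 .forEach(p -> System.out.printf(
                         "partition=%d preferred=%d leader=%d%n",
                         p.partition(),
                         p.replicas().get(0).id(),   // first replica = preferred replica
                         p.leader().id()));
        }
    }
}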
Suppose I want highly available Kafka in production for a small deployment.
I have to use the following configs:
min.insync.replicas=2 // Don't want to lose messages in case of 1 broker crash
default.replication.factor=3 // Will let producer write in case of 1 replica disappear with broker crash
Will Kafka start making a new replica in case 1 broker crashes and 1 replica is gone with it?
Do we have to have at least default.replication.factor brokers at all times to keep working?
In order to enable high availability in Kafka you need to take into account the following factors:
1. Replication factor: By default, the replication factor is set to 1. The recommended replication factor for production environments is 3, which means that at least 3 brokers are required.
2. Preferred Leader Election: When a broker is taken down, one of the replicas becomes the new leader for a partition. Once the broker that failed is up and running again, it has no leader partitions; it restores the information it missed while it was down, and then it becomes the partition leader again. Preferred leader election is enabled by default. In order to minimize the risk of losing messages when switching back to the preferred leader, you need to set the producer property acks to all (obviously this comes at a performance cost).
3. Unclean Leader Election:
You can enable unclean leader election in order to allow an out-of-sync replica to become the leader and maintain high availability of a partition. With unclean leader election, messages that were not synced to the new leader are lost. There is a trade-off between consistency and high availability meaning that with unclean leader election disabled, if a broker containing the leader replica for a partition becomes unavailable, and no in-sync replica exists to replace it, the partition becomes unavailable until the leader replica or another in-sync replica is back online.
4. Acknowledgements:
Acknowledgements refer to the number of replicas that commit a new message before the message is acknowledged, controlled by the acks property. When acks is set to 0, the message is acknowledged immediately without waiting for other brokers to commit. When set to 1, the message is acknowledged once the leader commits the message. Configuring acks to all provides the highest consistency guarantee, but slower writes to the cluster.
5. Minimum in-sync replicas: min.insync.replicas defines the minimum number of in-sync replicas that must be available for the producer to successfully send messages to a partition. If min.insync.replicas is set to 2 and acks is set to all, each message must be written successfully to at least two replicas. This means the messages won't be lost unless both brokers fail (unlikely). If one of the brokers fails, the partition will no longer be available for writes. Again, this is a trade-off between consistency and availability (see the producer sketch after this list).
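Tying points 4 and 5 together, here is a minimal producer sketch that uses acks=all against a topic whose min.insync.replicas is 2, so each send is only acknowledged once the leader plus at least one follower have the record. The broker address and topic name are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas (at least min.insync.replicas of them) to ack the write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),   // placeholder topic
                    (metadata, exception) -> {
                        if (exception != null) {
                            // e.g. NotEnoughReplicasException when fewer than
                            // min.insync.replicas brokers are in sync
                            exception.printStackTrace();
                        } else {
                            System.out.println("committed at offset " + metadata.offset());
                        }
                    });
        }
    }
}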
Well, you can have replication.factor the same as min.insync.replicas, but there may be some challenges.
As we know, during a broker outage all partition replicas present on that broker become unavailable. The availability of the affected partitions is then determined by the existence and status of their other replicas.
If a partition has no additional replica, the partition becomes totally unavailable. But if a partition has additional replicas that are in sync, one of these in-sync replicas will become the interim partition leader. If the partition has additional replicas but none are in sync, we have a choice to make: either we wait for the partition leader to come back online (sacrificing availability), or we allow an out-of-sync replica to become the interim partition leader (sacrificing consistency).
So in that case, it becomes important for any partition to have an extra in-sync replica available in order to survive the loss of the partition leader.
That implies that min.insync.replicas should be set to at least 2.
In order to have a minimum ISR size of 2, replication-factor must be at least 2 as well.
However, if there are only 2 replicas and one broker is unavailable, the ISR size will decrease to 1, which is below the minimum. Hence, it is better to have the replication factor greater than the minimum ISR size (i.e. at least 3).