Can the preferred replica and the leader be different Brokers? - apache-kafka

When we run a topic describe, the fourth column shows the list of replicas in preferred-replica order, and I always see that the first replica in the list is the same as the leader. I wonder: can the preferred replica and the leader be different brokers?

They can be different
In an ideal scenario, the leader for a given partition should be the "preferred replica". This guarantees that the leadership load across the brokers in a cluster is evenly balanced. However, over time the leadership load can become imbalanced due to broker shutdowns (caused by controlled shutdown, crashes, machine failures, etc.).
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-1.PreferredReplicaLeaderElectionTool
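For example, if leadership has drifted away from the preferred replicas, you can trigger a preferred leader election by hand. A minimal sketch, assuming Kafka 2.4+ (older versions ship kafka-preferred-replica-election.sh instead) and a broker reachable at localhost:9092:

# ask the controller to hand leadership back to the preferred (first-listed) replica
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type PREFERRED --all-topic-partitions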

Related

Apache kafka about replica and partitions

I tried to follow https://medium.com/@iet.vijay/kafka-multi-brokers-multi-consumers-and-message-ordering-b61ad7841875 to create multiple brokers and consumers.
I am able to produce messages and consume them.
When I try to describe the topic, the output below is what I got.
Can someone explain the partitions, leader, and replicas here in the above image?
All producer and consumer requests are sent to the leader broker, which is elected by the Kafka Controller.
Replicas are the non-leader brokers. Replicas can be in or out of sync with the leader (ISR = "in-sync replica").
The numbers shown are the broker.id values from each broker's properties, which default to incrementing from 0 if not set.
More details at https://kafka.apache.org/documentation/#replication
Worth pointing out that running multiple brokers on a single host is less than ideal; you still have a single point of failure, and you cause unnecessary duplicate writes to a single hard drive for each replica.
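For illustration, here is what the describe output might look like for a hypothetical topic on a three-broker cluster (the topic name and broker ids are assumptions, and the exact columns vary a bit between Kafka versions):

bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-topic
Topic: my-topic  PartitionCount: 2  ReplicationFactor: 3  Configs:
    Topic: my-topic  Partition: 0  Leader: 0  Replicas: 0,1,2  Isr: 0,1,2
    Topic: my-topic  Partition: 1  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0

Here Leader is the broker.id currently serving that partition, Replicas is the full assignment (preferred replica first), and Isr is the subset of replicas currently in sync.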

How failover works in Kafka along with keeping up the replication factor

I am trying to understand how failover and the replication factor work in Kafka.
Let's say my cluster has 3 brokers and the replication factor is also 3. In this case each broker will have one copy of a partition, and one of the brokers is the leader. If the leader broker fails, then one of the follower brokers will become the leader, but now the replication factor is down to 2. At this point, if I add a new broker to the cluster, will Kafka make sure that the replication factor is 3, and will it copy the required data to the new broker?
How will the above scenario work if my cluster already has an additional broker?
In your setup (3 broker, 3 replicas), when 1 broker fails Kafka will automatically elect new leaders (on the remaining brokers) for all the partitions whose leaders were on the failing broker.
The replication factor does not change. The replication factor is a topic configuration that can only be changed by the user.
Similarly the Replica list does not change. This lists the brokers that should host each partition.
However, the In Sync Replicas (ISR) list will change and only contain the 2 remaining brokers.
If you add another broker to the cluster, what happens depends on its broker.id:
if the broker.id is the same as the broker that failed, this new broker will start replicating data and eventually join the ISR for all the existing partitions.
if it uses a different broker.id, nothing will happen. You will be able to create new topics with 3 replicas (which is not possible while there are only 2 brokers), but Kafka will not automatically replicate existing partitions. You can manually trigger a reassignment if needed; see the docs and the sketch below.
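A minimal sketch of that manual reassignment, assuming a recent Kafka version (older ones use --zookeeper instead of --bootstrap-server); the topic name, broker ids, and file name are placeholders:

# reassign.json: put partition 0 of my-topic onto brokers 1, 2 and 4 (the new broker)
{"version":1,"partitions":[{"topic":"my-topic","partition":0,"replicas":[1,2,4]}]}

# apply the reassignment
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassign.json --execute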
Leaving out partitions (which is another concept of Kafka):
The replication factor does not say how many times a topic is replicated, but rather how many times it should be replicated. It is not affected by brokers shutting down.
Once a leader broker shuts down, the "leader" status goes over to another broker which is in sync, meaning a broker that has the current state replicated and is not behind. Electing a broker that is not in sync as leader would obviously lead to data loss, so this will never happen (when using the right settings).
The replicas eligible for taking over the leader status are called in-sync replicas (ISR). This matters because of the configuration min.insync.replicas, which specifies how many in-sync replicas (the leader included) must have a message before it is acknowledged when the producer uses acks=all. If this is set to 1 (the default), every message is acknowledged as "successful" as soon as it reaches the leader broker; if that broker then dies, all data that was not yet replicated is lost. If min.insync.replicas is set to 2, the acknowledgement waits until at least one follower replica also has the message, so if the leader dies now, there is a replica covering this data. If there are not enough brokers alive to cover the minimum number of replicas, writes to the partition will fail.
So to answer your question: if you had 2 running brokers, min.insync.replicas=1 (the default) and a replication factor of 3, your cluster runs fine and will add a replica as soon as you start up another broker. If another of the 2 brokers dies before you launch the third one, you will run into problems.
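For instance, a topic matching that setup could be created like this; a sketch assuming a broker on localhost:9092 and a placeholder topic name (older Kafka versions use --zookeeper instead of --bootstrap-server):

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --topic my-topic --partitions 3 --replication-factor 3 \
  --config min.insync.replicas=2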

Kafka setup strategy for replication?

I have two VM servers (say S1 and S2) and need to install Kafka in cluster mode, where there will be a topic with only one partition and two replicas (one the leader itself and the other a follower) for reliability.
I got a high-level idea from this cluster setup. I want to confirm whether the strategy below is correct.
First, set up ZooKeeper as a cluster on both nodes for high availability (HA). If I set up ZK on a single node only and that node goes down, the complete cluster will be down. Right? Is it mandatory to use ZK in the latest Kafka version as well? It looks like it is a must for older versions: Is Zookeeper a must for Kafka?
Start the Kafka broker on both nodes. It can be on the same port, as it is hosted on different nodes.
Create a topic on any node with one partition and two replicas.
ZooKeeper will select a broker on one node as leader and the other as follower.
The producer will connect to any broker and start publishing messages.
If the leader goes down, ZooKeeper will select another node as leader automatically. I am not sure how a replication factor of 2 will be maintained now, as there is only one node live?
Is the above strategy correct?
Useful resources
ISR
ISR vs replication factor
First, set up ZooKeeper as a cluster on both nodes for high availability (HA). If I set up ZK on a single node only and that node goes down, the complete cluster will be down. Right? Is it mandatory to use ZK in the latest Kafka version as well? It looks like it is a must for older versions: Is Zookeeper a must for Kafka?
Answer: Yes. ZooKeeper is still a must until KIP-500 is released. ZooKeeper is responsible for electing the controller, storing metadata about the Kafka cluster, and managing broker membership (link). Ideally the number of ZooKeeper nodes should be at least 3; that way you can tolerate one node failure (2 healthy ZooKeeper nodes, still a majority of the cluster, remain capable of electing a controller). You should also consider setting up the ZooKeeper cluster on machines other than the ones Kafka is installed on, so that the failure of one server won't lead to the loss of both a ZooKeeper and a Kafka node.
Start the Kafka broker on both nodes. It can be on the same port, as it is hosted on different nodes.
Answer: You should first start the ZooKeeper cluster, then the Kafka cluster. The same port on different nodes is fine.
Create a topic on any node with one partition and two replicas.
Answer: Partitions are used for horizontal scalability. If you don't need this, one partition is okay. With a replication factor of 2, one of the nodes will be leader and the other will be follower at any time. But this is not enough to avoid data loss completely or to provide HA. In the ideal configuration for avoiding data loss without compromising HA, you should have at least 3 Kafka brokers, a replication factor of 3 for topics, min.insync.replicas=2 as a broker config, and acks=all as a producer config. (You can check this for more information.)
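On the producer side, acks=all can be passed as a producer property. A minimal sketch with the console producer, using a placeholder topic name (older Kafka versions use --broker-list instead of --bootstrap-server):

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic my-topic --producer-property acks=all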
ZooKeeper will select a broker on one node as leader and the other as follower.
Answer: The Controller broker is responsible for maintaining the leader/follower relationship for all partitions. One broker will be the partition leader and the other will be the follower. You can check partition leaders/followers with this command:
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-replicated-topic
The producer will connect to any broker and start publishing messages.
Answer: Yes. Setting only one broker in bootstrap.servers is enough to connect to the Kafka cluster, but for redundancy you should provide more than one broker in bootstrap.servers.
bootstrap.servers: A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping; this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
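A minimal sketch of a client configuration with redundant bootstrap servers, using the S1/S2 hosts from the question as placeholders:

bootstrap.servers=S1:9092,S2:9092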
If the leader goes down, ZooKeeper will select another node as leader automatically. I am not sure how a replication factor of 2 will be maintained now, as there is only one node live?
Answer: If the Controller broker goes down, ZooKeeper will select another broker as the new Controller. If the broker which is the leader of your partition goes down, one of the in-sync replicas will become the new leader (the Controller broker is responsible for this). But of course, once only one of your two brokers is left, maintaining two replicas won't be possible. That's why you should have at least 3 brokers in your Kafka cluster.
Yes - ZooKeeper is still needed on Kafka 2.4, but you can read about KIP-500 which plans to remove the dependency on ZooKeeper in the near future and start using the Raft algorithm in order to create the quorum.
As you already understood, if you install ZK on a single node it will work in standalone mode and you won't have any resiliency. The classic ZK ensemble consists of 3 nodes, which allows you to lose 1 ZK node.
After pointing your Kafka brokers to the right ZK cluster you can start your brokers and the cluster will be up and running.
In your example, I would suggest creating another node in order to gain better resiliency and meet the replication factor that you wanted, while still being able to lose one node without losing data.
Bear in mind that using a single partition means you are bound to a single consumer per consumer group; the rest of the consumers will be idle, as illustrated below.
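For example, starting the console consumer twice with the same group on a one-partition topic (topic and group names are placeholders) leaves one of the two consumers idle:

# run this in two terminals; only one instance will receive messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic my-topic --group my-group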
I suggest you read this blog about Kafka best practices and how to choose the number of topics/partitions in a Kafka cluster.

Kafka: Quorum-based approach to elect the new leader?

There are two common strategies for keeping replicas in sync: primary-backup replication and quorum-based replication, as stated here.
In primary-backup replication, the leader waits until the write completes on every replica in the group before acknowledging the client. If one of the replicas is down, the leader drops it from the current group and continues to write to the remaining replicas. A failed replica is allowed to rejoin the group if it comes back and catches up with the leader. With f replicas, primary-backup replication can tolerate f-1 failures.
In the quorum-based approach, the leader waits until a write completes on a majority of the replicas. The size of the replica group doesn't change even when some replicas are down. If there are 2f+1 replicas, quorum-based replication can tolerate f replica failures. If the leader fails, it needs at least f+1 replicas to elect a new leader.
I have a question about the statement "If the leader fails, it needs at least f+1 replicas to elect a new leader" in the quorum-based approach. My question is: why is a quorum (majority) of at least f+1 replicas required to elect a new leader? Why not select any replica out of the f+1 in-sync replicas (ISR)? Why do we need an election instead of a simple selection?
For the election, how does ZooKeeper elect the final leader out of the remaining replicas? Does it compare which replica is the latest updated? Also, why do I need an uneven number (say 3) of ZooKeeper nodes to elect a leader instead of an even number (say 2)?
Also, why do I need an uneven number (say 3) of ZooKeeper nodes to elect a leader instead of an even number (say 2)?
In a quorum-based system like ZooKeeper, a leader election requires a simple majority out of an "ensemble", i.e., the nodes which form the ZK cluster. For a 3-node ensemble, one node failure can be tolerated, since the remaining two can form a new ensemble and remain operational. In a four-node ensemble, on the other hand, you need at least 3 nodes alive to form a majority, so it can also tolerate only 1 node failure. A five-node ensemble can tolerate 2 node failures.
Now you see that a 3-node or a 4-node cluster can effectively tolerate only 1 node failure, so it makes sense to have an odd number of nodes to maximise the number of nodes that can be down for a given cluster size.
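The arithmetic behind this: a majority out of n nodes is floor(n/2) + 1, so the number of tolerable failures is n - (floor(n/2) + 1).

n (ensemble size)   majority needed   failures tolerated
3                   2                 1
4                   3                 1
5                   3                 2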
ZK leader election relies on a Paxos-like protocol called ZAB. Every write goes through the leader, and the leader generates a transaction id (zxid) and assigns it to each write request. The id represents the order in which the writes are applied on all replicas. A write is considered successful if the leader receives the ack from the majority. An explanation of ZAB.
My question is: why is a quorum (majority) of at least f+1 replicas required to elect a new leader? Why not select any replica out of the f+1 in-sync replicas (ISR)? Why do we need an election instead of a simple selection?
As for why election instead of selection - in general, in a distributed system with eventual consistency, you need to have an election because there is no easy way to know which of the remaining nodes has the latest data and is thus qualified to become a leader.
In the case of Kafka, for a setting with multiple replicas and ISRs, there could potentially be multiple nodes with data as up to date as that of the leader.
Kafka uses ZooKeeper only as an enabler for leader election. If a Kafka partition leader is down, the Kafka cluster controller gets informed of this fact via ZK, and the cluster controller chooses one of the ISRs to be the new leader. So you can see that this "election" is different from a new leader election in a quorum-based system like ZK.
Which broker among the ISR is "selected" is a bit more complicated (see) -
Each replica stores messages in a local log and maintains a few important offset positions in the log. The log end offset (LEO) represents the tail of the log. The high watermark (HW) is the offset of the last committed message. Each log is periodically synced to disks. Data before the flushed offset is guaranteed to be persisted on disks.
So when a leader fails, a new leader is elected by following:
Each surviving replica in ISR registers itself in Zookeeper.
The replica that registers first becomes the new leader. The new leader chooses its Log End Offset (LEO) as the new High Watermark (HW).
Each replica registers a listener in ZooKeeper so that it will be informed of any leader change. Every time a replica is notified about a new leader:
If the replica is not the new leader (it must be a follower), it truncates its log to its HW and then starts to catch up from the new leader.
The leader waits until all surviving replicas in ISR have caught up or a configured time has passed. The leader writes the current ISR to Zookeeper and opens itself up for both reads and writes.
Now you can probably appreciate the benefit of a primary-backup model compared to a quorum model: using the above strategy, a 3-node Kafka cluster with 2 ISRs can tolerate 2 node failures, including a leader failure, at the same time and still get a new leader elected (though that new leader would have to reject new writes for a while, until one of the failed nodes comes back and catches up with the leader).
The price to pay is of course higher write latency: in a 3-node Kafka cluster with 2 ISRs, the leader has to wait for an acknowledgement from both followers in order to acknowledge the write to the client (producer), whereas in a quorum model a write can be acknowledged as soon as one of the followers acknowledges it.
So depending on the use case, Kafka offers the possibility to trade latency for durability. Two ISRs mean sometimes higher write latency, but higher durability. If you run with only one ISR, then if you lose the leader and the ISR node, you either have no availability, or you can choose an unclean leader election, in which case you have lower durability.
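If you do decide to allow unclean leader election for a topic, it can be toggled per topic. A sketch assuming a recent Kafka version and a placeholder topic name:

bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config unclean.leader.election.enable=true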
Update - Leader election and preferred replicas:
All nodes which make up the cluster are already registered in ZK. When one of the nodes fails, ZK notifies the controller node (which is itself elected by ZK). When that happens, one of the live ISRs is selected as the new leader. But Kafka has the concept of a "preferred replica" to balance leadership distribution across cluster nodes. This is enabled using auto.leader.rebalance.enable=true, in which case the controller will try to hand leadership back to the preferred replica. The preferred replica is the first broker in the replica assignment list. This is all a bit complicated, but only Kafka admins need to know about it.
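The broker-side settings (server.properties) that control this rebalancing look like the following; the values shown are the Kafka defaults:

# rebalance leadership back to preferred replicas automatically
auto.leader.rebalance.enable=true
# how often the controller checks for imbalance, and the allowed imbalance per broker
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10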

Kafka: What is the minimum number of brokers required for high availability?

Suppose I want to have a highly available Kafka in production on a small deployment.
I have to use following configs
min.insync.replicas=2 // Don't want to lose messages in case of 1 broker crash
default.replication.factor=3 // Will let producer write in case of 1 replica disappear with broker crash
Will Kafka start making a new replica if one broker crashes and one replica is gone with it?
Do we have to have at least default.replication.factor brokers under all conditions to keep working?
In order to enable high availability in Kafka you need to take into account the following factors:
1. Replication factor: By default, the replication factor is set to 1. The recommended replication factor for production environments is 3, which means that 3 brokers are required.
2. Preferred Leader Election: When a broker is taken down, one of the replicas becomes the new leader for a partition. Once the failed broker is up and running again, it initially has no leader partitions; Kafka restores the information it missed while it was down, and the broker becomes the partition leader again. Preferred leader election is enabled by default. In order to minimize the risk of losing messages when switching back to the preferred leader, you need to set the producer property acks to all (obviously, this comes at a performance cost).
3. Unclean Leader Election:
You can enable unclean leader election in order to allow an out-of-sync replica to become the leader and maintain high availability of a partition. With unclean leader election, messages that were not synced to the new leader are lost. There is a trade-off between consistency and high availability meaning that with unclean leader election disabled, if a broker containing the leader replica for a partition becomes unavailable, and no in-sync replica exists to replace it, the partition becomes unavailable until the leader replica or another in-sync replica is back online.
4. Acknowledgements:
Acknowledgements refer to the number of replicas that commit a new message before the message is acknowledged using acks property. When acks is set to 0 the message is immediately acknowledged without waiting for other brokers to commit. When set to 1, the message is acknowledged once the leader commits the message. Configuring acks to all provides the highest consistency guarantee but slower writes to the cluster.
5. Minimum in-sync replicas: min.insync.replicas defines the minimum number of in-sync replicas that must be available for the producer to successfully send messages to the partition. If min.insync.replicas is set to 2 and acks is set to all, each message must be written successfully to at least two replicas. This means that messages won't be lost unless both brokers fail (unlikely). If one of the brokers fails, the partition will no longer be available for writes. Again, this is a trade-off between consistency and availability.
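For reference, min.insync.replicas can be set cluster-wide in server.properties or per topic after creation. A sketch with a placeholder topic name, assuming a recent Kafka version:

bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config min.insync.replicas=2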
Well, you can have replication.factor the same as min.insync.replicas, but there may be some challenges.
As we know, during a broker outage all partition replicas present on that broker become unavailable. At that point, the availability of the affected partitions is determined by the existence and status of their other replicas.
If a partition has no additional replica, the partition becomes totally unavailable. But if a partition has additional replicas that are in sync, one of these in-sync replicas will become the interim partition leader. If the partition has additional replicas but none is in sync, we have a choice to make: either we wait for the partition leader to come back online (sacrificing availability), or we allow an out-of-sync replica to become the interim partition leader (sacrificing consistency).
So in that case, it becomes important for any partition to have an extra in-sync replica available to survive the loss of the partition leader.
That implies that min.insync.replicas should be set to at least 2.
In order to have a minimum ISR size of 2, replication-factor must be at least 2 as well.
However, if there are only 2 replicas and one broker is unavailable, the ISR size will decrease to 1, below the minimum. Hence, it is better to have the replication factor greater than the minimum ISR size (at least 3).