Kafka rack-id and min in-sync replicas - apache-kafka

Kafka has introduced rack-id to provide redundancy capabilities if a whole rack fails.
There is a min in-sync replica setting to specify the minimum number of replicas that need to be in-sync before a producer receives an ack (-1 / all config).
There is an unclean leader election setting to specify whether a leader can be elected when it is not in-sync.
So, given the following scenario:
Two racks. Rack 1, 2.
Replication count is 4.
Min in-sync replicas = 2
Producer ack=-1 (all).
Unclean leader election = false
Aiming to have at-least-once message delivery, redundancy of nodes, and tolerance to a rack failure.
Is it possible that there is a moment where the two in-sync replicas both come from rack 1, so the producer receives an ack and at that point rack 1 crashes (before any replicas from rack 2 are in-sync)?
This means that rack 2 would only contain out-of-sync (unclean) replicas, and no producer would be able to add messages to the partition, essentially grinding it to a halt. The replicas would be unclean, so no new leader could be elected in any case.
Is my analysis correct, or is there something under the hood to ensure that the replicas forming min in-sync replicas have to be from different racks?
Since replicas on the same rack would have lower latency, it seems that the above scenario is reasonably likely.
(The original question illustrated this scenario with a diagram.)

To be technically correct, some of the question's wording should be fixed: it is not possible to have out-of-sync replicas "available". Also, the min in-sync replicas setting specifies the minimum number of replicas that need to be in sync for the partition to remain available for writes. When a producer specifies acks=-1 (all), it will still wait for acks from all replicas that are in sync at that moment, independent of the setting for min in-sync replicas. So if you publish when 4 replicas are in sync, you will not get an ack unless all 4 replicas commit the message (even if min in-sync replicas is configured as 2).

It is still possible to construct a scenario similar to your question that highlights the same trade-off: have the 2 replicas in rack 2 fall out of sync first, then publish while the only 2 in-sync replicas are in rack 1, and then take rack 1 down. In that case the partition would be unavailable for reads or writes. So the easiest fix to this problem would be to increase min in-sync replicas to 3. Another, less fault-tolerant fix would be to reduce the replication factor to 3.
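For illustration, here is a minimal sketch of that first fix using the Java AdminClient (the topic name and bootstrap address are made up): a topic with replication factor 4 whose min.insync.replicas is 3, so an acknowledged write must exist on at least one replica outside any single rack.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address

        try (AdminClient admin = AdminClient.create(props)) {
            // 1 partition, replication factor 4 (2 brokers per rack in the scenario above)
            NewTopic topic = new NewTopic("rack-test", 1, (short) 4)
                    // require 3 in-sync replicas so an ack implies at least one copy
                    // lives outside any single rack
                    .configs(Map.of("min.insync.replicas", "3"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}

Note that rack-aware replica placement only happens if every broker sets broker.rack in its own configuration; with 2 racks and a replication factor of 4 you then get 2 replicas per rack, which is why requiring 3 acknowledged replicas guarantees a copy in each rack.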

Yes, I think it is possible, because Kafka can only maintain the ISR according to the facts at runtime, not according to the spirit of the configuration.
Words from https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
for each partition of a topic, we maintain an in-sync replica set (ISR). This is the set of replicas that are alive and have fully caught up with the leader (note that the leader is always in ISR). When a partition is created initially, every replica is in the ISR. When a new message is published, the leader waits until it reaches all replicas in the ISR before committing the message. If a follower replica fails, it will be dropped out of the ISR and the leader then continues to commit new messages with fewer replicas in the ISR. Notice that now, the system is running in an under replicated mode.
Words from https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Replication
After a configured timeout period, the leader will drop the failed follower from its ISR and writes will continue on the remaining replicas in ISR. If the failed follower comes back, it first truncates its log to the last checkpointed HW. It then starts to catch up all messages after its HW from the leader. When the follower fully catches up, the leader will add it back to the current ISR.
The min in-sync replicas setting you mentioned is just a lower bound; the ISR size does not depend on it. The setting means that if the producer's acks is "all" and the ISR size is smaller than the minimum, then Kafka will refuse to write the message.
So at first the ISR is {1,2,3,4}, and if broker 3 or 4 goes down or falls behind, it will be dropped from the ISR, and the case you mentioned can happen.
Once rack 1's brokers fail, electing a leader from rack 2 would require an unclean leader election.
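To make the "Kafka will refuse to write this message" part concrete, here is a minimal producer sketch (broker address and topic name are made up) showing where that refusal surfaces when acks is "all":

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for every replica currently in the ISR

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("rack-test", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // If the ISR is smaller than min.insync.replicas, the broker answers with
                    // NOT_ENOUGH_REPLICAS; the producer retries and the send eventually fails here
                    // (typically as a NotEnoughReplicasException or a TimeoutException).
                    System.err.println("Write refused/failed: " + exception);
                } else {
                    System.out.println("Committed at offset " + metadata.offset());
                }
            });
            producer.flush();
        }
    }
}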

Related

ack=all and min ISR

If my producers have acks=all and the ISR is 2 and there are 2 partitions, what is the scenario?
The number of partitions is irrelevant to the acks setting. The producer always writes to the leader replica of the partition chosen by the client-side partitioner.
When you set acks=all, each send blocks until the record has been written to every replica that is currently in the ISR of that partition (which can be as many as the topic's replication factor), not just min.insync.replicas of them.
The min.insync.replicas setting determines when the partition becomes unavailable for acks=all writes because too many brokers are offline: if the ISR shrinks below it, the producer receives a NotEnoughReplicas error instead of an ack.
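As a small illustration of that separation (topic name and broker address are made up): the partitioner alone decides where a keyless record goes, while acks=all only controls how long each send waits for replication.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitionVsAcks {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // affects how long each send waits, not where it goes

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 5; i++) {
                // no key: the partitioner picks the partition; acks plays no part in that choice
                RecordMetadata md =
                        producer.send(new ProducerRecord<>("two-partition-topic", "msg-" + i)).get();
                System.out.printf("msg-%d went to partition %d at offset %d%n", i, md.partition(), md.offset());
            }
        }
    }
}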

A topic has 3 replicas and min.insync.replicas set to 2; what happens when both the leader and the in-sync replica go down and the consumer wants to read?

I am preparing for the CCDAK certification and I stepped into this question:
"A topic has three replicas and you set min.insync.replicas to 2. If two out of three replicas are not available, what happens when a consume request is sent to broker?"
The given answer is:
Data will be returned by the remaining in-sync replica
I had a doubt about a corner case: what happens when both the leader and the in-sync replica go down at the same time (with the third replica still not in sync) and the consumer wants to read a message that is not present in the third replica (it may not have been copied yet, since that replica is not in sync)?
Would a practical example be something like this: let's say the consumer has committed offset 10 and wants to read from 11, but the third replica only has messages up to offset 9. The leader and follower are down; what would happen? Will the consumer just wait for the other brokers to return, or will Kafka throw an error I have to deal with?
Your consumer will not be able to make progress because Kafka cannot elect a leader for that partition: the remaining replica is not in sync and, with unclean leader election disabled, Kafka will wait for one of the other replicas to come back before electing a leader. In this scenario, your consumer will hang.
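For what it's worth, here is a minimal consumer sketch (group id, topic name and broker address are made up) of what "hang" looks like in practice: poll() does not throw for a leaderless partition, it just keeps returning nothing until a leader can be elected again.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OfflinePartitionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "ccdak-demo");               // made-up group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("three-replica-topic")); // made-up topic
            for (int i = 0; i < 30; i++) {
                // While the partition has no leader, poll() simply returns no records for it;
                // the client keeps refreshing metadata and resumes fetching once a leader exists.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                System.out.println("fetched " + records.count() + " records");
            }
        }
    }
}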

Does Kafka producer write keyless message to another partition when a partition is down?

For example, I have a topic that has 2 partitions and a producer using the default partitioner (round-robin, I assumed) writes to the topic. At some point, partition 1 becomes unavailable because all of its replica brokers go offline. Assuming the messages have no keys, will the producer resend those messages to partition 2, or does it simply get stuck?
That is an interesting question and we should look at it from a broader (cluster) perspective.
At some point, partition 1 becomes unavailable because all of the replica brokers go offline.
I see the following scenarios:
All replica brokers of partition one are different to the replica brokers of partition two.
All replica brokers of partition one are the same as for partition two.
Some replica brokers are the same as for partition two.
In scenario "1" the brokers hosting partition 2 are still alive (the replication factor is a topic-wide, not a partition-based, configuration), so partition 2 stays available. Partition 1, however, remains offline until its brokers come back or its replicas are manually reassigned to other brokers; Kafka does not move the data automatically.
In scenario "2" both partitions become unavailable and your KafkaProducer will eventually time out. Recovery then depends on whether the dead brokers come back, or whether you have other live brokers that the partitions can be reassigned to.
In scenario "3" the replicas on the dead brokers stay offline until those brokers return or are reassigned. During that time the KafkaProducer will only write to partition 2, as it is the only available partition in the topic. As soon as partition 1 has a leader and enough in-sync replicas again, the producer will start producing to both partitions.
Actually, I could think of many more scenarios. If you need a more concrete answer you need to specify
how many brokers you have,
what your replication factor actually is and
in what order the brokers go down.
Assuming the messages have no specified keys, will the producer resend the messages to partition 2?
The KafkaProducer will not re-send the data that was previously sent to partition 1 to partition 2. Whatever was written to partition 1 will stay in partition 1.
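A minimal sketch of that behaviour (topic name, broker address and timeout value are made up), sending one record to each partition of a 2-partition topic explicitly: the record aimed at the offline partition is retried and eventually fails, it is never re-routed to the healthy partition.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OfflinePartitionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "30000"); // give up on a send after 30s

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Explicitly target each partition of a hypothetical 2-partition topic.
            for (int partition = 0; partition <= 1; partition++) {
                producer.send(new ProducerRecord<>("two-partition-topic", partition, null, "hello"),
                        (metadata, exception) -> {
                            if (exception != null) {
                                // Records batched for the offline partition are retried until
                                // delivery.timeout.ms expires, then fail; they are never re-routed.
                                System.err.println("Send failed: " + exception);
                            } else {
                                System.out.println("Written to partition " + metadata.partition());
                            }
                        });
            }
            producer.flush();
        }
    }
}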

How to achieve strong consistency in Kafka?

I am trying to understand consistency maintenance in Kafka. Please see the scenario below and help me understand it.
Number of partition = 2
Replication factor = 3
Number of broker in the cluster = 4
In that case, how many nodes should acknowledge each write to achieve strong consistency? Should acks be all, 3, or some other value? Please confirm.
You might be interested in seeing the When it Absolutely, Positively, Has to be There talk from Kafka Summit.
It was given by an engineer at Cloudera, and Cloudera has its own documentation on Kafka availability.
To summarize, more than 1 replica and more than 1 in-sync replica are a good start. Then, on the producer side, if you are okay with sacrificing throughput for data durability, meaning all in-sync replicas must have the write before continuing, use acks=all. Otherwise, if you trust the leader broker to be highly available and unclean leader election is set to false, then acks=1 should be okay in most cases.
acks=3 isn't a valid config, by the way. I think you are looking for min.insync.replicas=2 and acks=all with a replication factor of 3; from the above link:
If min.insync.replicas is set to 2 and acks is set to all, each message must be written successfully to at least two replicas. This guarantees that the message is not lost unless both hosts crash
Also, as of Kafka 0.11 you can enable the idempotent producer (the building block of the transactional producer) to work towards exactly-once processing:
enable.idempotence=true
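Putting the above together, a minimal producer configuration sketch (the bootstrap address is made up; min.insync.replicas=2 still has to be set on the topic or broker side) could look like this:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConsistentProducerConfig {
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // no duplicates on retry
        // min.insync.replicas=2 is a topic/broker setting, not a producer setting;
        // it must be configured on the topic for the acks=all guarantee described above.
        return new KafkaProducer<>(props);
    }
}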
In your setting, what you have is
4 brokers
Replication factor = 3
That means each message in a given partition will be replicated to 3 out of 4 brokers, including the leader for that partition.
In order to achieve strong consistency guarantees, you have to set min.insync.replicas to 2 and use acks=all. This way, you are guaranteed that each write goes to at least 2 out of the 3 brokers which hold the data before it is acknowledged.
Setting acks to all provides the highest consistency guarantee at the expense of slower writes to the cluster.
If you use older versions of Kafka where unclean leader election is true by default, you should also consider setting it to false explicitly. This way, an out-of-sync broker won't be elected as the leader if the current leader crashes (effectively trading some availability for consistency).
Also, Kafka is a system where all reads go through the leader. This is a bit different from some other distributed systems such as ZooKeeper which support read replicas. So you do not have a situation where a client ends up reading directly from a stale broker. The leader ensures that writes are ordered, replicated to the designated number of in-sync replicas, and acknowledged according to your acks setting.
If you are looking for consistency in the sense of the ACID properties, all replicas need to acknowledge. Since you have 3 replicas, all 3 of those nodes should acknowledge the write.

How does Kafka handle network partitions?

Kafka has the concept of an in-sync replica set, which is the set of nodes that aren't too far behind the leader.
What happens if the network cleanly partitions so that a minority containing the leader is on one side, and a majority containing the other in-sync nodes on the other side?
The minority/leader-side presumably thinks that it lost a bunch of nodes, reduces the ISR size accordingly, and happily carries on.
The other side probably thinks that it lost the leader, so it elects a new one and happily carries on.
Now we have two leaders in the same cluster, accepting writes independently. In a system that requires a majority of nodes to proceed after a partition, the old leader would step down and stop accepting writes.
What happens in this situation in Kafka? Does it require majority vote to change the ISR set? If so, is there a brief data loss until the leader side detects the outages?
I haven't tested this, but I think the accepted answer is wrong and Lars Francke is correct about the possibility of split-brain.
Zookeeper quorum requires a majority, so if ZK ensemble partitions, at most one side will have a quorum.
Being a controller requires having an active session with ZK (ephemeral znode registration). If the current controller is partitioned away from ZK quorum, it should voluntarily stop considering itself a controller. This should take at most zookeeper.session.timeout.ms = 6000. Brokers still connected to ZK quorum should elect a new controller among themselves. (based on this: https://stackoverflow.com/a/52426734)
Being a topic-partition leader also requires an active session with ZK. A leader that has lost its connection to the ZK quorum should voluntarily stop being one. The elected controller will detect that some ex-leaders are missing and will assign new leaders from the replicas that are in the ISR and still connected to the ZK quorum.
Now, what happens to producer requests received by the partitioned ex-leader during ZK timeout window? There are some possibilities.
If producer's acks = all and topic's min.insync.replicas = replication.factor, then all ISR should have exactly the same data. The ex-leader will eventually reject in-progress writes and producers will retry them. The newly elected leader will not have lost any data. On the other hand it won't be able to serve any write requests until the partition heals. It will be up to producers to decide to reject client requests or keep retrying in the background for a while.
Otherwise, it is very probable that the new leader will be missing up to zookeeper.session.timeout.ms + replica.lag.time.max.ms = 16000 worth of records and they will be truncated from the ex-leader after the partition heals.
Let's say you expect network partitions that last longer than you are comfortable being read-only for.
Something like this can work:
you have 3 availability zones and expect that at most 1 zone will be partitioned from the other 2
in each zone you have a Zookeeper node (or a few), so that 2 zones combined can always form a majority
in each zone you have a bunch of Kafka brokers
each topic has replication.factor = 3, one replica in each availability zone, min.insync.replicas = 2
producers' acks = all
This way there should be two replicas on the ZK-quorum side of the network partition, at least one of them fully up to date with the ex-leader. So there is no data loss on the brokers, and the partition remains available for writes from any producers that are still able to connect to the winning side.
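As a rough sketch of that topic layout (broker ids 1, 2 and 3 are made-up stand-ins for one broker per availability zone; in practice you would normally set broker.rack on each broker and let Kafka's rack-aware assignment do the spreading), the replicas could be pinned one per zone like this:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class ZoneSpreadTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address

        try (AdminClient admin = AdminClient.create(props)) {
            // Partition 0 -> brokers 1 (zone A), 2 (zone B), 3 (zone C); broker 1 is the preferred leader.
            Map<Integer, List<Integer>> assignment = Map.of(0, List.of(1, 2, 3));
            NewTopic topic = new NewTopic("zone-spread-topic", assignment)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}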
In a Kafka cluster, one of the brokers is elected to serve as the controller.
Among other things, the controller is responsible for electing new leaders. The Replica Management section covers this briefly: http://kafka.apache.org/documentation/#design_replicamanagment
Kafka uses Zookeeper to try to ensure there's only 1 controller at a time. However, the situation you described could still happen, splitting both the Zookeeper ensemble (assuming both sides can still have quorum) and the Kafka cluster in 2, resulting in 2 controllers.
In that case, Kafka has a number of configurations to limit the impact:
unclean.leader.election.enable: False by default, this is used to prevent replicas that were not in-sync from ever becoming leaders. If no available replicas are in-sync, Kafka marks the partition as offline, preventing data loss (see the sketch after this answer for setting this on a topic)
replication.factor and min.insync.replicas: For example, if you set them to 3 and 2 respectively, in case of a "split-brain" you can prevent producers from sending records to the minority side if they use acks=all
See also KIP-101 for the details about handling logs that have diverged once the cluster is back together.
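For completeness, here is a minimal sketch (topic name and bootstrap address are made up) of setting unclean.leader.election.enable=false on an existing topic with the Java AdminClient, as referenced in the list above:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class DisableUncleanElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // made-up address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "zone-spread-topic");
            AlterConfigOp disableUnclean = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "false"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> update = Map.of(topic, List.of(disableUnclean));
            // Apply the config change without touching any other topic settings.
            admin.incrementalAlterConfigs(update).all().get();
        }
    }
}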