Raft replication under partition - consensus

A 7-member cluster, one of which is the leader.
The leader attempts to replicate a log entry (some write).
A network partition occurs, splitting the cluster into groups of 3 and 4 members.
The leader ends up in the minority partition.
The leader only reaches 2 followers → replication fails.
What happens in this situation?
As I understand it: the 2 followers have applied a "bad" write, and when the network partition heals they will overwrite that write with the majority leader's history. But this violates linearizability.
🤔

You're confusing replication with commitment. Merely replicating an entry to a minority of the cluster doesn't break linearizability. What matters is when that change is considered committed. Since the leader on the minority side of the partition is unable to replicate the change to a majority of the cluster, it will never commit the change and will never acknowledge to a client that the change has been persisted. Furthermore, the uncommitted change will never have been applied to the state machine on any node. Therefore, overwriting the uncommitted change when the partition is healed does not break any guarantees.
Guarantees would only be broken if the leader were to increase the commitIndex and acknowledge a write after replicating only to a minority of the cluster.
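To make the distinction concrete, here is a minimal, hypothetical sketch of Raft's commit rule (illustrative only, not taken from any particular implementation): a leader may only advance its commit index to an entry of its current term once a majority of the cluster stores that entry.

    // Hypothetical sketch of Raft's commit rule; names and structure are illustrative.
    final class RaftCommitRule {

        /**
         * matchIndex[i] is the highest log index known to be stored on server i
         * (the leader counts itself). logTerms[n] is the term of the entry at index n.
         * Returns the highest N > commitIndex such that a majority of servers have
         * matchIndex >= N and the entry at N is from the current term; otherwise the
         * old commitIndex is returned unchanged.
         */
        static int advanceCommitIndex(int[] matchIndex, int[] logTerms,
                                      int commitIndex, int currentTerm) {
            int clusterSize = matchIndex.length;          // e.g. 7 in the question
            for (int n = logTerms.length - 1; n > commitIndex; n--) {
                if (logTerms[n] != currentTerm) {
                    continue;                             // only commit entries from the current term
                }
                int stored = 0;
                for (int m : matchIndex) {
                    if (m >= n) {
                        stored++;
                    }
                }
                if (stored > clusterSize / 2) {           // strict majority, e.g. 4 of 7
                    return n;                             // safe to commit and apply up to n
                }
            }
            return commitIndex;                           // 3 of 7 is never a majority: nothing commits
        }
    }

With the 3/4 split from the question, the minority-side leader can count at most 3 servers (itself plus its 2 reachable followers) holding the new entry, so the majority check never passes and the entry is never committed, applied, or acknowledged.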

Related

How to achieve strong consistency in Kafka?

I'm trying to understand consistency maintenance in Kafka. Please see the scenario below and help me understand.
Number of partitions = 2
Replication factor = 3
Number of brokers in the cluster = 4
In that case, how many nodes should acknowledge a write in order to achieve strong consistency? Should it be acks=all, acks=3, or some other value?
You might be interested in the talk When it Absolutely, Positively, Has to be There from Kafka Summit.
It was given by an engineer at Cloudera, and Cloudera has its own documentation on Kafka availability.
To summarize, more than 1 replica and more than 1 in-sync replica is a good start. Then, on the producer, if you are okay with sacrificing throughput for durability, meaning all in-sync replicas must acknowledge a write before continuing, use acks=all. Otherwise, if you trust the leader broker to be highly available and unclean leader election is disabled, then acks=1 should be okay in most cases.
acks=3 isn't a valid config, by the way. I think you are looking for min.insync.replicas=2 and acks=all with a replication factor of 3; from the link above:
If min.insync.replicas is set to 2 and acks is set to all, each message must be written successfully to at least two replicas. This guarantees that the message is not lost unless both hosts crash
Also, as of Kafka 0.11, you can enable the idempotent producer to work towards exactly-once processing:
enable.idempotence=true
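For illustration, here is a minimal Java producer sketch with the settings discussed above; the broker address and topic name are placeholders, and note that min.insync.replicas is a broker/topic-level setting rather than a producer one.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DurableProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for all in-sync replicas
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // no duplicates on retry (Kafka >= 0.11)
            // acks=all only gives the guarantee discussed above when the topic/broker
            // side also has min.insync.replicas set (e.g. to 2).

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value")); // "my-topic" is a placeholder
                producer.flush();
            }
        }
    }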
In your setup, you have:
4 brokers
Replication factor = 3
That means each message in a given partition will be replicated to 3 out of 4 brokers, including the leader for that partition.
In order to achieve strong consistency guarantees, you have to set min.insync.replicas to 2 and use acks=all. This way, each write is guaranteed to reach at least 2 of the 3 brokers that hold the data before it is acknowledged.
Setting acks to all provides the highest consistency guarantee at the expense of slower writes to the cluster.
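As a sketch, the topic from the question could be created with exactly these settings (2 partitions, replication factor 3, min.insync.replicas=2) via the Java AdminClient; the broker address and topic name below are placeholders.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateConsistentTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address

            try (AdminClient admin = AdminClient.create(props)) {
                // 2 partitions, replication factor 3 (as in the question),
                // with the topic-level override min.insync.replicas=2.
                NewTopic topic = new NewTopic("my-topic", 2, (short) 3)            // "my-topic" is a placeholder
                        .configs(Collections.singletonMap("min.insync.replicas", "2"));
                admin.createTopics(Collections.singletonList(topic)).all().get();  // block until created
            }
        }
    }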
If you use older versions of Kafka where unclean leader election is true by default, you should also consider setting it to false explicitly. That way, an out-of-sync broker won't be elected as leader when the leader crashes (effectively trading availability for consistency).
Also, Kafka is a system where all reads go through the leader. This is a bit different from some other distributed systems, such as ZooKeeper, which support read replicas. So you never end up in a situation where a client reads directly from a stale broker. The leader ensures that writes are ordered, replicated to the designated number of in-sync replicas, and acknowledged according to your acks setting.
If you are looking for consistency in the ACID sense, all replicas need to acknowledge the write. Since you have 3 replicas, all 3 of those nodes would need to acknowledge.

Why is Kafka not P in the CAP theorem?

The main developer of Kafka said Kafka is CA but not P in the CAP theorem. But I'm confused: is Kafka not partition tolerant? I think it is; when one replica is down, another becomes the leader and work continues!
Also, I would like to know: if Kafka did have P, would P hurt C or A?
If you read how CAP defines C, A and P, "CA but not P" just means that when an arbitrary network partition happens, each Kafka topic-partition will either stop serving requests (lose A), or lose some data (lose C), or both, depending on its settings and the specifics of the partition.
If a network partition splits all ISRs from ZooKeeper, then with the default configuration unclean.leader.election.enable = false, no replica can be elected leader (lose A).
If at least one ISR can connect, it will be elected, so it can still serve requests (preserve A). But with the default min.insync.replicas = 1, an ISR can lag behind the leader by approximately replica.lag.time.max.ms = 10000. So by electing it, Kafka potentially throws away writes that were confirmed to producers by the ex-leader (lose C).
Kafka can preserve both A and C for some limited network partitions. E.g., you have min.insync.replicas = 2 and replication.factor = 3, all 3 replicas are in-sync when a network partition happens, and it splits off at most 1 ISR (either a single-node failure, a single-DC failure, or a single cross-DC link failure).
To preserve C for arbitrary partitions, you have to set min.insync.replicas = replication.factor. This way, no matter which ISR is elected, it is guaranteed to have the latest data. But at the same time it won't be able to serve write requests until the partition heals (lose A).
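As an illustration of that last point, here is a hedged sketch that raises min.insync.replicas to match replication.factor = 3 on an existing topic using the Java AdminClient (broker address and topic name are placeholders). After this change, any elected ISR is guaranteed to have the latest data, but writes stall whenever any replica is out of sync.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class PreferConsistencyOverAvailability {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder topic
                // For a topic with replication.factor = 3, requiring all 3 replicas in sync
                // means any elected leader has the latest data (keep C), but writes stall
                // whenever a replica is unreachable (lose A).
                AlterConfigOp setMinIsr = new AlterConfigOp(
                        new ConfigEntry("min.insync.replicas", "3"), AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(
                        Collections.singletonMap(topic, Collections.singletonList(setMinIsr)))
                     .all().get();
            }
        }
    }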
CAP Theorem states that any distributed system can provide at most two out of the three guarantees: Consistency, Availability and Partition tolerance.
According to the engineers at LinkedIn (where Kafka was originally developed), Kafka is a CA system:
All distributed systems must make trade-offs between guaranteeing
consistency, availability, and partition tolerance (CAP Theorem). Our
goal was to support replication in a Kafka cluster within a single
datacenter, where network partitioning is rare, so our design focuses
on maintaining highly available and strongly consistent replicas.
Strong consistency means that all replicas are byte-to-byte identical,
which simplifies the job of an application developer.
However, I would say that it depends on your configuration, and more precisely on the settings acks, min.insync.replicas and replication.factor. According to the docs:
If a topic is configured with only two replicas and one fails (i.e.,
only one in sync replica remains), then writes that specify acks=all
will succeed. However, these writes could be lost if the remaining
replica also fails. Although this ensures maximum availability of the
partition, this behavior may be undesirable to some users who prefer
durability over availability. Therefore, we provide two topic-level
configurations that can be used to prefer message durability over
availability:
Disable unclean leader election - if all replicas become unavailable, then the partition will remain unavailable until the most
recent leader becomes available again. This effectively prefers
unavailability over the risk of message loss. See the previous section
on Unclean Leader Election for clarification.
Specify a minimum ISR size - the partition will only accept writes if the size of the ISR is above a certain minimum, in order to prevent
the loss of messages that were written to just a single replica, which
subsequently becomes unavailable. This setting only takes effect if
the producer uses acks=all and guarantees that the message will be
acknowledged by at least this many in-sync replicas. This setting
offers a trade-off between consistency and availability. A higher
setting for minimum ISR size guarantees better consistency since the
message is guaranteed to be written to more replicas which reduces the
probability that it will be lost. However, it reduces availability
since the partition will be unavailable for writes if the number of
in-sync replicas drops below the minimum threshold.
CAP is a proven theorem, so no distributed system can provide C, A and P all at once during a failure. If Kafka chose P, meaning the cluster keeps functioning even when it is split into two or more isolated parts, then one of C or A would have to be sacrificed.
Also, if we consider the Kafka and ZooKeeper nodes as a whole cluster, because Kafka needs ZooKeeper, we cannot consider it partition tolerant when the connection to the ZooKeeper nodes is lost.

How does Kafka handle network partitions?

Kafka has the concept of an in-sync replica set, which is the set of nodes that aren't too far behind the leader.
What happens if the network cleanly partitions so that a minority containing the leader is on one side, and a majority containing the other in-sync nodes on the other side?
The minority/leader-side presumably thinks that it lost a bunch of nodes, reduces the ISR size accordingly, and happily carries on.
The other side probably thinks that it lost the leader, so it elects a new one and happily carries on.
Now we have two leaders in the same cluster, accepting writes independently. In a system that requires a majority of nodes to proceed after a partition, the old leader would step down and stop accepting writes.
What happens in this situation in Kafka? Does it require majority vote to change the ISR set? If so, is there a brief data loss until the leader side detects the outages?
I haven't tested this, but I think the accepted answer is wrong and Lars Francke is correct about the possibility of split-brain.
Zookeeper quorum requires a majority, so if ZK ensemble partitions, at most one side will have a quorum.
Being a controller requires having an active session with ZK (ephemeral znode registration). If the current controller is partitioned away from ZK quorum, it should voluntarily stop considering itself a controller. This should take at most zookeeper.session.timeout.ms = 6000. Brokers still connected to ZK quorum should elect a new controller among themselves. (based on this: https://stackoverflow.com/a/52426734)
Being a topic-partition leader also requires an active session with ZK. A leader that has lost its connection to the ZK quorum should voluntarily stop being one. The elected controller will detect that some ex-leaders are missing and will assign new leaders from the replicas that are in the ISR and still connected to the ZK quorum.
Now, what happens to producer requests received by the partitioned ex-leader during the ZK timeout window? There are some possibilities.
If producer's acks = all and topic's min.insync.replicas = replication.factor, then all ISR should have exactly the same data. The ex-leader will eventually reject in-progress writes and producers will retry them. The newly elected leader will not have lost any data. On the other hand it won't be able to serve any write requests until the partition heals. It will be up to producers to decide to reject client requests or keep retrying in the background for a while.
Otherwise, it is very probable that the new leader will be missing up to zookeeper.session.timeout.ms + replica.lag.time.max.ms = 16000 worth of records and they will be truncated from the ex-leader after the partition heals.
Let's say you expect network partitions to last longer than you are comfortable being read-only.
Something like this can work:
you have 3 availability zones and expect that at most 1 zone will be partitioned from the other 2
in each zone you have a Zookeeper node (or a few), so that 2 zones combined can always form a majority
in each zone you have a bunch of Kafka brokers
each topic has replication.factor = 3, one replica in each availability zone, min.insync.replicas = 2
producers' acks = all
This way there should be two Kafka ISRs on the ZK-quorum side of the network partition, at least one of them fully up to date with the ex-leader. So there is no data loss on the brokers, and writes remain available to any producers that can still connect to the winning side.
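If you want to check how such a deployment is holding up (for example, which replicas are currently in the ISR for each partition), a small AdminClient sketch like the following can print the leader and ISR; the broker address and topic name are placeholders.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class InspectIsr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription description = admin
                        .describeTopics(Collections.singletonList("my-topic"))    // "my-topic" is a placeholder
                        .all().get()
                        .get("my-topic");
                for (TopicPartitionInfo partition : description.partitions()) {
                    // During a network partition you would expect the ISR to shrink to the
                    // replicas that can still reach the controller/ZK-quorum side.
                    System.out.printf("partition %d leader=%s isr=%s%n",
                            partition.partition(), partition.leader(), partition.isr());
                }
            }
        }
    }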
In a Kafka cluster, one of the brokers is elected to serve as the controller.
Among other things, the controller is responsible for electing new leaders. The Replica Management section covers this briefly: http://kafka.apache.org/documentation/#design_replicamanagment
Kafka uses Zookeeper to try to ensure there's only 1 controller at a time. However, the situation you described could still happen, splitting both the Zookeeper ensemble (assuming both sides can still have quorum) and the Kafka cluster in 2, resulting in 2 controllers.
In that case, Kafka has a number of configurations to limit the impact:
unclean.leader.election.enable: false by default, this is used to prevent replicas that were not in-sync from ever becoming leaders. If no available replicas are in-sync, Kafka marks the partition as offline, preventing data loss
replication.factor and min.insync.replicas: For example, if you set them to 3 and 2 respectively, in case of a "split-brain" you can prevent producers from sending records to the minority side if they use acks=all
See also KIP-101 for the details about handling logs that have diverged once the cluster is back together.

What happens to a partition leader whose connection to ZK breaks?

Does it stop acting as the leader (i.e. stop serving produce and fetch
requests) returning the "not a leader for partition" exception? Or
does it keep thinking it's the leader?
If it's the latter, any connected consumers that wait for new requests
on that replica will do so in vain. Since the cluster controller will
elect a new partition leader, this particular replica will become
stale.
I would expect this node to do the former, but I'd like to check to
make sure. (I understand it's an edge case, and maybe not a realistic
one at that, but still.)
According to the documentation, more specifically the Distribution section:
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and
write requests for the partition while the followers passively
replicate the leader. If the leader fails, one of the followers will
automatically become the new leader. Each server acts as a leader
for some of its partitions and a follower for others so load is well
balanced within the cluster.
Considering that a loss of connection is one of the many kinds of failure, I'd say that your first hypothesis is more likely to happen.

How does Kafka guarantee consistency and availability?

I learned from http://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
Our goal was to support replication in a Kafka cluster within a single datacenter, where network partitioning is rare
In a distributed system, I think partitioning is fundamental, so I don't understand how Kafka can guarantee availability without partition tolerance when a server node fails. Or am I missing something?
I think you may be confusing the sharding sense of "partitioning" with network partitions.
Kafka does indeed provide sharding and replication. Kafka elects a unique leader for each partition of each topic, and all writes for a topic partition go through that leader. This is relevant to the documentation you cited, which indicates that Kafka favors availability over partition tolerance.
What is meant by a network partition is a break in communication between servers. Network communication failures are rarer in a LAN than in a WAN, so Kafka was architected to provide consistency except in cases where a network partition occurs. In the event of a network partition, Kafka's replicas may diverge from one another, with nodes on both sides of the partition potentially accepting writes. This can happen because, during a partition, nodes on each side may perceive the nodes on the other side as having failed when in fact only the link between them has failed. Each side may therefore elect a new leader for some topic partitions, and so each side can accept writes for those partitions. Once the network partition heals (the network is fixed), writes made on one side may overwrite writes made on the other.