How ZooKeeper provides sequential consistency - apache-zookeeper

Elsewhere, someone said:
"even if you read from a different follower every time, you'll never
see version 3 of the data after seeing version 4."
So suppose I have a 3-node ZooKeeper ensemble as below:
zk0 -- leader
zk1
zk2
Assume the ensemble holds the value "3" and my client is connected to zk1. The client sends a write request (update "3" to "4"); zk0 (the leader) writes the value and subsequently receives the acknowledgement from zk1. My client can see the new value ("4") because it is connected to zk1.
Now my question: if I switch my client from zk1 to zk2 (the leader hasn't received a write acknowledgement from zk2, so zk2 is behind the quorum), I will see the value "3" rather than "4". Does that break sequential consistency?

ZooKeeper uses a special atomic messaging protocol called ZooKeeper Atomic
Broadcast (ZAB), which ensures that the local replicas in the ensemble (the group of ZooKeeper servers) never diverge.
The ZAB protocol is atomic, so it guarantees that each update either succeeds or fails as a whole.
In ZooKeeper, every write goes through the leader, which generates a transaction id (called a zxid) and assigns it to the write request.
A zxid is a long (64-bit) integer split into two parts:
epoch
counter
The zxid represents the order in which the writes are applied on all replicas.
The epoch represents the changes in leadership over time. Epochs refer to the period during which a given server exercised leadership. During an epoch, a leader broadcasts proposals and identifies each one according to the counter.
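A write is considered successful once the leader has received acks from a majority of the ensemble.
The zxid is used to keep servers synchronized and to avoid the conflict you described: the client remembers the last zxid it has seen, and a server whose state is older than that zxid refuses the client's connection, so the client has to try another server.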
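As a rough illustration of the zxid layout and of why the lagging zk2 cannot serve the client in the question, here is a minimal Python sketch. It assumes the epoch occupies the high 32 bits and the counter the low 32 bits; the function names, including `server_can_serve`, are hypothetical and not part of any ZooKeeper API.

```python
from typing import Tuple

COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into one zxid."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Return (epoch, counter) for a zxid."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


def server_can_serve(server_last_zxid: int, client_last_seen_zxid: int) -> bool:
    """Hypothetical check: a server may serve a client only if its state is
    at least as recent as the newest state the client has already seen."""
    return server_last_zxid >= client_last_seen_zxid


# Scenario from the question: the client saw the write that produced "4" via zk1.
seen_by_client = make_zxid(epoch=1, counter=4)
zk2_state = make_zxid(epoch=1, counter=3)  # zk2 has not applied that write yet

print(server_can_serve(zk2_state, seen_by_client))  # False
```

Because the epoch sits in the high bits, comparing two zxids as plain integers orders them by epoch first and by counter second.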

Related

If the partition a Kafka producer tries to send messages to goes offline, can the producer try to send to a different partition?

My Kafka cluster has 5 brokers and the replication factor is 3 for topics. At some time some partitions went offline but eventually they went back online. My questions are:
How many brokers must have been down, given that there were offline partitions? I think that, with the cluster setup above, I can afford to lose 2 brokers at the same time. However, if 2 brokers are down, some partitions no longer have a quorum; will these partitions go offline in this case?
If there are offline partitions, and a Kafka producer tries to send messages to them and fails, will the producer try a different partition that may be online? The messages have no key in them.
I am not sure I understood your question correctly, but I have the impression that you are mixing up partitions and replicas. At the very least, your question cannot be looked at in isolation on the producer side. As soon as one broker goes down, several things happen on the cluster.
Each TopicPartition has one partition leader, and your clients (e.g. producer and consumer) communicate only with this one leader, independent of the number of replicas.
In the case where two out of five brokers are unavailable, Kafka will move partition leadership to a healthy broker that holds an in-sync replica. In that scenario you should therefore not get into trouble, although it might take some time and some retries until the new leader is elected. Leader election can be fast because you have set the replication factor to three, so even if two brokers go down, one broker should still have the complete data (assuming all replicas were in sync). Note that Kafka does not by itself create replacement replicas on the remaining brokers: the partition stays under-replicated until the failed brokers return or the replicas are reassigned, and copying the data can take some time depending on its amount. For that scenario you need to look into the topic-level configuration min.insync.replicas and the KafkaProducer configuration acks (see below).
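To make the "how many brokers can I lose" reasoning concrete, here is a small Python sketch. It assumes unclean leader election is disabled and that all three replicas were in sync before the failure; the broker ids and the helper name `partition_is_offline` are made up for illustration.

```python
def partition_is_offline(in_sync_replicas, down_brokers):
    """A partition has no eligible leader once every in-sync replica is down."""
    return all(broker in down_brokers for broker in in_sync_replicas)


# One partition with replication factor 3, hosted on brokers 1, 2 and 3
# of a five-broker cluster (broker ids are illustrative).
isr = {1, 2, 3}

print(partition_is_offline(isr, {1, 2}))     # False
print(partition_is_offline(isr, {1, 2, 3}))  # True
```

Under these assumptions, losing two brokers leaves one eligible leader, and the partition only goes offline when all three brokers holding its replicas are down at once.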
I think the following are the most important configurations for your KafkaProducer to handle such situation:
bootstrap.servers: If you anticipate regular connection problems with your brokers, you should make sure to list all five of them. Although it is sufficient to mention only one address (as that broker will then communicate with all other brokers in the cluster), it is safer to list them all in case one or even two brokers are unavailable.
acks: This defaults to 1 (all since Kafka 3.0) and defines the number of acknowledgments the producer requires the partition leader to have received before considering a request successful. Possible values are 0, 1 and all.
retries: This value defaults to 2147483647 and causes the client to resend any record whose send fails with a potentially transient error until delivery.timeout.ms is reached.
delivery.timeout.ms: An upper bound on the time to report success or failure after a call to send() returns. This limits the total time that a record will be delayed prior to sending, the time to await acknowledgement from the broker (if expected), and the time allowed for retriable send failures. The producer may report failure to send a record earlier than this config if either an unrecoverable error is encountered, the retries have been exhausted, or the record is added to a batch which reached an earlier delivery expiration deadline. The value of this config should be greater than or equal to the sum of request.timeout.ms and linger.ms.
You will find more details in the documentation on the Producer configs.
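The settings above could be collected roughly as follows. This is only a sketch: the broker addresses are placeholders, the timeout values are illustrative rather than recommendations, and `delivery_timeout_is_consistent` is a hypothetical helper that checks the rule stated above for delivery.timeout.ms.

```python
# Illustrative producer settings keyed by the documented config names.
# Host names are placeholders, not real brokers.
producer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092,broker4:9092,broker5:9092",
    "acks": "all",
    "retries": 2147483647,
    "delivery.timeout.ms": 120000,
    "request.timeout.ms": 30000,
    "linger.ms": 5,
}


def delivery_timeout_is_consistent(config):
    """delivery.timeout.ms should be >= request.timeout.ms + linger.ms."""
    return config["delivery.timeout.ms"] >= (
        config["request.timeout.ms"] + config["linger.ms"]
    )


print(delivery_timeout_is_consistent(producer_config))  # True
```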

A kafka record is acknowledged but no data returned to consumer

There is a Kafka (version 2.2.0) cluster of 3 nodes. One node becomes artificially unavailable (network disconnection). Then we have the following behaviour:
We send a record to a producer with the given topic-partition (to the specific partition, let's say #0).
We receive record metadata from the producer that confirms the record has been acknowledged.
Immediately after that we poll a consumer assigned to the same topic-partition and an offset taken from the record's metadata. The poll timeout was set to 30 seconds. No data is returned (an empty set is returned).
This happens inconsistently from time to time (under described circumstances with one Kafka node failure).
Essentially my question is: should data be immediately available to consumers once it is acknowledged? If not, what is a reasonable timeout?
UPD: some configuration details:
number of partitions for the topic: 1
default replication factor: 3
sync replication factor: 2
acks for producer: all
The default setting of acks on the producer is 1. This means the producer waits for the acknowledgement from the leader replica only. If the leader dies right after acknowledging, the message can be lost.
Should data be immediately available to consumers? Yes. With default settings and without load, the lag should be very small, effectively in the millisecond range.
If you want to make sure that a message can't be lost, you have to configure the producer with "acks=all" in addition to min.insync.replicas=2. This makes sure that all in-sync replicas acknowledge the message, and that at least 2 nodes do. So you can still lose one node and be fine. Lose 2 nodes and you won't be able to send, but even then messages won't be lost.
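The interplay of acks and min.insync.replicas can be sketched as a small decision function. This is a simplified model, not broker code: `produce_outcome` is a hypothetical name, and the rejection case corresponds to the situation in which a real broker would answer with a "not enough replicas" error.

```python
def produce_outcome(acks, min_insync_replicas, live_in_sync_replicas):
    """Simplified model of whether a produce request is accepted."""
    if live_in_sync_replicas == 0:
        return "offline"   # no replica left that could act as leader
    if acks == "all" and live_in_sync_replicas < min_insync_replicas:
        return "rejected"  # too few in-sync replicas to satisfy acks=all
    return "accepted"


# Replication factor 3, min.insync.replicas=2, producer uses acks=all.
for live in (3, 2, 1):
    print(live, produce_outcome("all", 2, live))
```

With three or two live in-sync replicas the write is accepted; with only one it is rejected, which matches "lose 2 nodes and you won't be able to send".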

zookeeper write failure handling

I have a question about how ZooKeeper handles write failures. Let us assume there are 3 nodes and a write succeeds on 1 but fails on the other 2. I know ZooKeeper will return an error, but what happens to the successful write on the one node? Is it rolled back, or is the change persisted with the expectation that it will eventually be replicated to the other nodes?
ZooKeeper uses an atomic messaging system. It is explained very nicely in the following article:
ZooKeeper uses a variation of the two-phase-commit protocol for replicating transactions to followers. When a leader receives a change update from a client, it generates a transaction with sequence number c and the leader’s epoch e and sends the transaction to all followers. A follower adds the transaction to its history queue and sends an ACK to the leader. When a leader receives ACKs from a quorum, it sends the quorum a COMMIT for that transaction. A follower that accepts a COMMIT will commit this transaction unless c is higher than any sequence number in its history queue. It will wait to receive COMMITs for all its earlier (outstanding) transactions before committing.
The official documentation can also be very useful.
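The propose/ACK/COMMIT flow quoted above can be modeled in a few lines of Python. This is a toy simulation under simplifying assumptions (synchronous calls, no recovery or leader election, no ordering of outstanding commits); the classes `Server` and `Leader` are invented for illustration and do not mirror ZooKeeper's implementation.

```python
class Server:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.history = []    # proposals logged but not yet committed
        self.committed = []  # transactions visible to clients

    def propose(self, zxid, value):
        """Log the proposal and ACK it; an unreachable server never answers."""
        if not self.reachable:
            return False
        self.history.append((zxid, value))
        return True

    def commit(self, zxid):
        """Move the proposal with this zxid from the history to the committed log."""
        for entry in self.history:
            if entry[0] == zxid:
                self.history.remove(entry)
                self.committed.append(entry)
                return


class Leader(Server):
    def __init__(self, name, followers, epoch=1):
        super().__init__(name)
        self.followers = followers
        self.epoch = epoch
        self.counter = 0

    def write(self, value):
        self.counter += 1
        zxid = (self.epoch, self.counter)
        self.propose(zxid, value)  # the leader logs its own proposal
        acked = [f for f in self.followers if f.propose(zxid, value)]
        quorum = (len(self.followers) + 1) // 2 + 1  # majority of the ensemble
        if 1 + len(acked) < quorum:
            return False  # no quorum: the proposal is never committed
        for server in [self] + acked:
            server.commit(zxid)
        return True


# The case from the question: 3 nodes, only the leader itself logs the write.
leader = Leader("zk0", [Server("zk1", reachable=False), Server("zk2", reachable=False)])
print(leader.write("4"), leader.committed)
```

In this model the lone proposal stays in the leader's history and is never committed, so no client can read it. What happens to such a leftover proposal is decided during recovery, which the sketch does not model.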

How does kafka handle network partitions?

Kafka has the concept of an in-sync replica set, which is the set of nodes that aren't too far behind the leader.
What happens if the network cleanly partitions so that a minority containing the leader is on one side, and a majority containing the other in-sync nodes on the other side?
The minority/leader-side presumably thinks that it lost a bunch of nodes, reduces the ISR size accordingly, and happily carries on.
The other side probably thinks that it lost the leader, so it elects a new one and happily carries on.
Now we have two leaders in the same cluster, accepting writes independently. In a system that requires a majority of nodes to proceed after a partition, the old leader would step down and stop accepting writes.
What happens in this situation in Kafka? Does it require majority vote to change the ISR set? If so, is there a brief data loss until the leader side detects the outages?
I haven't tested this, but I think the accepted answer is wrong and Lars Francke is correct about the possibility of split-brain.
Zookeeper quorum requires a majority, so if ZK ensemble partitions, at most one side will have a quorum.
Being a controller requires having an active session with ZK (ephemeral znode registration). If the current controller is partitioned away from ZK quorum, it should voluntarily stop considering itself a controller. This should take at most zookeeper.session.timeout.ms = 6000. Brokers still connected to ZK quorum should elect a new controller among themselves. (based on this: https://stackoverflow.com/a/52426734)
Being a topic-partition leader also requires an active session with ZK. A leader that has lost its connection to the ZK quorum should voluntarily stop being one. The elected controller will detect that some ex-leaders are missing and will assign new leaders from the replicas that are in the ISR and still connected to the ZK quorum.
Now, what happens to producer requests received by the partitioned ex-leader during ZK timeout window? There are some possibilities.
If producer's acks = all and topic's min.insync.replicas = replication.factor, then all ISR should have exactly the same data. The ex-leader will eventually reject in-progress writes and producers will retry them. The newly elected leader will not have lost any data. On the other hand it won't be able to serve any write requests until the partition heals. It will be up to producers to decide to reject client requests or keep retrying in the background for a while.
Otherwise, it is very probable that the new leader will be missing up to zookeeper.session.timeout.ms + replica.lag.time.max.ms = 16000 ms worth of records, and they will be truncated from the ex-leader after the partition heals.
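The 16000 figure is just the sum of the two timeouts. A tiny worked example, assuming the 6000 ms session timeout quoted above and a replica.lag.time.max.ms of 10000 ms (the value implied by the total; actual defaults depend on the Kafka version and on your configuration):

```python
zookeeper_session_timeout_ms = 6000  # value quoted in the answer
replica_lag_time_max_ms = 10000      # assumed; implied by the 16000 total

# Upper bound on how much recent data the new leader may be missing.
worst_case_window_ms = zookeeper_session_timeout_ms + replica_lag_time_max_ms
print(worst_case_window_ms)  # 16000
```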
Let's say you expect longer network partitions than you are comfortable with being read-only.
Something like this can work:
you have 3 availability zones and expect that at most 1 zone will be partitioned from the other 2
in each zone you have a Zookeeper node (or a few), so that 2 zones combined can always form a majority
in each zone you have a bunch of Kafka brokers
each topic has replication.factor = 3, one replica in each availability zone, min.insync.replicas = 2
producers' acks = all
This way there should be two in-sync Kafka replicas on the ZK quorum side of the network partition, at least one of them fully up to date with the ex-leader. So there is no data loss on the brokers, and the partition stays available for writes from any producers that are still able to connect to the winning side.
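The three-zone layout above can be checked with a short Python sketch. It assumes one ZooKeeper node (or an equal number of nodes) and exactly one replica per zone, and it ignores which replicas were actually in sync at the moment of the split; `side_status` is a hypothetical helper.

```python
def side_status(zones_on_side, total_zones=3, min_insync_replicas=2):
    """Describe one side of a network partition, with one ZK node and one replica per zone."""
    zk_quorum = zones_on_side > total_zones // 2  # strict majority of ZK nodes
    live_replicas = zones_on_side                 # one replica per zone
    accepts_writes = zk_quorum and live_replicas >= min_insync_replicas
    return {"zk_quorum": zk_quorum, "accepts_acks_all_writes": accepts_writes}


print(side_status(2))  # the side holding two zones
print(side_status(1))  # the isolated zone
```

Only the side with two zones keeps both the ZK quorum and enough replicas for acks=all writes; the isolated zone has neither.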
In a Kafka cluster, one of the brokers is elected to serve as the controller.
Among other things, the controller is responsible for electing new leaders. The Replica Management section covers this briefly: http://kafka.apache.org/documentation/#design_replicamanagment
Kafka uses ZooKeeper to try to ensure there's only 1 controller at a time. However, the situation you described could still happen, splitting both the ZooKeeper ensemble (assuming both sides can still have quorum) and the Kafka cluster in 2, resulting in 2 controllers.
In that case, Kafka has a number of configurations to limit the impact:
unclean.leader.election.enable: False by default, this is used to prevent replicas that were not in-sync to ever become leaders. If no available replicas are in-sync, Kafka marks the partition as offline, preventing data loss
replication.factor and min.insync.replicas: For example, if you set them to 3 and 2 respectively, in case of a "split-brain" you can prevent producers from sending records to the minority side if they use acks=all
See also KIP-101 for the details about handling logs that have diverged once the cluster is back together.

What happens to a partition leader whose connection to ZK breaks?

Does it stop acting as the leader (i.e. stop serving produce and fetch
requests) returning the "not a leader for partition" exception? Or
does it keep thinking it's the leader?
If it's the latter, any connected consumers that wait for new requests
on that replica will do so in vain. Since the cluster controller will
elect a new partition leader, this particular replica will become
stale.
I would expect this node to do the former, but I'd like to check to
make sure. (I understand it's an edge case, and maybe not a realistic
one at that, but still.)
According to the documentation, more specifically the Distribution section:
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and
write requests for the partition while the followers passively
replicate the leader. If the leader fails, one of the followers will
automatically become the new leader. Each server acts as a leader
for some of its partitions and a follower for others so load is well
balanced within the cluster.
Considering that a loss of connection is one of the many kinds of failure, I'd say that your first hypothesis is more likely to happen.