Detail of leader retry when a follower fails in Raft consensus

I have read the Raft paper, and I am confused by the following passage:
If followers crash or run slowly, or if network packets are lost, the
leader retries AppendEntries RPCs indefinitely (even after it has
responded to the client) until all followers eventually store all log
entries.
which appears at the beginning of Section 5.3 (Log replication).
To make my confusion clearer, I have split it into two questions.
Question 1. Should the leader retry in all three of the failure situations below?
Reply false if term < currentTerm (in Figure 2)
Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm (in Figure 2)
RPC error or timeout
Question 2. If the leader should retry in some of these situations, will the leader block until all followers reply with success?
Below is my attempt:
In the first failure case, there is no need for the leader to retry.
In the second failure case, the leader should retry, adjusting the follower's nextIndex until the follower replies with success. The leader will also be blocked before accepting the next client request.
In the third failure case, there is no need for the leader to retry; we can append the failed entry with the next client request.

The quote only says that the leader will retry when a follower doesn't give a response at all, for whatever reason (e.g. a follower issue or a network issue).
In scenarios 1 and 2 that you listed under Question 1, the followers do give a rejection response to the leader, so those are different from your initial confusion.
Now, to answer what will happen in these scenarios:
If the leader's current term X < the follower's current term Y, the leader steps down to follower mode, because there must be another leader for term Y.
The follower rejects the AppendEntries RPC because it doesn't contain a log entry at the previous index. In that case the leader forces the follower to duplicate the leader's log before appending new entries; the mechanism is described in the section on handling log inconsistencies.
The answer to Question 2 is simple: the leader can proceed and apply the entry to the state machine as soon as a majority has returned success. Retrying the remaining followers is for data consistency; it doesn't slow down the system.
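
To make the three cases concrete, here is a minimal, hypothetical sketch of a leader's per-follower replication loop. The names (LeaderReplicationSketch, AppendResult, sendAppendEntries, stepDown, maybeAdvanceCommitIndex) are invented for illustration and not taken from the paper or any real implementation; it only shows the shape of the logic: a missing response is simply retried, a higher term makes the leader step down, a rejection backs nextIndex up, and commit waits only for a majority.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a Raft leader's per-follower replication loop.
public class LeaderReplicationSketch {

    static class AppendResult {
        final boolean success;
        final long followerTerm;
        AppendResult(boolean success, long followerTerm) {
            this.success = success;
            this.followerTerm = followerTerm;
        }
    }

    long currentTerm = 1;
    volatile boolean isLeader = true;
    final Map<String, Long> nextIndex = new HashMap<>();
    final Map<String, Long> matchIndex = new HashMap<>();

    void replicateTo(String follower, long lastLogIndex) {
        nextIndex.putIfAbsent(follower, lastLogIndex);
        matchIndex.putIfAbsent(follower, 0L);

        while (isLeader && matchIndex.get(follower) < lastLogIndex) {
            AppendResult r = sendAppendEntries(follower, nextIndex.get(follower));

            if (r == null) {
                continue;                         // case 3: RPC error/timeout -> retry the same RPC
            }
            if (r.followerTerm > currentTerm) {
                stepDown(r.followerTerm);         // case 1: follower has a higher term -> stop, become follower
                return;
            }
            if (!r.success) {
                // case 2: log inconsistency -> back nextIndex up by one and retry
                nextIndex.put(follower, nextIndex.get(follower) - 1);
            } else {
                matchIndex.put(follower, lastLogIndex);
                nextIndex.put(follower, lastLogIndex + 1);
                maybeAdvanceCommitIndex();        // commit as soon as a majority matches;
                                                  // lagging followers keep being retried in the background
            }
        }
    }

    // Stubs standing in for real RPC and commit plumbing.
    AppendResult sendAppendEntries(String follower, long fromIndex) { return null; }
    void stepDown(long newTerm) { currentTerm = newTerm; isLeader = false; }
    void maybeAdvanceCommitIndex() { /* find the highest index replicated on a majority */ }
}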

Related

ZooKeeper write failure handling

I have a question about how ZooKeeper handles write failures. Let us assume there are 3 nodes, and a write succeeds on 1 but fails on 2. I know ZooKeeper will return an error, but what happens to the successful write on one node? Is it rolled back, or are the changes persisted with the expectation of being replicated to the other nodes eventually?
ZooKeeper uses an atomic messaging system. It's very nicely explained in the following article:
ZooKeeper uses a variation of the two-phase-commit protocol for replicating transactions to followers. When the leader receives a change update from a client, it generates a transaction with sequence number c and the leader's epoch e and sends the transaction to all followers. A follower adds the transaction to its history queue and sends an ACK to the leader. When the leader receives ACKs from a quorum, it sends the quorum a COMMIT for that transaction. A follower that accepts a COMMIT will commit the transaction, unless c is higher than some sequence number in its history queue; in that case it will wait to receive COMMITs for all its earlier (outstanding) transactions before committing.
The official documentation can also be very useful.
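
As a rough illustration of the proposal/ACK/COMMIT flow in the quote, here is a hypothetical sketch. The class and method names (ZabLeaderSketch, sendProposal, sendCommit) are invented for illustration and are not ZooKeeper's actual internals; it only shows that COMMIT is broadcast once a quorum of ACKs has been collected.

import java.util.List;

// Hypothetical sketch of the quorum-ACK flow described in the quote above.
public class ZabLeaderSketch {

    static class Txn {
        final long zxid;   // epoch e in the high bits, counter c in the low bits
        int acks = 1;      // the leader's own copy counts as one vote
        Txn(long zxid) { this.zxid = zxid; }
    }

    final int ensembleSize = 3;
    final List<String> followers = List.of("follower-1", "follower-2");

    void propose(Txn txn) {
        for (String f : followers) {
            sendProposal(f, txn);                    // each follower appends it to its history queue and ACKs
        }
    }

    // Called whenever a follower ACKs the proposal.
    void onAck(Txn txn) {
        txn.acks++;
        if (txn.acks == ensembleSize / 2 + 1) {      // quorum reached (2 of 3 here)
            for (String f : followers) {
                sendCommit(f, txn.zxid);             // followers commit in zxid order, waiting for
            }                                        // COMMITs of earlier outstanding transactions first
        }
    }

    void sendProposal(String follower, Txn txn) { /* network stub */ }
    void sendCommit(String follower, long zxid) { /* network stub */ }
}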

Kafka Real-Time guarantees

Can Kafka guarantee that a consumer sees a message x ms after it has been (successfully) produced?
Background:
I have a system, where service A accepts requests. Service B needs to be able to answer how many requests have been coming in by a certain time. Service B needs to be precise. My plan is:
Service A accepts a request, produces a message, and waits for the ack from at least one replica. Once it has the ack, it tells the user that the request is "in the system".
When Service B is asked, I wait x ms and then check the topic for new requests, so I know with certainty the state of Service A at "now() - x ms".
In this case, Kafka needs to guarantee that I can consume a message at most x ms after it has been produced. Is that the case?
In Kafka, messages are available for consumption once the high watermark has been incremented. This is guaranteed to happen after the minimum number of in-sync replicas has been satisfied. The trade-off here is durability versus latency: if you require lower latency, acknowledging only at the leader is faster, but it isn't as durable as waiting for two in-sync replicas. When answering "when can I consume?", you are really answering "when has the message been acknowledged as written by Kafka?"
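
For the producer side of the plan described in the question, a minimal sketch might look like the following. The broker address localhost:9092 and the topic name "requests" are assumptions; acks=all waits for the in-sync replicas, while acks=1 would return after the leader alone has the record.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class AckedProduceSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");  // wait for the in-sync replicas, trading latency for durability

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("requests", "request-id", "payload");
            // Blocking on the future is the point at which Kafka has acknowledged the write;
            // only tell the user the request is "in the system" after this returns.
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("acked: partition %d, offset %d%n", meta.partition(), meta.offset());
        }
    }
}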

How ZooKeeper provides sequential consistency

Here, someone said:
"even if you read from a different follower every time, you'll never
see version 3 of the data after seeing version 4."
So suppose I have a 3-node ZooKeeper quorum as below:
zk0 -- leader
zk1
zk2
Assume the quorum holds the value "3" and I have a client connected to zk1. My client sends a write request (update "3" to "4"); zk0 (the leader) writes the value and subsequently receives the confirmation from zk1. My client can see the new value ("4") because it is connected to zk1.
Now my question is: if I switch my client from zk1 to zk2 (the leader hasn't received the write confirmation from zk2, so zk2 is behind the quorum), I will see the value as "3" rather than "4". Does that break sequential consistency?
ZooKeeper uses a special atomic messaging protocol called ZooKeeper Atomic Broadcast (ZAB), which ensures that the local replicas in the ensemble (the group of ZooKeeper servers) never diverge.
The ZAB protocol is atomic, so it guarantees that updates either succeed or fail as a whole.
In ZooKeeper every write goes through the leader; the leader generates a transaction id (called a zxid) and assigns it to the write request.
A zxid is a long (64-bit) integer split into two parts:
epoch
counter
The zxid represents the order in which the writes are applied on all replicas.
The epoch represents changes in leadership over time: each epoch refers to the period during which a given server exercised leadership. During an epoch, the leader broadcasts proposals and identifies each one with the counter.
A write is considered successful if the leader receives the ack from the majority.
The zxid is used to keep servers synchronized and avoid the conflict you described.
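
To make the epoch/counter split concrete, here is a small sketch assuming the usual packing of the epoch into the high 32 bits and the counter into the low 32 bits of the 64-bit zxid.

// Sketch of how a 64-bit zxid decomposes into epoch (high 32 bits) and counter (low 32 bits).
public class ZxidSketch {
    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xFFFFFFFFL; }
    static long zxid(long epoch, long counter) { return (epoch << 32) | counter; }

    public static void main(String[] args) {
        long z = zxid(3, 42);                                 // 42nd proposal of epoch 3
        System.out.println(Long.toHexString(z));              // 30000002a
        System.out.println(epochOf(z) + " " + counterOf(z));  // 3 42
        // Comparing zxids numerically orders writes: first by epoch, then by counter,
        // which is how replicas stay in the same order and avoid the conflict described above.
    }
}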

Kafka broker message loss scenario on leadership change

I am trying to understand the following message-loss behavior in Kafka. Briefly: one broker dies early on, and subsequently, after some message processing, all the other brokers die. When the broker that died first starts up again, it does not catch up with the other brokers after they come up; instead, all the other brokers report errors and reset their offsets to match the first broker. Is this behavior expected, and what are the changes/settings to ensure zero message loss?
Kafka version: 2.11-0.10.2.0
Reproducible steps
Started 1 ZooKeeper instance and 3 Kafka brokers
Created one topic with a replication factor of 3 and 3 partitions
Attached a kafka-console-consumer to the topic
Used kafka-console-producer to produce 2 messages
Killed two brokers (1 & 2)
Sent two messages
Killed the last remaining broker (0)
Brought up broker 1, which had not seen the last two messages
Brought up broker 0, which had seen the last two messages; it shows an error:
[2017-06-16 14:45:20,239] INFO Truncating log my-second-topic-1 to offset 1. (kafka.log.Log)
[2017-06-16 14:45:20,253] ERROR [ReplicaFetcherThread-0-1], Current offset 2 for partition [my-second-topic,1] out of range; reset offset to 1 (kafka.server.ReplicaFetcherThread)
Finally, connected the kafka-console-consumer: it sees two messages instead of the four that were published.
The answer is in the producer configuration documentation (https://kafka.apache.org/documentation/#producerconfigs):
The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed:
acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1.
acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost.
acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.
By default acks=1, so set it to 'all':
acks=all in your producer.properties file.
Also check whether unclean.leader.election.enable is set to true; if so, set it to false so that only replicas that are in sync can become the leader. If an out-of-sync replica is allowed to become the leader, messages can get truncated and lost.
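
For reference, a minimal sketch of where those two settings live, assuming the stock file names that ship with Kafka (note that unclean.leader.election.enable is a broker- or topic-level setting, not a producer setting):

# producer.properties (producer side): wait for the in-sync replicas to acknowledge
acks=all

# server.properties (broker side): never let an out-of-sync replica take over leadership
unclean.leader.election.enable=false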

What happens to a partition leader whose connection to ZK breaks?

Does it stop acting as the leader (i.e. stop serving produce and fetch
requests) returning the "not a leader for partition" exception? Or
does it keep thinking it's the leader?
If it's the latter, any connected consumers that wait for new requests
on that replica will do so in vain. Since the cluster controller will
elect a new partition leader, this particular replica will become
stale.
I would expect this node to do the former, but I'd like to check to
make sure. (I understand it's an edge case, and maybe not a realistic
one at that, but still.)
According to the documentation, more specifically the Distribution section:
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and
write requests for the partition while the followers passively
replicate the leader. If the leader fails, one of the followers will
automatically become the new leader. Each server acts as a leader
for some of its partitions and a follower for others so load is well
balanced within the cluster.
Considering that a loss of connection is one of the many kinds of failure, I'd say that your first hypothesis is more likely to happen.