How to select a leader in RAFT consensus protocol

Suppose a network of 5 nodes uses the Raft consensus protocol. Each node maintains a transaction log consisting of a list of log entries, and each log entry consists of an index and a term. All of the nodes mark themselves as leader candidates and send RequestVote(term, index). The current log entries (i.e. the list of term and index values) of all the candidate nodes are as follows:
Then who will become the leader?

(It's been long enough that I think this won't be used for homework anymore.)
Working through the problem.
Which candidates can each node vote for?
node 1: {1, 4, 5}
node 2: {1, 2, 3, 4, 5}
node 3: {1, 3, 4, 5}
node 4: {4}
node 5: {1, 4, 5}
Looking at the above, any of candidates 1, 4, 5 can have a majority of votes so any of those can become the leader.
candidate4 is not necessarily going to be the new leader because one of the other two may get the promised votes before it does.
As a practical application, imagine that candidate 4 was the leader of term 3 and then died. Either candidate 1 or 5 will pick up the baton and lead term 4.
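The rule that produces the vote sets above is the one from Section 5.4.1 of the Raft paper: a voter grants its vote only if the candidate's last log entry is at least as up-to-date as its own (higher last term wins; with equal terms, the longer log wins). A minimal sketch of that comparison in Java (the class and names are mine, not from the answer):

public final class VoteCheck {
    // Returns true if the candidate's log is at least as up-to-date as the voter's,
    // per Section 5.4.1 of the Raft paper.
    static boolean candidateLogIsUpToDate(long candidateLastTerm, long candidateLastIndex,
                                          long voterLastTerm, long voterLastIndex) {
        if (candidateLastTerm != voterLastTerm) {
            return candidateLastTerm > voterLastTerm; // higher last term wins
        }
        return candidateLastIndex >= voterLastIndex;  // same term: longer log wins
    }

    public static void main(String[] args) {
        // Hypothetical voter with last entry (term=3, index=5) rejects a candidate
        // whose last entry is (term=2, index=7).
        System.out.println(candidateLogIsUpToDate(2, 7, 3, 5)); // prints false
    }
}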

Understanding use case for max.in.flight.request property in Kafka

I'm building a Spring Boot consumer-producer project with Kafka as a middleman between two microservices. The theme of the project is a basketball game. Here is a small state machine diagram in which events are displayed. There will be many more different events; this is just a snippet.
Start event:
{
"id" : 5,
"actualStartTime" : "someStartTime"
}
Point event:
{
"game": 5,
"type": "POINT",
"payload": {
"playerId": 44,
"value": 3
}
}
Assist event:
{
"game": 4,
"type": "ASSIST",
"payload": {
"playerId": 278,
"value": 1
}
}
Jump event:
{
"game": 2,
"type": "JUMP",
"payload": {
"playerId": 55,
"value": 1
}
}
End event:
{
"id" : 5,
"endTime" : "someStartTime"
}
The main thing to note here is that if there was an Assist event it must be followed by a Point event.
Since I'm new to Kafka, I'll keep things simple and have one broker with one topic and one partition. For my use case I need to maintain the ordering of each of these events as they actually happen live on the court (I have a JSON file with 7000 lines and a bunch of these and other events).
So, let's say that from the Admin UI someone is sending these events (for instance via WebSockets) to the producer apps. The producer app will be doing some simple validation or whatever else it needs to do. Now, we can also imagine that we have two instances of the producer app, one at ip:8080 (prd1) and the other at ip:8081 (prd2).
In reality the sequence of these three events happened: Assist -> Point -> Jump. The operator on the court sent those three events in that order.
The Assist event was sent via prd1 and the Point event via prd2. Let's now imagine that there was a network glitch in communication between prd1 and the Kafka cluster. Since we are using the latest Kafka at the time of this writing, we already have enable.idempotence=true and the Assist event will not be sent twice.
During the retry of the Assist event on prd1 (towards Kafka), the Point event on prd2 passed successfully. Then the Assist event passed, and after it the Jump event (at either producer) also ended up in Kafka.
Now in the queue we have: Point -> Assist -> Jump. This is not allowed.
My question is whether these types of problems should be handled by the application's business logic (for example Spring State Machine) or whether this ordering can be handled by Kafka?
In case of the latter, is the property max.in.flight.requests.per.connection=1 responsible for ordering? Are there any other properties which might preserve ordering?
On a side note, is it a good tactic to use a single partition per match and multiple consumers for any of the partitions? Most probably I would be streaming different types of matches (basketball, soccer, golf, across different leagues and nations) and most of them will require some sort of ordering.
Maybe this can be done with KStreams but I'm still on Kafka's steep learning curve.
Update 1 (after Jessica Vasey's comments):
Hi, thanks for the very thorough comments. Unfortunately I didn't quite get all the pieces of the puzzle. What confuses me the most is some of the terminology you use and the order of things happening. I'm not saying it's not correct, just that I didn't understand it.
I'll have two microservices, so two producers. I've got to be able to understand Kafka in the microservices world, since I'm a Java Spring developer and it's all about microservices and multiple instances.
So let's say that on prd1 a few DTO events came along [Start -> Point -> Assist] and they are sent as a ProducerRequest (https://kafka.apache.org/documentation/#recordbatch), placed in the RECORDS field. On prd2 we got [Point -> Jump], also as a ProducerRequest. They are, in my understanding, two independent in-flight requests (out of 5 possible)? Is their ordering based on a timestamp?
So when joining the cluster, Kafka assigns an id to each producer, let's say '0' for prd1 and '1' for prd2 (I guess it also depends on the topic-partition they have been assigned). I don't understand whether each RecordBatch has its own monotonically increasing sequence number, or each Kafka message within a RecordBatch has its own monotonically increasing sequence number, or both? Also the 'time to recover' part is bugging me. Like, if I get an OutOfOrderSequenceException, does it mean that the [Point -> Jump] batch (with possibly other in-flight requests and other batches in the producer's buffer) will sit on Kafka until either delivery.timeout.ms expires or [Start -> Point -> Assist] is finally sent successfully?
Sorry for confusing you further, it's some complex logic you have! Hopefully, I can clarify some points for you. I assumed you had one producer, but after re-reading your post I see you have two producers.
You cannot guarantee the order of messages across both producers. You can only guarantee the order for each individual producer. This post explains it quite nicely: Kafka ordering with multiple producers on same topic and partition
On this question:
They are, in my understanding, two independent in-flight requests (out of 5 possible)? Is their ordering based on a timestamp?
Yes, each producer will have max.in.flight.requests.per.connection set to 5.
You could provide a timestamp in your producer, which could help with your situation. However, I won't go into too much detail on that right now and will first answer your questions.
I don't understand whether each RecordBatch has its own monotonically increasing sequence number, or each Kafka message within a RecordBatch has its own monotonically increasing sequence number, or both? Also the 'time to recover' part is bugging me. Like, if I get an OutOfOrderSequenceException, does it mean that the [Point -> Jump] batch (with possibly other in-flight requests and other batches in the producer's buffer) will sit on Kafka until either delivery.timeout.ms expires or [Start -> Point -> Assist] is finally sent successfully?
Each message is assigned a monotonically increasing sequence number. This LinkedIn post explains it better than I ever could!
Yes, other batches will sit on the producer until either the previous batch is acknowledged (which could be less than 2 mins) OR delivery.timeout.ms expires.
Even if max.in.flight.requests.per.connection > 1, setting enable.idempotence=true should preserve the message order, as this assigns the messages a sequence number. When a batch fails, all subsequent batches to the same partition fail with OutOfOrderSequenceException.
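To make that concrete, here is a minimal sketch of a producer configured for per-producer ordering (not from the original answer; the broker address and class name are made up, while the property names are the standard Kafka producer configs discussed above):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class OrderedProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Idempotence assigns each record a per-partition sequence number,
        // so retries from this producer cannot duplicate or reorder records.
        props.put("enable.idempotence", "true");
        props.put("acks", "all");                                // required for idempotence
        props.put("max.in.flight.requests.per.connection", "5"); // must stay <= 5 with idempotence
        props.put("delivery.timeout.ms", "120000");              // default: 2 minutes
        return new KafkaProducer<>(props);
    }
}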
Number of partitions should be determined by your target throughput. If you wanted to send basketball matches to one partition and golf to another, you can use keys to determine which message should be sent where.
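For the partition-per-match idea, a sketch (again mine, with a made-up topic name and the OrderedProducerConfig helper from above) of keying every event by its game id, so that all events of one match land on the same partition and keep their relative order:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GameEventSender {
    private final KafkaProducer<String, String> producer = OrderedProducerConfig.create();

    // With the game id as the record key, Kafka's default partitioner sends
    // all events of the same match to the same partition, preserving their order.
    public void send(long gameId, String eventJson) {
        producer.send(new ProducerRecord<>("game-events", Long.toString(gameId), eventJson));
    }
}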

How is this configuration in Raft algorithm correct

So let's suppose that the log configuration for 3 servers in the Raft algorithm is as follows:
S1 -> 3
S2 -> 3 3 4
S3 -> 3 3 5
This configuration can arise as follows: say S3 is the leader in term 3 and the first entry is committed on every replica; then, in another client operation with the same leader S3, it is only able to replicate the entry to S2 and itself before it crashes. After that S2 wins the election with votes from itself and S1, receives an entry, appends it to its log, and then crashes. S3 comes back, gets votes from S1 and itself, becomes leader, appends another entry in term 5, and then crashes.
Now we have a situation in which the entries from terms 4 and 5 are definitely not committed. Let's say S2 becomes leader again (getting votes from itself and S1). It will try to correct the logs of the followers and will end up overwriting and appending entries on both followers to get:
S1 -> 3 3 4
S2 -> 3 3 4
S3 -> 3 3 4
In my reasoning it is fair to remove the entry from term 5, because the leader could never have responded to the client with a done message, as the entry from term 5 was not replicated on a majority of servers. But isn't the same argument valid for the entry from term 4, and if so, why is it replicated everywhere? The client wouldn't have got a done response for the entry from term 4 either, so the client would think the state machine will not run this operation, yet through the above logic it does.
Someone care to explain please?
"After that S2 wins the election with votes from itself and S1. It gets an entry and enters it to the log"
When a node becomes leader, it is guaranteed to have the most up-to-date log; that log contains both proposed and committed entries.
Before the new leader processes requests, it pings every follower with an empty message; that message carries log information, from which the leader can conclude whether some followers are behind. For those that are behind, the leader ships all the missing entries.
At that point the leader and the majority have the same log, and the leader may continue with committing the uncommitted entries, and then accept new requests.
Check the original paper: https://raft.github.io/raft.pdf page 4 - AppendEntries RPC: Receiver implementation.
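For reference, a rough sketch of that receiver logic in Java (my own simplification of the rules listed in the paper, not code from the answer; it truncates everything after prevLogIndex, whereas the paper only deletes from the first conflicting entry onward):

import java.util.ArrayList;
import java.util.List;

// Simplified follower-side AppendEntries handler, following the receiver
// rules on page 4 of the Raft paper. Types and field names are illustrative.
class Follower {
    long currentTerm;
    List<Entry> log = new ArrayList<>(); // log.get(0) is index 1 in the paper's numbering
    long commitIndex;

    record Entry(long term, String command) {}

    boolean appendEntries(long term, long prevLogIndex, long prevLogTerm,
                          List<Entry> entries, long leaderCommit) {
        if (term < currentTerm) return false;                  // 1. reject a stale leader
        currentTerm = term;
        if (prevLogIndex > 0 &&
            (prevLogIndex > log.size() ||
             log.get((int) prevLogIndex - 1).term() != prevLogTerm)) {
            return false;                                       // 2. logs do not match yet
        }
        // 3./4. drop the conflicting suffix, then append the leader's entries
        while (log.size() > prevLogIndex) log.remove(log.size() - 1);
        log.addAll(entries);
        // 5. advance commitIndex as far as the leader allows
        if (leaderCommit > commitIndex) {
            commitIndex = Math.min(leaderCommit, log.size());
        }
        return true;
    }
}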

How does Raft preserve safety when a leader commits a log entry and crashes before informing followers of this commitment?

In my understanding, a leader sends an AppendEntries RPC to the followers, and if a majority of followers return success, the leader will commit this entry. It will commit the entry by applying it to its own state machine, and it will also respond to the client to let the client know that the command was successful.
However, at this time, this commitment is not known to the followers yet. It will inform the followers in the next AppendEntries (or heartbeat) RPC calls.
In the simplest case, if the leader crashes after the commitment and before the next AppendEntries, Raft will use the "only the most up-to-date follower can win" strategy to ensure that the next leader must contain this log entry (although not committed), and the new leader will commit this entry and send AppendEntries to the other followers. In this way, the log entry is safely kept.
However, consider the following complicated scenario (extracted from the PhD thesis "CONSENSUS: BRIDGING THEORY AND PRACTICE", page 23).
At this point, the log entry from term 2 has been replicated on a
majority of the servers, but it is not committed. If S1 crashes as in
(d1), S5 could be elected leader (with votes from S2, S3, and S4) and
overwrite the entry with its own entry from term 3.
What if, at this point, it is committed on server S1 but not committed on the other servers yet? If S1 then crashes as in (d1), will this log entry be overwritten by S5?
In my understanding, a committed entry (applied to the state machine, with the client possibly informed about the result) shall never be overwritten?
Did I misunderstand anything of the raft protocol?
Thanks.
There are more conditions in Raft to commit an entry.
On page 4 of this paper (the one-page summary of Raft) it says
Leaders:
...
If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm set commitIndex = N (§5.3, §5.4).
In other words, not only does the entry have to be replicated to a majority, its term has to be from the current term. This is why in a practical system a new leader will propose a no-op so that it can advance this commitIndex.
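A small sketch of that leader-side rule (my own, with simplified types; not code from the paper): the leader only advances commitIndex to an index that is replicated on a majority and whose entry is from its current term.

// Leader-side commit rule: advance commitIndex to the largest N that is
// replicated on a majority and whose entry is from the current term.
// matchIndex has one slot per follower; logTerms[i] holds the term of
// the leader's log entry at index i + 1.
class CommitRule {
    static long advanceCommitIndex(long commitIndex, long currentTerm,
                                   long[] matchIndex, long[] logTerms) {
        for (long n = logTerms.length; n > commitIndex; n--) {
            long replicas = 1;                        // count the leader itself
            for (long m : matchIndex) if (m >= n) replicas++;
            boolean majority = replicas * 2 > matchIndex.length + 1;
            if (majority && logTerms[(int) n - 1] == currentTerm) {
                return n;                             // the highest such N wins
            }
        }
        return commitIndex;                           // nothing new to commit yet
    }
}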
So now we know the leader won't commit until then, but what if it commits and doesn't send the commit?
Later in section 5.4.1 of the same paper it says (emphasis mine):
Raft...guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election....Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster
in order to be elected, which means that every committed entry must be present in at least one of those servers.
In short, the new leader by definition must have the entries that the old leader considered committed.

What happens to uncommitted previous term log entries in Raft algorithm?

There are a number of questions here on StackOverflow around Figure 8, discussed in section 5.4.2 in the original Raft paper:
Figure 8
What has not been made clear by the paper or by any of the answers is the exact fate of the problematic entry (2, 3). My question is two-fold:
What exactly happens to the entry at index 2 during term 3, i.e. (2, 3), made by S5? The figure mentions that S5 will not become a leader because the majority will reject its RequestVote RPCs. Does that mean that upon receiving an AppendEntries RPC, S5 will then overwrite its entry (2, 3) with (2, 2) and (3, 4) as per the current leader in (e)?
If S5 is forced to overwrite this entry, and it is never committed, what response should the client that sent (1, 3) receive? Do clients receive acknowledgements for uncommitted entries as if they were already applied to the state machine?
The figure mentions that S5 will not become a leader because majority
will reject its RequestVotes
As in (e) in the Raft paper, S5 will not become a leader because the log of S5 is not at least as up-to-date as the logs of the majority (S1, S2, S3).
Does that mean that upon receiving AppendEntries RPC, S5 will then
overwrite its entry (2, 3) with (2, 2) and (3, 4) as per current
leader in (e)?
Yes, the log of S5 will be overwritten by the log of the current leader. Quoted from the Raft paper:
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).
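A rough sketch of that leader-side repair loop (my own, with simplified types and a stand-in for the RPC; not code from the paper): the leader walks nextIndex backwards until the follower's log matches, then ships everything from that point on.

import java.util.List;

// Simplified leader-side log repair: decrement nextIndex on each rejection and
// retry until the AppendEntries consistency check passes; the follower then
// replaces its conflicting suffix with the leader's entries.
class LeaderRepair {
    record Entry(long term, String command) {}

    // Stand-in for the AppendEntries RPC to one follower.
    interface FollowerStub {
        boolean append(long prevLogIndex, long prevLogTerm, List<Entry> suffix);
    }

    static void repair(List<Entry> leaderLog, FollowerStub follower, long nextIndex) {
        while (nextIndex >= 1) {
            long prevLogIndex = nextIndex - 1;
            long prevLogTerm = prevLogIndex == 0 ? 0
                    : leaderLog.get((int) prevLogIndex - 1).term();
            List<Entry> suffix = leaderLog.subList((int) nextIndex - 1, leaderLog.size());
            if (follower.append(prevLogIndex, prevLogTerm, suffix)) {
                return;        // logs now match; any conflicting entries were removed
            }
            nextIndex--;       // back up one entry and try again
        }
    }
}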
Do the clients receive acknowledgements for uncommitted entries as if
they were already applied to a state-machine?
No, the clients only receive acknowledgements for committed entries, once the entry has been safely replicated. Please see this quote from the Raft paper:
When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that
execution to the client.
There is also the case where the leader has replicated the log entry but crashes before responding to the client, or the response is lost on the network; the client then needs to retry, causing the command to be executed multiple times. Quoted from the Raft paper:
However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest
serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request
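A minimal sketch of that deduplication idea (my own, not the paper's pseudocode): the state machine remembers the latest serial number and response per client and replays the stored response on a retry instead of executing the command again.

import java.util.HashMap;
import java.util.Map;

// State-machine-side deduplication: remember the latest serial number and
// response per client; a retried command is answered from the cache instead
// of being executed a second time.
class DedupStateMachine {
    record Session(long lastSerial, String lastResponse) {}

    private final Map<String, Session> sessions = new HashMap<>();

    String apply(String clientId, long serial, String command) {
        Session s = sessions.get(clientId);
        if (s != null && serial <= s.lastSerial()) {
            return s.lastResponse();             // duplicate: replay the cached result
        }
        String response = execute(command);      // actually run the command
        sessions.put(clientId, new Session(serial, response));
        return response;
    }

    private String execute(String command) {
        return "applied:" + command;             // placeholder for the real state machine
    }
}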

How does a Raft follower rejoin after being disconnected from the network?

I have a question about Raft.
In the paper "In Search of an Understandable Consensus Algorithm (Extended Version)" it says:
To begin an election, a follower increments its current
term and transitions to candidate state. (in section 5.2)
and it also says:
the receiver should "Reply false if args.term < currentTerm" in both the AppendEntries RPC and the RequestVote RPC.
So consider this scenario: there are 5 machines in the Raft system, machine 0 is the leader, machines 1 to 4 are followers, and the current term is 1. Suddenly machine 1 gets disconnected from the network; it times out and begins a leader election, sending RequestVote RPCs, which of course fail (the network is disconnected). Then it begins a new election, and so on, so machine 1's term is incremented many times, maybe up to 10. When machine 1's term has reached 10, it reconnects to the network. The leader (machine 0) sends a heartbeat to machine 1, and machine 1 will REJECT the heartbeat (machine 0's term is less than machine 1's). Now machine 1 will not be able to rejoin the system.
The important thing to remember here is that when a node receives a greater term it always updates its local term. So, since machine 1 will reject the leader's request, the leader will ultimately learn about the higher term (10) and step down, and then a new leader will be elected for a term > 10.
Obviously this is inefficient, which is why most real-world implementations use the so-called "pre-vote" protocol, checking to ensure a node can win an election before it transitions to the candidate role and increments the term.
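For illustration, a rough sketch of that pre-vote idea (my own; the paper itself does not specify PreVote): before incrementing its term, a node asks its peers whether they would vote for it at the next term, and only starts a real election if a majority says yes, so a partitioned node never inflates its term.

import java.util.List;

// Sketch of the PreVote idea: a disconnected node keeps failing the pre-vote,
// never increments its term, and therefore cannot disrupt the cluster on rejoin.
class PreVoteCandidate {
    interface Peer {
        // Would this peer grant a vote for a candidate with this proposed term and log?
        // A real implementation would also require the peer's own election timer to have expired.
        boolean preVote(long proposedTerm, long lastLogTerm, long lastLogIndex);
    }

    long currentTerm;
    long lastLogTerm, lastLogIndex;

    boolean shouldStartElection(List<Peer> peers) {
        long proposedTerm = currentTerm + 1;   // only proposed; currentTerm is NOT incremented yet
        long granted = 1;                      // our own pre-vote
        for (Peer p : peers) {
            if (p.preVote(proposedTerm, lastLogTerm, lastLogIndex)) granted++;
        }
        // Become a real candidate (and bump currentTerm) only with a majority of pre-votes.
        return granted * 2 > peers.size() + 1;
    }
}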